A Super-Friendly Way to Download TCGA Data!

admin 29 2025-01-20

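The TCGA database is an essential resource for research analysis. Downloading the data from the official site is not hard in itself, but assembling the pieces afterwards is a real headache! Not being much of a programmer, I struggled with it for a whole day and got nowhere, and came very close to throwing my laptop across the room...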

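Giving up was not an option, though, so while quietly falling apart I searched everywhere for a solution and found an R package: TCGAbiolinks. It is a third-party tool recommended by GDC that downloads data through the official GDC API, which keeps the data current and accurate. It also offers data preparation, clustering, differential analysis, enrichment analysis and more. It looked promising, so I gave the download features a first try.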

First, install and load TCGAbiolinks. The package needs a fairly recent version of R; 3.4 or later is recommended.

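# Install (BiocManager replaces the retired biocLite script)
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("TCGAbiolinks")
# Load
library(TCGAbiolinks)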
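Before installing, it can be worth checking that the running R meets the version floor mentioned above and whether the package is already there. The snippet below is a minimal sketch using base R only; the helper name `meets_min_version` is made up for this illustration and is not part of TCGAbiolinks.

```r
# Hypothetical helper (not a TCGAbiolinks function): compare the running
# R version against a minimum version given as a string.
meets_min_version <- function(min_version = "3.4.0") {
  getRversion() >= package_version(min_version)
}

if (!meets_min_version("3.4.0")) {
  warning("R is older than 3.4.0; TCGAbiolinks may fail to install.")
}

# requireNamespace() returns FALSE instead of raising an error when a
# package is missing, so it is a safe way to test for an installation.
has_tcgabiolinks <- requireNamespace("TCGAbiolinks", quietly = TRUE)
message("TCGAbiolinks installed: ", has_tcgabiolinks)
```

If `has_tcgabiolinks` comes back `FALSE`, the installation step has not succeeded yet.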

1. Expression profile data

# Three workflow types can be downloaded: "HTSeq - Counts", "HTSeq - FPKM-UQ", "HTSeq - FPKM"
# (newer GDC data releases use "STAR - Counts" instead, so this value may need updating)
query <- GDCquery(project = "TCGA-GBM",                 # cancer type
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - FPKM-UQ")    # or "HTSeq - Counts", "HTSeq - FPKM"
GDCdownload(query)
data <- GDCprepare(query)

2. Methylation data

# Platform: "Illumina Human Methylation 450" or "Illumina Human Methylation 27"
# (legacy = TRUE targets the GDC legacy archive; recent TCGAbiolinks versions may no longer accept it)
query <- GDCquery(project = "TCGA-GBM",
                  legacy = TRUE,
                  data.category = "DNA methylation",
                  platform = "Illumina Human Methylation 450")
GDCdownload(query)
data <- GDCprepare(query)

# Methylation IDAT files
query <- GDCquery(project = "TCGA-GBM",
                  data.category = "Raw microarray data",
                  data.type = "Raw intensities",
                  experimental.strategy = "Methylation array",
                  legacy = TRUE,
                  file.type = ".idat",
                  platform = "Illumina Human Methylation 450")
GDCdownload(query)
data <- GDCprepare(query)
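Methylation tables usually contain missing values, and a common first step before any downstream analysis is to drop probes that were not measured in every sample. This step is not part of the original walkthrough; the sketch below runs on a small made-up beta-value matrix (probes in rows, samples in columns) rather than on the downloaded object, and the probe and sample names are placeholders.

```r
# Mock beta-value matrix (probes x samples); all values are made up.
beta <- matrix(c(0.12, NA, 0.85,    # sampleA
                 0.40, 0.33, 0.91), # sampleB
               nrow = 3,
               dimnames = list(c("cg00000029", "cg00000108", "cg00000109"),
                               c("sampleA", "sampleB")))

# TRUE for probes that have no missing value in any sample.
complete_probes <- rowSums(is.na(beta)) == 0

# drop = FALSE keeps the result a matrix even if only one probe survives.
beta_complete <- beta[complete_probes, , drop = FALSE]
print(beta_complete)
```

With real data the same filter would be applied to the beta-value matrix extracted from the prepared object, assuming that object exposes one.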

3. miRNA

query <- GDCquery(project = "TCGA-GBM",
                  data.category = "Transcriptome Profiling",
                  data.type = "miRNA Expression Quantification")
GDCdownload(query)
data <- GDCprepare(query)

4. Copy number variation

query <- GDCquery(project = "TCGA-GBM",
                  data.category = "Copy Number Variation",
                  data.type = "Copy Number Segment")
GDCdownload(query)
data <- GDCprepare(query)

5. Clinical data

clinical <- GDCquery_clinic(project = "TCGA-GBM",
                            type = "clinical",
                            save.csv = TRUE)   # can write the file out directly

# Choose a subset of columns by name; intersect() skips any the table lacks
select <- c("submitter_id", "gender", "year_of_birth", "days_to_death",
            "vital_status", "tumor_grade", "tumor_stage")
clinical_select <- clinical[, intersect(select, colnames(clinical)), drop = FALSE]
write.table(clinical_select, file = "GBM_clinical.txt", sep = "\t", row.names = FALSE)

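For the expression, methylation, miRNA and copy number data, each of the steps above leaves you with a `data` object ready for downstream analysis. Of course, the data can also be saved to disk.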

# Expression data as the example; the other data types work the same way
library(TCGAbiolinks)
library(SummarizedExperiment)
setwd("D:/gdc/")   # set the working directory
query <- GDCquery(project = "TCGA-GBM",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - FPKM-UQ")
GDCdownload(query)
expdat <- GDCprepare(query)
expr_matrix <- assay(expdat)   # renamed from "matrix" to avoid masking base::matrix
# Shorten "TCGA-14-0736-02A-01R-2005-01" to "TCGA-14-0736-02A"
colnames(expr_matrix) <- substring(colnames(expdat), 1, 16)
write.table(expr_matrix, file = "GBM_expression_FPKM-UQ.txt", sep = "\t")   # write the file
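The 16-character cut works because TCGA barcodes are dash-separated fields, and the fourth field starts with a two-digit sample type code (01 to 09 for tumor, 10 to 19 for normal, 20 to 29 for control samples). The sketch below shows one way to pull those pieces apart with base R. `parse_barcode` is a made-up helper for illustration, not a TCGAbiolinks function, and the second barcode is only an example value.

```r
# Example barcodes; the first comes from the text, the second is illustrative.
barcodes <- c("TCGA-14-0736-02A-01R-2005-01",
              "TCGA-06-0675-11A-32R-A36H-07")

# Hypothetical helper: split barcodes into fields and label each sample
# as tumor, normal or control from its two-digit sample type code.
parse_barcode <- function(x) {
  fields <- strsplit(x, "-", fixed = TRUE)
  sample_code <- vapply(fields, function(f) substr(f[4], 1, 2), character(1))
  code_num <- as.integer(sample_code)
  data.frame(
    barcode = x,
    patient = substr(x, 1, 12),   # e.g. "TCGA-14-0736"
    sample = substr(x, 1, 16),    # e.g. "TCGA-14-0736-02A"
    sample_code = sample_code,
    group = ifelse(code_num <= 9, "tumor",
                   ifelse(code_num <= 19, "normal", "control")),
    stringsAsFactors = FALSE
  )
}

info <- parse_barcode(barcodes)
print(info)
```

One thing to watch: truncating to 16 characters can produce duplicate column names when a sample has several aliquots, and `anyDuplicated(info$sample)` is a quick check for that.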

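The exported file holds an expression matrix (genes in rows, sample barcodes in columns) like the one shown in the figure below.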
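When a file written this way is loaded again later, `read.table()` by default rewrites the dashes in the barcodes as dots. The sketch below shows a round trip that avoids this, using a tiny made-up matrix and a temporary file in place of the real export; the gene IDs and values are placeholders.

```r
# Mock expression matrix standing in for the exported one; values are made up.
expr <- matrix(c(1.5, 0, 3.2, 4.1), nrow = 2,
               dimnames = list(c("ENSG00000000003", "ENSG00000000005"),
                               c("TCGA-14-0736-02A", "TCGA-06-0675-11A")))

out_file <- tempfile(fileext = ".txt")
write.table(expr, file = out_file, sep = "\t", quote = FALSE)

# The header row has one field fewer than the data rows, so read.table()
# uses the first column as row names.
# check.names = FALSE keeps the dashes in the barcodes intact.
expr_back <- as.matrix(read.table(out_file, sep = "\t", header = TRUE,
                                  check.names = FALSE))
print(colnames(expr_back))
```

Leaving `check.names` at its default would give column names such as `TCGA.14.0736.02A`, which no longer match the barcodes in the clinical table.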

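TCGAbiolinks can also take the analysis further, with differential analysis, enrichment analysis and so on. I will cover those once I have dug into them properly.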

In 2019, meet a better you.
