[R] The TCGAbiolinks package: data preparation with query, download, and prepare

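TCGAbiolinks is a one-stop R package for analysing TCGA data: it covers the whole workflow of downloading, analysing, and visualising TCGA data. This series of notes follows the TCGAbiolinks help documentation to relearn the TCGA data-mining workflow.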

I. Finding the TCGA data of interest

  • GDCquery()
GDCquery(
  project,
  data.category,
  data.type,
  workflow.type,
  legacy = FALSE,
  access,
  platform,
  file.type,
  barcode,
  data.format,
  experimental.strategy,
  sample.type
)

1. Available parameters

1.1 By tumor type

  • The project parameter specifies one or more TCGA project names of interest
  • As the code below shows, there are 33 TCGA cancer types in total
projects = TCGAbiolinks:::getGDCprojects()$project_id
TCGAs = grep("TCGA", projects, value = TRUE)
sort(TCGAs)
# [1] "TCGA-ACC"  "TCGA-BLCA" "TCGA-BRCA" "TCGA-CESC" "TCGA-CHOL" "TCGA-COAD"
# [7] "TCGA-DLBC" "TCGA-ESCA" "TCGA-GBM"  "TCGA-HNSC" "TCGA-KICH" "TCGA-KIRC"
# [13] "TCGA-KIRP" "TCGA-LAML" "TCGA-LGG"  "TCGA-LIHC" "TCGA-LUAD" "TCGA-LUSC"
# [19] "TCGA-MESO" "TCGA-OV"   "TCGA-PAAD" "TCGA-PCPG" "TCGA-PRAD" "TCGA-READ"
# [25] "TCGA-SARC" "TCGA-SKCM" "TCGA-STAD" "TCGA-TGCT" "TCGA-THCA" "TCGA-THYM"
# [31] "TCGA-UCEC" "TCGA-UCS"  "TCGA-UVM" 
Study Abbreviation  Study Name
ACC                 Adrenocortical carcinoma
BLCA                Bladder Urothelial Carcinoma
BRCA                Breast invasive carcinoma
CESC                Cervical squamous cell carcinoma and endocervical adenocarcinoma
CHOL                Cholangiocarcinoma
COAD                Colon adenocarcinoma
DLBC                Lymphoid Neoplasm Diffuse Large B-cell Lymphoma
ESCA                Esophageal carcinoma
GBM                 Glioblastoma multiforme
HNSC                Head and Neck squamous cell carcinoma
KICH                Kidney Chromophobe
KIRC                Kidney renal clear cell carcinoma
KIRP                Kidney renal papillary cell carcinoma
LAML                Acute Myeloid Leukemia
LGG                 Brain Lower Grade Glioma
LIHC                Liver hepatocellular carcinoma
LUAD                Lung adenocarcinoma
LUSC                Lung squamous cell carcinoma
MESO                Mesothelioma
OV                  Ovarian serous cystadenocarcinoma
PAAD                Pancreatic adenocarcinoma
PCPG                Pheochromocytoma and Paraganglioma
PRAD                Prostate adenocarcinoma
READ                Rectum adenocarcinoma
SARC                Sarcoma
SKCM                Skin Cutaneous Melanoma
STAD                Stomach adenocarcinoma
TGCT                Testicular Germ Cell Tumors
THCA                Thyroid carcinoma
THYM                Thymoma
UCEC                Uterine Corpus Endometrial Carcinoma
UCS                 Uterine Carcinosarcoma
UVM                 Uveal Melanoma

1.2 hg19/hg38

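  • The data come in two sets that differ mainly in the reference genome: the GDC Legacy Archive (mostly GRCh37/hg19) and the GDC harmonized database (GRCh38/hg38).
  • The legacy parameter selects between them. The default, FALSE, queries the harmonized database (hg38); TRUE queries the Legacy Archive (hg19).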

1.3 Data types to download

On top of the parameters above, the following parameters describe the type of data we are after.

  • data.category = which kind of data to download, e.g. omics data, clinical data, ...
# List the data categories available for one tumor type
TCGAbiolinks:::getProjectSummary("TCGA-BRCA")$data_categories
#   file_count case_count               data_category
# 1       4679       1098            Sequencing Reads
# 2       1183       1098                    Clinical
# 3       6627       1098       Copy Number Variation
# 4       5315       1098                 Biospecimen
# 5       1234       1095             DNA Methylation
# 6       6080       1097     Transcriptome Profiling
# 7       8648       1044 Simple Nucleotide Variation
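  • data.type = a finer-grained choice of data type (optional)
  • workflow.type = the same sequencing data may have been processed by different pipelines (optional, harmonized only)
  • platform = sequencing platform (optional)
  • file.type = the specific data file (optional, legacy only)
    If you do not know these values for your target data, refer to the overview below.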
GDC harmonized database
Data.category                Data.type                          Workflow.Type                                  Platform
Transcriptome Profiling      Gene Expression Quantification     HTSeq - Counts
Transcriptome Profiling      Gene Expression Quantification     HTSeq - FPKM
Transcriptome Profiling      Gene Expression Quantification     HTSeq - FPKM-UQ
Transcriptome Profiling      Gene Expression Quantification     STAR - Counts
Transcriptome Profiling      Isoform Expression Quantification  -
Transcriptome Profiling      miRNA Expression Quantification    -
Transcriptome Profiling      Splice Junction Quantification
Copy number variation        Copy Number Segment
Copy number variation        Masked Copy Number Segment
Copy number variation        Gene Level Copy Number Scores
Simple Nucleotide Variation  Masked Somatic Mutation            MuSE Variant Aggregation and Masking
Simple Nucleotide Variation  Masked Somatic Mutation            MuTect2 Variant Aggregation and Masking
Simple Nucleotide Variation  Masked Somatic Mutation            SomaticSniper Variant Aggregation and Masking
Simple Nucleotide Variation  Masked Somatic Mutation            VarScan2 Variant Aggregation and Masking
Raw Sequencing Data          -
Biospecimen                  Slide Image
Biospecimen                  Biospecimen Supplement
Clinical                     -
DNA Methylation              Methylation Beta Value                                                            Illumina Human Methylation 450
DNA Methylation              Methylation Beta Value                                                            Illumina Human Methylation 27
GDC Legacy Archive
Data.category                Data.type                          Platform                             file.type
Copy number variation        -                                  Affymetrix SNP Array 6.0             nocnv_hg18.seg
Copy number variation        -                                  Affymetrix SNP Array 6.0             hg18.seg
Copy number variation        -                                  Affymetrix SNP Array 6.0             nocnv_hg19.seg
Copy number variation        -                                  Affymetrix SNP Array 6.0             hg19.seg
Copy number variation        -                                  Illumina HiSeq                       -
Simple nucleotide variation  Simple somatic mutation
Raw sequencing data
Biospecimen
Clinical
Protein expression                                              MDA RPPA Core                        -
Gene expression              Gene expression quantification     Illumina HiSeq                       normalized_results
Gene expression              Gene expression quantification     Illumina HiSeq                       results
Gene expression              Gene expression quantification     HT_HG-U133A                          -
Gene expression              Gene expression quantification     AgilentG4502A_07_2                   -
Gene expression              Gene expression quantification     AgilentG4502A_07_1                   -
Gene expression              Gene expression quantification     HuEx-1_0-st-v2                       FIRMA.txt
Gene expression              Gene expression quantification                                          gene.txt
Gene expression              Isoform expression quantification  -                                    -
Gene expression              miRNA gene quantification          -                                    hg19.mirna
Gene expression              miRNA gene quantification                                               hg19.mirbase20
Gene expression              miRNA gene quantification                                               mirna
Gene expression              Exon junction quantification       -                                    -
Gene expression              Exon quantification                -                                    -
Gene expression              miRNA isoform quantification       -                                    hg19.isoform
Gene expression              miRNA isoform quantification       -                                    isoform
DNA methylation                                                 Illumina Human Methylation 450       Not used
DNA methylation                                                 Illumina Human Methylation 27        Not used
DNA methylation                                                 Illumina DNA Methylation OMA003 CPI  Not used
DNA methylation                                                 Illumina DNA Methylation OMA002 CPI  Not used
DNA methylation                                                 Illumina Hi Seq
DNA methylation              Bisulfite sequence alignment
DNA methylation              Methylation percentage
DNA methylation              Aligned reads
Raw microarray data          Raw intensities                    Illumina Human Methylation 450       idat
Raw microarray data          Raw intensities                    Illumina Human Methylation 27        idat
Structural Rearrangement
Other

1.4 Sample barcodes

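A full barcode looks like TCGA-G4-6317-02A-11D-2064-05. It encodes everything from the patient of origin through to sequencing and analysis. The three most important parts are Participant, Sample, and Portion, which give the patient ID, the sample type, and the sequencing type (the analyte, e.g. D for DNA), respectively.
Patient ID: e.g. TCGA-G4-6317
Sample ID: e.g. TCGA-G4-6317-02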
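
The barcode layout described above can be unpacked with base R string functions. The following is a minimal sketch rather than part of TCGAbiolinks: the helper name parse_barcode and the small lookup vector of sample-type codes are introduced here for illustration, and the lookup lists only a few common codes.

```r
# Hypothetical helper (not a TCGAbiolinks function): split a full TCGA barcode
# into the pieces discussed above.
parse_barcode <- function(barcode) {
  parts <- strsplit(barcode, "-", fixed = TRUE)[[1]]
  sample_code <- substr(parts[4], 1, 2)   # two-digit sample type, e.g. "02"
  list(
    patient     = paste(parts[1:3], collapse = "-"),
    sample      = paste(c(parts[1:3], sample_code), collapse = "-"),
    sample_code = sample_code,
    vial        = substr(parts[4], 3, 3),
    portion     = substr(parts[5], 1, 2),
    analyte     = substr(parts[5], 3, 3)  # e.g. "D" for DNA, "R" for RNA
  )
}

# A few common sample-type codes and their short letter codes
sample_types <- c("01" = "TP", "02" = "TR", "06" = "TM", "10" = "NB", "11" = "NT")

info <- parse_barcode("TCGA-G4-6317-02A-11D-2064-05")
info$patient                            # "TCGA-G4-6317"
info$sample                             # "TCGA-G4-6317-02"
unname(sample_types[info$sample_code])  # "TR" (recurrent solid tumor)
```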

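  • Most important is the two-digit Sample code, which gives the sample type and later serves as the grouping variable for differential analysis. For example, 01 denotes the patient's primary solid tumor and 11 denotes normal tissue from the patient.

  • With this in mind, we can also set the sample.type parameter to download only the sample types of interest, e.g. sample.type = "Primary Tumor".

  • Given a set of TCGA barcodes, TCGAquery_SampleTypes() extracts the samples of a target group, and TCGAquery_MatchedCoupledSampleTypes() extracts paired samples that come from the same patient.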

query <- GDCquery(project = c("TCGA-BRCA"),
                  legacy = FALSE, #default(GDC harmonized database)
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts")
dim(getResults(query))
#[1] 1222   29
query_info = getResults(query)
TP = TCGAquery_SampleTypes(query_info$sample.submitter_id,"TP")
NT = TCGAquery_SampleTypes(query_info$sample.submitter_id,"NT")
query <- GDCquery(project = c("TCGA-BRCA"),
                  legacy = FALSE, #default(GDC harmonized database)
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts",
                  barcode = c(TP, NT))
dim(getResults(query))
#[1] 1215   29

Pair_sample = TCGAquery_MatchedCoupledSampleTypes(query_info$sample.submitter_id,c("NT","TP"))
query <- GDCquery(project = c("TCGA-BRCA"),
                  legacy = FALSE, #default(GDC harmonized database)
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts",
                  barcode = Pair_sample)
dim(getResults(query))
#[1] 229  29
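
For intuition, the kind of selection that TCGAquery_SampleTypes() and TCGAquery_MatchedCoupledSampleTypes() perform can be approximated in base R by reading the sample-type code at characters 14-15 of each barcode. This is a simplified sketch under that assumption, not the package's implementation, and the barcodes are made up for illustration.

```r
# Made-up sample barcodes for illustration (not real TCGA samples)
barcodes <- c("TCGA-XX-0001-01A", "TCGA-XX-0001-11A",
              "TCGA-XX-0002-01A", "TCGA-XX-0003-11A",
              "TCGA-XX-0004-01A", "TCGA-XX-0004-11B")

code    <- substr(barcodes, 14, 15)  # sample-type code
patient <- substr(barcodes, 1, 12)   # patient ID

tp <- barcodes[code == "01"]         # primary tumor
nt <- barcodes[code == "11"]         # solid tissue normal

# Patients that have both a tumor and a normal sample
paired_patients <- intersect(patient[code == "01"], patient[code == "11"])
paired <- barcodes[patient %in% paired_patients & code %in% c("01", "11")]
paired
```

Here only patients 0001 and 0004 contribute to paired, since the other two have a tumor or a normal sample but not both.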

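These are the most common criteria for querying TCGA data. A few parameters were not covered here; see the function's help page for them. Combine the parameters above as your own purpose requires.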

2. Query examples

2.1 Cholangiocarcinoma transcriptome data | hg19 | all samples

TCGAbiolinks:::getProjectSummary("TCGA-CHOL",legacy = TRUE)$data_categories
#   file_count case_count               data_category
# 1         30         30          Protein expression
# 2        680         36       Copy number variation
# 3         51         51                 Biospecimen
# 4        444         36 Simple nucleotide variation
# 5        450         36             Gene expression
# 6        686         36         Raw microarray data
# 7         45         36             DNA methylation
# 8        193         51                    Clinical
# 9        365         51         Raw sequencing data
query <- GDCquery(project = "TCGA-CHOL",
                  legacy = TRUE,
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  platform = "Illumina HiSeq", 
                  file.type  = "normalized_results")
dim(getResults(query))
#[1] 45 32
t(getResults(query)[1,])
#                       1                                                                                   
# id                    "34216957-50e3-434c-8c38-72f0f2ddcf16"                                              
# data_format           "TXT"                                                                               
# access                "open"                                                                              
# cases                 "TCGA-3X-AAV9-01A-72R-A41I-07"                                                      
# file_name             "unc.edu.59012a78-0e8f-4b99-af97-0dbb1d3d0513.2538862.rsem.genes.normalized_results"
# submitter_id          NA                                                                                  
# data_category         "Gene expression"                                                                   
# type                  "file"                                                                              
# file_size             437196                                                                              
# platform              "Illumina HiSeq"                                                                    
# state_comment         NA                                                                                  
# tags                  character,3                                                                         
# updated_datetime      "2017-03-05T10:11:44.298823-06:00"                                                  
# md5sum                "23836c9f9bdb053c567d91a67b62159d"                                                  
# file_id               "34216957-50e3-434c-8c38-72f0f2ddcf16"                                              
# data_type             "Gene expression quantification"                                                    
# state                 "live"                                                                              
# experimental_strategy "RNA-Seq"                                                                           
# file_state            "submitted"                                                                         
# version               "1"                                                                                 
# data_release          "0.0 - 29.0"                                                                        
# project               "TCGA-CHOL"                                                                         
# center_id             "ee7a85b3-8177-5d60-a10c-51180eb9009c"                                              
# center_center_type    "CGCC"                                                                              
# center_code           "07"                                                                                
# center_name           "University of North Carolina"                                                      
# center_namespace      "unc.edu"                                                                           
# center_short_name     "UNC"                                                                               
# sample_type           "Primary Tumor"                                                                     
# is_ffpe               FALSE                                                                               
# cases.submitter_id    "TCGA-3X-AAV9"                                                                      
# sample.submitter_id   "TCGA-3X-AAV9-01A"

2.2 Lung adenocarcinoma transcriptome data | hg38 | primary tumor + normal tissue

TCGAbiolinks:::getProjectSummary("TCGA-LUAD",legacy = FALSE)$data_categories
# 4       2916        519     Transcriptome Profiling
query <- GDCquery(project = "TCGA-LUAD",
                  legacy = FALSE,
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts")
dim(getResults(query))
#[1] 594  29
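
The heading of this example mentions primary tumor plus normal tissue, yet the query above does not filter by sample type. One way to restrict it is the sample.type parameter of GDCquery() listed earlier. In the sketch below the wrapper name query_tumor_normal is hypothetical, and the default workflow.type is the one used in this article; the overview table also lists "STAR - Counts", so the value your GDC release accepts may differ.

```r
# Hypothetical wrapper around GDCquery(); it needs the TCGAbiolinks package
# and network access when it is called.
query_tumor_normal <- function(project, workflow = "HTSeq - Counts") {
  TCGAbiolinks::GDCquery(
    project       = project,
    data.category = "Transcriptome Profiling",
    data.type     = "Gene Expression Quantification",
    workflow.type = workflow,
    sample.type   = c("Primary Tumor", "Solid Tissue Normal")
  )
}

# Usage (not run here):
# query <- query_tumor_normal("TCGA-LUAD")
# table(TCGAbiolinks::getResults(query)$sample_type)
```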

2.3 Breast cancer methylation data | hg19 | Illumina Human Methylation 450 platform

TCGAbiolinks:::getProjectSummary("TCGA-BRCA",legacy = TRUE)$data_categories
#7       1250       1097             DNA methylation
query <- GDCquery(project = "TCGA-BRCA",
                  legacy = TRUE,
                  data.category = "DNA methylation",
                  platform = "Illumina Human Methylation 450")
dim(getResults(query))
#[1] 895  32

II. Downloading data for the chosen query

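  • GDCdownload() is simple to use: just pass it the query obtained in the previous step.
  • Two download methods are available, api and client. The former is faster but sometimes unstable; the latter is slower. The api method (the default) is recommended. For large downloads, set files.per.chunk = n to download in batches of n files each, so that an error midway does not waste everything downloaded so far.
  • directory is the folder to download into; by default a GDCdata folder is created and used.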
GDCdownload(
  query,
  token.file,
  method = "api",
  directory = "GDCdata",
  files.per.chunk = NULL
)
  • Example
query <- GDCquery(project = "TCGA-CHOL",
                  legacy = TRUE,
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  platform = "Illumina HiSeq", 
                  file.type  = "normalized_results")
GDCdownload(query, files.per.chunk = 10)
# Downloading data for project TCGA-CHOL
# GDCdownload will download 45 files. A total of 19.580796 MB
# Downloading chunk 1 of 5 (10 files, size = 4.351703 MB) as Wed_Aug_18_21_52_08_2021_0.tar.gz
# Downloading: 1.9 MB     Downloading chunk 2 of 5 (10 files, size = 4.350318 MB) as Wed_Aug_18_21_52_08_2021_1.tar.gz
# Downloading: 1.8 MB     Downloading chunk 3 of 5 (10 files, size = 4.351067 MB) as Wed_Aug_18_21_52_08_2021_2.tar.gz
# Downloading: 1.8 MB     Downloading chunk 4 of 5 (10 files, size = 4.353528 MB) as Wed_Aug_18_21_52_08_2021_3.tar.gz
# Downloading: 1.9 MB     Downloading chunk 5 of 5 (5 files, size = 2.17418 MB) as Wed_Aug_18_21_52_08_2021_4.tar.gz
# Downloading: 900 kB
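
The chunk sizes in the log follow directly from files.per.chunk: 45 files in batches of 10 give four full chunks and one chunk of 5. The arithmetic can be checked in base R; this only mirrors the batching and says nothing about how GDCdownload() is implemented internally.

```r
n_files         <- 45
files_per_chunk <- 10

# Assign each file index to a chunk, then count the files per chunk
chunk_id    <- ceiling(seq_len(n_files) / files_per_chunk)
chunk_sizes <- as.vector(table(chunk_id))

length(chunk_sizes)  # 5 chunks
chunk_sizes          # 10 10 10 10 5
```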

III. Reading the downloaded files into the current session

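  • Given the query object and the directory where the data were downloaded (GDCdata by default), GDCprepare() reads the data and returns it as a SummarizedExperiment object.
  • You can also set save = TRUE and save.filename = ... so that, after reading, the SummarizedExperiment object is automatically saved as an Rdata file for convenient reuse later (save defaults to FALSE).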
query <- GDCquery(project = "TCGA-CHOL",
                  legacy = TRUE,
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  platform = "Illumina HiSeq", 
                  file.type  = "normalized_results")
GDCdownload(query, files.per.chunk = 10)
data <- GDCprepare(query, save = T, save.filename = "CHOL_RNAseq.rda")
# -------------------
#   oo Reading 45 files
# -------------------
#   |=================================================|100%                      Completed after 0 s 
# -------------------
#   oo Merging 45 files
# -------------------
#   Starting to add information to samples
# => Add clinical information to samples
# => Adding TCGA molecular information from marker papers
# => Information will have prefix 'paper_' 
# chol subtype information from:doi:10.1016/j.celrep.2017.02.033
# => Saving file: CHOL_RNAseq.rda
# => File saved
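
Because save = TRUE writes an .rda file, a later session can restore the object with load() instead of repeating GDCprepare(). The sketch below only demonstrates the mechanism with a stand-in object and a temporary file. That the saved object is called data is an assumption here; the names returned by load() tell you what was actually restored.

```r
# Stand-in for the SummarizedExperiment returned by GDCprepare()
data <- list(note = "placeholder object")
rda_file <- file.path(tempdir(), "CHOL_RNAseq_demo.rda")
save(data, file = rda_file)

rm(data)
restored <- load(rda_file)  # load() returns the names of the restored objects
restored                    # "data"
data$note                   # "placeholder object"
```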

  • While reading the data, GDCprepare() automatically annotates sample and gene information. This is not yet supported for every data type.
library(SummarizedExperiment)
# Expression matrix
dim(assay(data))
#[1] 19947    45
assays(data)
# List of length 1
# names(1): normalized_count
assay(data, "normalized_count")[1:4,1:4]
#       TCGA-3X-AAV9-01A-72R-A41I-07 TCGA-3X-AAVC-01A-21R-A41I-07 TCGA-W5-AA2R-11A-11R-A41I-07 TCGA-ZH-A8Y4-01A-11R-A41I-07
# A1BG                      70.9581                      29.9768                  108409.2249                    1485.0630
# A2M                    23986.2548                    8129.6961                   98095.2358                    7119.1570
# NAT1                      72.4007                      52.8682                     160.2275                      76.5504
# NAT2                       8.7099                       0.0000                    1472.3868                      23.2558

# Sample (clinical) information
dim(colData(data))
#[1]  45 205
colData(data)[1:4,1:4]
# DataFrame with 4 rows and 4 columns
#                                         barcode      patient           sample shortLetterCode
#                                         <character>  <character>      <character>     <character>
# TCGA-3X-AAV9-01A-72R-A41I-07 TCGA-3X-AAV9-01A-72R.. TCGA-3X-AAV9 TCGA-3X-AAV9-01A              TP
# TCGA-3X-AAVC-01A-21R-A41I-07 TCGA-3X-AAVC-01A-21R.. TCGA-3X-AAVC TCGA-3X-AAVC-01A              TP
# TCGA-W5-AA2R-11A-11R-A41I-07 TCGA-W5-AA2R-11A-11R.. TCGA-W5-AA2R TCGA-W5-AA2R-11A              NT
# TCGA-ZH-A8Y4-01A-11R-A41I-07 TCGA-ZH-A8Y4-01A-11R.. TCGA-ZH-A8Y4 TCGA-ZH-A8Y4-01A              TP

# Different gene ID types
dim(rowData(data))
#[1] 19947     3
rowData(data)[1:6,1:3]
# DataFrame with 6 rows and 3 columns
#                   gene_id entrezgene ensembl_gene_id
#                   <character>  <integer>     <character>
# A1BG                 A1BG          1 ENSG00000121410
# A2M                   A2M          2 ENSG00000175899
# NAT1                 NAT1          9 ENSG00000171428
# NAT2                 NAT2         10 ENSG00000156006
# RP11-986E7.7 RP11-986E7.7         12 ENSG00000273259
# AADAC               AADAC         13 ENSG00000114771


# Gene coordinates
rowRanges(data)
# GRanges object with 19947 ranges and 3 metadata columns:
#           seqnames              ranges strand |      gene_id entrezgene ensembl_gene_id
#         <Rle>           <IRanges>  <Rle> |  <character>  <integer>     <character>
# A1BG    chr19   58856544-58864865      - |         A1BG          1 ENSG00000121410
# A2M    chr12     9220260-9268825      - |          A2M          2 ENSG00000175899
# NAT1     chr8   18027986-18081198      + |         NAT1          9 ENSG00000171428
# NAT2     chr8   18248755-18258728      + |         NAT2         10 ENSG00000156006
# RP11-986E7.7    chr14   95058395-95090983      + | RP11-986E7.7         12 ENSG00000273259
# ...      ...                 ...    ... .          ...        ...             ...
# RASAL2-AS1     chr1 178060643-178063119      - |   RASAL2-AS1  100302401 ENSG00000224687
# LINC00882     chr3 106555658-106959488      - |    LINC00882  100302640 ENSG00000242759
# FTX     chrX   73183790-73513409      - |          FTX  100302692 ENSG00000230590
# TICAM2     chr5 114914339-114961876      - |       TICAM2  100302736 ENSG00000243414
# SLC25A5-AS1     chrX 118599997-118603061      - |  SLC25A5-AS1  100303728 ENSG00000224281
# -------
# seqinfo: 24 sequences from an unspecified genome; no seqlengths
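
Before a differential analysis, the columns of the expression matrix are usually grouped by sample type. A small sketch using the four genes and four samples printed earlier: the matrix is typed in by hand from that preview rather than taken from the real object, and the grouping simply treats code 11 as normal and everything else as tumor.

```r
# Values copied from the assay() preview (4 genes x 4 samples), column by column
expr <- matrix(
  c(70.9581, 23986.2548, 72.4007, 8.7099,
    29.9768, 8129.6961, 52.8682, 0.0000,
    108409.2249, 98095.2358, 160.2275, 1472.3868,
    1485.0630, 7119.1570, 76.5504, 23.2558),
  nrow = 4,
  dimnames = list(
    c("A1BG", "A2M", "NAT1", "NAT2"),
    c("TCGA-3X-AAV9-01A-72R-A41I-07", "TCGA-3X-AAVC-01A-21R-A41I-07",
      "TCGA-W5-AA2R-11A-11R-A41I-07", "TCGA-ZH-A8Y4-01A-11R-A41I-07")
  )
)

# Derive the group from the sample-type code (11 = normal, otherwise tumor)
code  <- substr(colnames(expr), 14, 15)
group <- factor(ifelse(code == "11", "NT", "TP"), levels = c("NT", "TP"))

log_expr <- log2(expr + 1)           # log transform with a pseudocount of 1
table(group)                         # NT: 1, TP: 3
rowMeans(log_expr[, group == "TP"])  # per-gene mean over the tumor samples
```

With the real object the same grouping is available directly as colData(data)$shortLetterCode.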

That is the whole workflow of finding, downloading, and reading the data. Next, the analysis can begin.

Supplement: patient clinical data and tumor subtypes

1. Getting patient clinical data

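  • As shown above, GDCprepare() automatically annotates the samples with clinical information.
  • We can also download each patient's clinical data separately in advance, for reference.
Method 1: the GDCquery() pipeline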
query <- GDCquery(project = "TCGA-BRCA", 
                  data.category = "Clinical",
                  data.type = "Clinical Supplement", 
                  data.format = "BCR Biotab")
GDCdownload(query, files.per.chunk = 20)
clinical.BCRtab.all <- GDCprepare(query)


grep("clinical_", names(clinical.BCRtab.all), value = T)
# [1] "clinical_drug_brca"               "clinical_omf_v4.0_brca"          
# [3] "clinical_follow_up_v4.0_brca"     "clinical_follow_up_v1.5_brca"    
# [5] "clinical_follow_up_v4.0_nte_brca" "clinical_patient_brca"           
# [7] "clinical_radiation_brca"          "clinical_nte_brca"               
# [9] "clinical_follow_up_v2.1_brca" 
clinical_patient_brca = as.data.frame(clinical.BCRtab.all$clinical_patient_brca)
clinical_patient_brca[1:4,1:4]
#                       bcr_patient_uuid bcr_patient_barcode form_completion_date                  prospective_collection
# 1                     bcr_patient_uuid bcr_patient_barcode form_completion_date tissue_prospective_collection_indicator
# 2                              CDE_ID:      CDE_ID:2003301              CDE_ID:                          CDE_ID:3088492
# 3 6E7D5EC6-A469-467C-B748-237353C23416        TCGA-3C-AAAU            2014-1-13                                      NO
# 4 55262FCB-1B01-4480-B322-36570430C917        TCGA-3C-AALI            2014-7-28                                      NO
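
Note that the first two rows of the printed table are not patients: row 1 repeats the column names and row 2 holds CDE_ID codes. A minimal sketch of dropping them, using a hand-typed copy of two columns from the preview rather than the downloaded table:

```r
# Hand-typed copy of the first rows of two columns from the preview
clin <- data.frame(
  bcr_patient_uuid    = c("bcr_patient_uuid", "CDE_ID:",
                          "6E7D5EC6-A469-467C-B748-237353C23416",
                          "55262FCB-1B01-4480-B322-36570430C917"),
  bcr_patient_barcode = c("bcr_patient_barcode", "CDE_ID:2003301",
                          "TCGA-3C-AAAU", "TCGA-3C-AALI"),
  stringsAsFactors = FALSE
)

# Keep only rows whose barcode looks like a TCGA patient ID
clin_clean <- clin[grepl("^TCGA-", clin$bcr_patient_barcode), ]
rownames(clin_clean) <- NULL
clin_clean$bcr_patient_barcode  # "TCGA-3C-AAAU" "TCGA-3C-AALI"
```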
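Method 2: GDCquery_clinic()
  • According to the official description, this function downloads the indexed clinical data: a refined clinical data that is created using the XML files (Method 1).
  • This method downloads faster, so try it first. If it lacks the information you need, fall back to Method 1.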
clinical <- GDCquery_clinic(project = "TCGA-BRCA", type = "clinical")
clinical[1:4,1:4]
#   submitter_id synchronous_malignancy ajcc_pathologic_stage tumor_stage
# 1 TCGA-E2-A14U                     No               Stage I     stage i
# 2 TCGA-E9-A1RC                     No            Stage IIIC  stage iiic
# 3 TCGA-D8-A1J9                     No              Stage IA    stage ia
# 4 TCGA-E2-A14P                     No            Stage IIIC  stage iiic
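
To attach these patient-level records to sample-level data, the patient ID (the first 12 characters of a barcode) can serve as the join key. A sketch with base R merge(): the two clinical rows are copied from the preview above, while the sample barcodes are made up for illustration.

```r
# Two clinical rows copied from the preview
clinical_demo <- data.frame(
  submitter_id          = c("TCGA-E2-A14U", "TCGA-E9-A1RC"),
  ajcc_pathologic_stage = c("Stage I", "Stage IIIC"),
  stringsAsFactors = FALSE
)

# Made-up sample barcodes belonging to those patients
samples <- data.frame(
  sample = c("TCGA-E2-A14U-01A", "TCGA-E2-A14U-11A", "TCGA-E9-A1RC-01A"),
  stringsAsFactors = FALSE
)
samples$submitter_id <- substr(samples$sample, 1, 12)

# Left join: every sample keeps its row and gains the patient's stage
annotated <- merge(samples, clinical_demo, by = "submitter_id", all.x = TRUE)
annotated[, c("sample", "ajcc_pathologic_stage")]
```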

2. Getting patient tumor subtypes

  • PanCancerAtlas_subtypes()
    The columns “Subtype_Selected” was selected as most prominent subtype classification (from the other columns)
subtypes <- PanCancerAtlas_subtypes()
dim(subtypes)
#[1] 7734   10
table(subtypes$cancer.type)
# ACC  AML BLCA BRCA COAD ESCA  GBM HNSC KICH KIRC KIRP  LGG LIHC LUAD LUSC OVCA PCPG 
# 91  187  129 1218  341  169  606  279   66  442  161  516  196  230  178  489  178 
# PRAD READ SKCM STAD THCA UCEC  UCS 
# 333  118  333  383  496  538   57
head(as.data.frame(subtypes))
#   pan.samplesID cancer.type                         Subtype_mRNA   Subtype_DNAmeth Subtype_protein Subtype_miRNA Subtype_CNA Subtype_Integrative Subtype_other      Subtype_Selected
# 1  TCGA-OR-A5J1         ACC steroid-phenotype-high+proliferation         CIMP-high              NA       miRNA_1       Quiet                COC3           C1A         ACC.CIMP-high
# 2  TCGA-OR-A5J2         ACC steroid-phenotype-high+proliferation          CIMP-low               1       miRNA_1       Noisy                COC3           C1A          ACC.CIMP-low
# 3  TCGA-OR-A5J3         ACC               steroid-phenotype-high CIMP-intermediate               3       miRNA_6 Chromosomal                COC2           C1A ACC.CIMP-intermediate
# 4  TCGA-OR-A5J4         ACC                                 <NA>         CIMP-high              NA       miRNA_6 Chromosomal                <NA>          <NA>         ACC.CIMP-high
# 5  TCGA-OR-A5J5         ACC               steroid-phenotype-high CIMP-intermediate              NA       miRNA_2 Chromosomal                COC2           C1A ACC.CIMP-intermediate
# 6  TCGA-OR-A5J6         ACC                steroid-phenotype-low          CIMP-low               2       miRNA_1       Noisy                COC1           C1B          ACC.CIMP-low
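
The Subtype_Selected values carry the cancer type as a prefix (e.g. ACC.CIMP-high). A small sketch of stripping that prefix and tabulating the subtypes, using a hand-typed copy of the six rows shown above instead of the full table:

```r
# Hand-typed copy of three columns from the head() output
subtypes_demo <- data.frame(
  pan.samplesID    = paste0("TCGA-OR-A5J", 1:6),
  cancer.type      = "ACC",
  Subtype_Selected = c("ACC.CIMP-high", "ACC.CIMP-low", "ACC.CIMP-intermediate",
                       "ACC.CIMP-high", "ACC.CIMP-intermediate", "ACC.CIMP-low"),
  stringsAsFactors = FALSE
)

# Remove everything up to and including the first dot
subtypes_demo$subtype <- sub("^[^.]+\\.", "", subtypes_demo$Subtype_Selected)
table(subtypes_demo$subtype)
```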
  • TCGAquery_subtype()
    These subtypes will be automatically added in the summarizedExperiment object through GDCprepare. But you can also use the TCGAquery_subtype function to retrieve this information.
brca.subtype <- TCGAquery_subtype(tumor = "brca")
t(brca.subtype[1,])
#                                     [,1]          
# patient                             "TCGA-3C-AAAU"
# Tumor.Type                          "BRCA"        
# Included_in_previous_marker_papers  "NO"          
# vital_status                        "Alive"       
# days_to_birth                       "-20211"      
# days_to_death                       "NA"          
# days_to_last_followup               "4047"        
# age_at_initial_pathologic_diagnosis "55"          
# pathologic_stage                    "NA"          
# Tumor_Grade                         "NA"          
# BRCA_Pathology                      "NA"          
# BRCA_Subtype_PAM50                  "LumA"        
# MSI_status                          "NA"          
# HPV_Status                          "NA"          
# tobacco_smoking_history             "NA"          
# CNV Clusters                        "C6"          
# Mutation Clusters                   "C7"          
# DNA.Methylation Clusters            "C1"          
# mRNA Clusters                       "C1"          
# miRNA Clusters                      "C3"          
# lncRNA Clusters                     "NA"          
# Protein Clusters                    "NA"          
# PARADIGM Clusters                   "C5"          
# Pan-Gyn Clusters                    "NA"

GDCquery_Maf() can download mutation data. I will not cover it here for now and may look into it later.
