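TCGAbiolinks is a one-stop R package for analyzing TCGA data: it covers the whole workflow of downloading, analyzing, and visualizing TCGA data. This series of notes follows the TCGAbiolinks vignettes to relearn the TCGA data-mining workflow.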
- Official documentation: https://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/index.html
- Paper: TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data https://pubmed.ncbi.nlm.nih.gov/26704973/
I. Finding the TCGA data of interest
GDCquery()
GDCquery(
project,
data.category,
data.type,
workflow.type,
legacy = FALSE,
access,
platform,
file.type,
barcode,
data.format,
experimental.strategy,
sample.type
)
1. Available parameters
1.1 By tumor type
- project parameter: one or more TCGA project names of interest. As the code below shows, there are 33 TCGA cancer types in total.
projects = TCGAbiolinks::getGDCprojects()$project_id
TCGAs = grep("TCGA", projects, value = TRUE)
sort(TCGAs)
# [1] "TCGA-ACC" "TCGA-BLCA" "TCGA-BRCA" "TCGA-CESC" "TCGA-CHOL" "TCGA-COAD"
# [7] "TCGA-DLBC" "TCGA-ESCA" "TCGA-GBM" "TCGA-HNSC" "TCGA-KICH" "TCGA-KIRC"
# [13] "TCGA-KIRP" "TCGA-LAML" "TCGA-LGG" "TCGA-LIHC" "TCGA-LUAD" "TCGA-LUSC"
# [19] "TCGA-MESO" "TCGA-OV" "TCGA-PAAD" "TCGA-PCPG" "TCGA-PRAD" "TCGA-READ"
# [25] "TCGA-SARC" "TCGA-SKCM" "TCGA-STAD" "TCGA-TGCT" "TCGA-THCA" "TCGA-THYM"
# [31] "TCGA-UCEC" "TCGA-UCS" "TCGA-UVM"
Study Abbreviation | Study Name | Chinese name |
---|---|---|
ACC | Adrenocortical carcinoma | 腎上腺皮質癌 |
BLCA | Bladder Urothelial Carcinoma | 膀胱尿路上皮癌 |
BRCA | Breast invasive carcinoma | 浸潤性乳腺癌 |
CESC | Cervical squamous cell carcinoma and endocervical adenocarcinoma | 宮頸鱗狀細胞癌和宮頸內腺癌 |
CHOL | Cholangiocarcinoma | 膽管癌 |
COAD | Colon adenocarcinoma | 結腸腺癌 |
DLBC | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma | 淋巴樣腫瘤彌漫大B細胞淋巴瘤 |
ESCA | Esophageal carcinoma | 食管癌 |
GBM | Glioblastoma multiforme | 多形性成膠質細胞瘤 |
HNSC | Head and Neck squamous cell carcinoma | 頭頸部鱗狀細胞癌 |
KICH | Kidney Chromophobe | 腎嫌色細胞癌 |
KIRC | Kidney renal clear cell carcinoma | 腎透明細胞癌 |
KIRP | Kidney renal papillary cell carcinoma | 腎乳頭狀細胞癌 |
LAML | Acute Myeloid Leukemia | 急性髓系白血病 |
LGG | Brain Lower Grade Glioma | 腦低級別膠質瘤 |
LIHC | Liver hepatocellular carcinoma | 肝臟肝細胞癌 |
LUAD | Lung adenocarcinoma | 肺腺癌 |
LUSC | Lung squamous cell carcinoma | 肺鱗癌 |
MESO | Mesothelioma | 間皮瘤 |
OV | Ovarian serous cystadenocarcinoma | 卵巢漿液性囊腺癌 |
PAAD | Pancreatic adenocarcinoma | 胰腺腺癌 |
PCPG | Pheochromocytoma and Paraganglioma | 嗜鉻細胞瘤和副神經節瘤 |
PRAD | Prostate adenocarcinoma | 前列腺腺癌 |
READ | Rectum adenocarcinoma | 直腸腺癌 |
SARC | Sarcoma | 肉瘤 |
SKCM | Skin Cutaneous Melanoma | 皮膚黑色素瘤 |
STAD | Stomach adenocarcinoma | 胃腺癌 |
TGCT | Testicular Germ Cell Tumors | 睪丸生殖細胞腫瘤 |
THCA | Thyroid carcinoma | 甲狀腺癌 |
THYM | Thymoma | 胸腺瘤 |
UCEC | Uterine Corpus Endometrial Carcinoma | 子宮內膜癌 |
UCS | Uterine Carcinosarcoma | 子宮癌肉瘤 |
UVM | Uveal Melanoma | 葡萄膜黑色素瘤 |
1.2 hg19/hg38
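- Depending on the reference genome, there are two sets of data: the GDC Legacy Archive (mainly GRCh37/hg19) and the GDC harmonized database (GRCh38/hg38).
- The legacy parameter chooses between them: the default, FALSE, queries the harmonized database (hg38); TRUE queries the Legacy Archive (hg19).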
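1.3 Data type to download
Building on the parameters above, the following parameters specify the target data type:
- data.category = which category of data to download, e.g. omics data, clinical data, and so on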
# List the data categories available for one tumor type
TCGAbiolinks:::getProjectSummary("TCGA-BRCA")$data_categories
# file_count case_count data_category
# 1 4679 1098 Sequencing Reads
# 2 1183 1098 Clinical
# 3 6627 1098 Copy Number Variation
# 4 5315 1098 Biospecimen
# 5 1234 1095 DNA Methylation
# 6 6080 1097 Transcriptome Profiling
# 7 8648 1044 Simple Nucleotide Variation
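- data.type = a finer-grained choice of data type (optional)
- workflow.type = the same sequencing data may be processed by different pipelines (optional, harmonized only)
- platform = sequencing platform (optional)
- file.type = the specific kind of data file (optional, legacy only)
If you do not know these values for your target data, refer to the overview below.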
GDC harmonized database
Data.category | Data.type | Workflow.Type | Platform |
---|---|---|---|
Transcriptome Profiling | Gene Expression Quantification | HTSeq - Counts | |
Transcriptome Profiling | Gene Expression Quantification | HTSeq - FPKM | |
Transcriptome Profiling | Gene Expression Quantification | HTSeq - FPKM-UQ | |
Transcriptome Profiling | Gene Expression Quantification | STAR - Counts | |
Transcriptome Profiling | Isoform Expression Quantification | - | |
Transcriptome Profiling | miRNA Expression Quantification | - | |
Transcriptome Profiling | Splice Junction Quantification | | |
Copy number variation | Copy Number Segment | | |
Copy number variation | Masked Copy Number Segment | | |
Copy number variation | Gene Level Copy Number Scores | | |
Simple Nucleotide Variation | Masked Somatic Mutation | MuSE Variant Aggregation and Masking | |
Simple Nucleotide Variation | Masked Somatic Mutation | MuTect2 Variant Aggregation and Masking | |
Simple Nucleotide Variation | Masked Somatic Mutation | SomaticSniper Variant Aggregation and Masking | |
Simple Nucleotide Variation | Masked Somatic Mutation | VarScan2 Variant Aggregation and Masking | |
Raw Sequencing Data | - | | |
Biospecimen | Slide Image | | |
Biospecimen | Biospecimen Supplement | | |
Clinical | - | | |
DNA Methylation | Methylation Beta Value | | Illumina Human Methylation 450 |
DNA Methylation | Methylation Beta Value | | Illumina Human Methylation 27 |
GDC Legacy Archive
Data.category | Data.type | Platform | file.type |
---|---|---|---|
Copy number variation | - | Affymetrix SNP Array 6.0 | nocnv_hg18.seg |
Copy number variation | - | Affymetrix SNP Array 6.0 | hg18.seg |
Copy number variation | - | Affymetrix SNP Array 6.0 | nocnv_hg19.seg |
Copy number variation | - | Affymetrix SNP Array 6.0 | hg19.seg |
Copy number variation | - | Illumina HiSeq | - |
Simple nucleotide variation | Simple somatic mutation | | |
Raw sequencing data | | | |
Biospecimen | | | |
Clinical | | | |
Protein expression | | MDA RPPA Core | - |
Gene expression | Gene expression quantification | Illumina HiSeq | normalized_results |
Gene expression | Gene expression quantification | Illumina HiSeq | results |
Gene expression | Gene expression quantification | HT_HG-U133A | - |
Gene expression | Gene expression quantification | AgilentG4502A_07_2 | - |
Gene expression | Gene expression quantification | AgilentG4502A_07_1 | - |
Gene expression | Gene expression quantification | HuEx-1_0-st-v2 | FIRMA.txt |
Gene expression | Gene expression quantification | | gene.txt |
Gene expression | Isoform expression quantification | - | - |
Gene expression | miRNA gene quantification | - | hg19.mirna |
Gene expression | miRNA gene quantification | | hg19.mirbase20 |
Gene expression | miRNA gene quantification | | mirna |
Gene expression | Exon junction quantification | - | - |
Gene expression | Exon quantification | - | - |
Gene expression | miRNA isoform quantification | - | hg19.isoform |
Gene expression | miRNA isoform quantification | - | isoform |
DNA methylation | | Illumina Human Methylation 450 | Not used |
DNA methylation | | Illumina Human Methylation 27 | Not used |
DNA methylation | | Illumina DNA Methylation OMA003 CPI | Not used |
DNA methylation | | Illumina DNA Methylation OMA002 CPI | Not used |
DNA methylation | | Illumina Hi Seq | |
DNA methylation | Bisulfite sequence alignment | | |
DNA methylation | Methylation percentage | | |
DNA methylation | Aligned reads | | |
Raw microarray data | Raw intensities | Illumina Human Methylation 450 | idat |
Raw microarray data | Raw intensities | Illumina Human Methylation 27 | idat |
Structural Rearrangement | | | |
Other | | | |
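The overview tables can also be kept as a small lookup table in R, which makes it easy to check which workflow.type values go with a given data.category and data.type. The following is only a sketch: the data frame is a hand-typed subset of the harmonized table above, and valid_workflows() is a hypothetical helper, not a TCGAbiolinks function.

```r
# Hand-typed subset of the harmonized overview table (not fetched from GDC).
harmonized <- data.frame(
  data.category = c(rep("Transcriptome Profiling", 4),
                    rep("Simple Nucleotide Variation", 4)),
  data.type     = c(rep("Gene Expression Quantification", 4),
                    rep("Masked Somatic Mutation", 4)),
  workflow.type = c("HTSeq - Counts", "HTSeq - FPKM",
                    "HTSeq - FPKM-UQ", "STAR - Counts",
                    "MuSE Variant Aggregation and Masking",
                    "MuTect2 Variant Aggregation and Masking",
                    "SomaticSniper Variant Aggregation and Masking",
                    "VarScan2 Variant Aggregation and Masking"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: list the workflow types recorded for one
# data.category / data.type combination.
valid_workflows <- function(tbl, category, type) {
  keep <- tbl$data.category == category & tbl$data.type == type
  tbl$workflow.type[keep]
}

wf <- valid_workflows(harmonized,
                      "Transcriptome Profiling",
                      "Gene Expression Quantification")
print(wf)
```

The values actually offered by GDC change between data releases, so a lookup like this is best treated as a personal cheat sheet rather than a source of truth.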
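1.4 Sample barcode
Full barcode: e.g. TCGA-G4-6317-02A-11D-2064-05. The barcode encodes everything from the patient of origin through sequencing and analysis. As the figure below shows, the most important parts are Participant, Sample, and Portion, which identify the patient, the sample type, and the portion together with its analyte (the molecule type, e.g. D for DNA, R for RNA).
Patient ID: e.g. TCGA-G4-6317
Sample ID: e.g. TCGA-G4-6317-02
- The two-digit Sample code, which gives the sample type, matters most: it is the basis for grouping samples in downstream differential analysis. The corresponding meanings are listed below. For example, 01 denotes the patient's primary solid tumor tissue and 11 denotes normal tissue from the patient.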
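To make the barcode layout concrete, here is a minimal base-R sketch that splits a full barcode into its fields. parse_tcga_barcode() is a hypothetical helper written for these notes, not part of TCGAbiolinks, and it assumes a complete seven-field barcode.

```r
# Split a full TCGA barcode into named fields using base R only.
# Assumes the complete 7-field form, e.g. TCGA-G4-6317-02A-11D-2064-05.
parse_tcga_barcode <- function(barcode) {
  parts <- strsplit(barcode, "-", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 7)
  list(
    project     = parts[1],                 # "TCGA"
    tss         = parts[2],                 # tissue source site
    participant = parts[3],                 # patient within the site
    sample_code = substr(parts[4], 1, 2),   # two-digit sample type
    vial        = substr(parts[4], 3, 3),
    portion     = substr(parts[5], 1, 2),
    analyte     = substr(parts[5], 3, 3),   # D = DNA, R = RNA, ...
    plate       = parts[6],
    center      = parts[7],
    patient_id  = paste(parts[1:3], collapse = "-"),
    sample_id   = paste(c(parts[1:3], substr(parts[4], 1, 2)),
                        collapse = "-")
  )
}

b <- parse_tcga_barcode("TCGA-G4-6317-02A-11D-2064-05")
print(b$patient_id)
print(b$sample_id)
print(b$sample_code)
```

The patient_id and sample_id fields correspond to the two shortened forms shown above.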
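Based on this, we can also set the sample.type = parameter to download only the sample types of interest, e.g. sample.type = "Primary Tumor".
For a given set of TCGA barcodes, TCGAquery_SampleTypes() extracts the samples of a target group, and TCGAquery_MatchedCoupledSampleTypes() extracts paired samples that come from the same patient.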
query <- GDCquery(project = c("TCGA-BRCA"),
legacy = FALSE, #default(GDC harmonized database)
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts")
dim(getResults(query))
#[1] 1222 29
query_info = getResults(query)
TP = TCGAquery_SampleTypes(query_info$sample.submitter_id,"TP")
NT = TCGAquery_SampleTypes(query_info$sample.submitter_id,"NT")
query <- GDCquery(project = c("TCGA-BRCA"),
legacy = FALSE, #default(GDC harmonized database)
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts",
barcode = c(TP, NT))
dim(getResults(query))
#[1] 1215 29
Pair_sample = TCGAquery_MatchedCoupledSampleTypes(query_info$sample.submitter_id,c("NT","TP"))
query <- GDCquery(project = c("TCGA-BRCA"),
legacy = FALSE, #default(GDC harmonized database)
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts",
barcode = Pair_sample)
dim(getResults(query))
#[1] 229 29
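The idea behind these two helper functions can be illustrated in base R. This is only a rough sketch of the grouping logic, not the package's implementation, and the barcodes are made up for the example.

```r
# Made-up sample-level barcodes (patient ID + sample code + vial).
barcodes <- c("TCGA-A1-0001-01A", "TCGA-A1-0001-11A",
              "TCGA-A2-0002-01A", "TCGA-A3-0003-11A")

patient     <- substr(barcodes, 1, 12)   # e.g. "TCGA-A1-0001"
sample_code <- substr(barcodes, 14, 15)  # e.g. "01" or "11"

# Group by sample type: 01 = primary tumor (TP), 11 = normal tissue (NT).
tp <- barcodes[sample_code == "01"]
nt <- barcodes[sample_code == "11"]

# Paired samples: patients that contribute both a TP and an NT sample.
paired_patients <- intersect(patient[sample_code == "01"],
                             patient[sample_code == "11"])
paired <- barcodes[patient %in% paired_patients &
                   sample_code %in% c("01", "11")]

print(tp)
print(nt)
print(paired)
```

Only the first patient has both a tumor and a normal sample here, so only its two barcodes survive the pairing step.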
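These are the most common criteria for querying TCGA data. A few parameters were not covered; see the function's help page. Set the parameters above flexibly to suit your own goal.
2. Query examples
2.1 Cholangiocarcinoma transcriptome data | hg19 | all samples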
TCGAbiolinks:::getProjectSummary("TCGA-CHOL",legacy = TRUE)$data_categories
# file_count case_count data_category
# 1 30 30 Protein expression
# 2 680 36 Copy number variation
# 3 51 51 Biospecimen
# 4 444 36 Simple nucleotide variation
# 5 450 36 Gene expression
# 6 686 36 Raw microarray data
# 7 45 36 DNA methylation
# 8 193 51 Clinical
# 9 365 51 Raw sequencing data
query <- GDCquery(project = "TCGA-CHOL",
legacy = TRUE,
data.category = "Gene expression",
data.type = "Gene expression quantification",
platform = "Illumina HiSeq",
file.type = "normalized_results")
dim(getResults(query))
#[1] 45 32
t(getResults(query)[1,])
# 1
# id "34216957-50e3-434c-8c38-72f0f2ddcf16"
# data_format "TXT"
# access "open"
# cases "TCGA-3X-AAV9-01A-72R-A41I-07"
# file_name "unc.edu.59012a78-0e8f-4b99-af97-0dbb1d3d0513.2538862.rsem.genes.normalized_results"
# submitter_id NA
# data_category "Gene expression"
# type "file"
# file_size 437196
# platform "Illumina HiSeq"
# state_comment NA
# tags character,3
# updated_datetime "2017-03-05T10:11:44.298823-06:00"
# md5sum "23836c9f9bdb053c567d91a67b62159d"
# file_id "34216957-50e3-434c-8c38-72f0f2ddcf16"
# data_type "Gene expression quantification"
# state "live"
# experimental_strategy "RNA-Seq"
# file_state "submitted"
# version "1"
# data_release "0.0 - 29.0"
# project "TCGA-CHOL"
# center_id "ee7a85b3-8177-5d60-a10c-51180eb9009c"
# center_center_type "CGCC"
# center_code "07"
# center_name "University of North Carolina"
# center_namespace "unc.edu"
# center_short_name "UNC"
# sample_type "Primary Tumor"
# is_ffpe FALSE
# cases.submitter_id "TCGA-3X-AAV9"
# sample.submitter_id "TCGA-3X-AAV9-01A"
2.2 Lung adenocarcinoma transcriptome data | hg38 | primary tumor + normal tissue
TCGAbiolinks:::getProjectSummary("TCGA-LUAD",legacy = FALSE)$data_categories
# 4 2916 519 Transcriptome Profiling
query <- GDCquery(project = "TCGA-LUAD",
legacy = FALSE,
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts")
dim(getResults(query))
#[1] 594 29
2.3 Breast cancer methylation data | hg19 | Illumina Human Methylation 450 platform
TCGAbiolinks:::getProjectSummary("TCGA-BRCA",legacy = TRUE)$data_categories
#7 1250 1097 DNA methylation
query <- GDCquery(project = "TCGA-BRCA",
legacy = TRUE,
data.category = "DNA methylation",
platform = "Illumina Human Methylation 450")
dim(getResults(query))
#[1] 895 32
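II. Downloading the data selected by the query
- GDCdownload() is simple to use: just pass the query obtained in the previous step.
- Two download methods are available, api and client. The former is faster but occasionally unstable; the latter is slower. The api method (the default) is recommended. When downloading large files, set files.per.chunk = n to download in batches of n files each, so that an error partway through does not waste everything downloaded so far.
- directory sets the destination folder; by default a GDCdata folder is created and used.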
GDCdownload(
query,
token.file,
method = "api",
directory = "GDCdata",
files.per.chunk = NULL
)
- Example
query <- GDCquery(project = "TCGA-CHOL",
legacy = TRUE,
data.category = "Gene expression",
data.type = "Gene expression quantification",
platform = "Illumina HiSeq",
file.type = "normalized_results")
GDCdownload(query, files.per.chunk = 10)
# Downloading data for project TCGA-CHOL
# GDCdownload will download 45 files. A total of 19.580796 MB
# Downloading chunk 1 of 5 (10 files, size = 4.351703 MB) as Wed_Aug_18_21_52_08_2021_0.tar.gz
# Downloading: 1.9 MB Downloading chunk 2 of 5 (10 files, size = 4.350318 MB) as Wed_Aug_18_21_52_08_2021_1.tar.gz
# Downloading: 1.8 MB Downloading chunk 3 of 5 (10 files, size = 4.351067 MB) as Wed_Aug_18_21_52_08_2021_2.tar.gz
# Downloading: 1.8 MB Downloading chunk 4 of 5 (10 files, size = 4.353528 MB) as Wed_Aug_18_21_52_08_2021_3.tar.gz
# Downloading: 1.9 MB Downloading chunk 5 of 5 (5 files, size = 2.17418 MB) as Wed_Aug_18_21_52_08_2021_4.tar.gz
# Downloading: 900 kB
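The chunk counts in the log above follow directly from the file count and files.per.chunk. A small base-R sketch of that arithmetic (it only illustrates the batching, and is not how GDCdownload() is implemented internally):

```r
# 45 files downloaded in batches of 10, as in the log above.
n_files         <- 45
files_per_chunk <- 10

# Assign each file index to a chunk number, then split into chunks.
file_idx <- seq_len(n_files)
chunks   <- split(file_idx, ceiling(file_idx / files_per_chunk))

n_chunks    <- length(chunks)
chunk_sizes <- unname(lengths(chunks))

print(n_chunks)
print(chunk_sizes)
```

This gives five chunks, the last one holding the remaining five files, which matches "chunk 5 of 5 (5 files ...)" in the log.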
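III. Reading the downloaded files into the current session
- GDCprepare() takes the query object and the directory where the data was stored (GDCdata by default), reads the data, and returns it as a SummarizedExperiment object.
- You can also set save = TRUE and save.filename = **** so that the SummarizedExperiment object is automatically saved as an Rdata file after reading, for convenient reuse later (save defaults to FALSE).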
query <- GDCquery(project = "TCGA-CHOL",
legacy = TRUE,
data.category = "Gene expression",
data.type = "Gene expression quantification",
platform = "Illumina HiSeq",
file.type = "normalized_results")
GDCdownload(query, files.per.chunk = 10)
data <- GDCprepare(query, save = T, save.filename = "CHOL_RNAseq.rda")
# -------------------
# oo Reading 45 files
# -------------------
# |=================================================|100% Completed after 0 s
# -------------------
# oo Merging 45 files
# -------------------
# Starting to add information to samples
# => Add clinical information to samples
# => Adding TCGA molecular information from marker papers
# => Information will have prefix 'paper_'
# chol subtype information from:doi:10.1016/j.celrep.2017.02.033
# => Saving file: CHOL_RNAseq.rda
# => File saved
- While reading the data, GDCprepare() automatically annotates sample and gene information. This is not yet supported for every data type.
library(SummarizedExperiment)
# Expression matrix
dim(assay(data))
#[1] 19947 45
assays(data)
# List of length 1
# names(1): normalized_count
assay(data, "normalized_count")[1:4,1:4]
# TCGA-3X-AAV9-01A-72R-A41I-07 TCGA-3X-AAVC-01A-21R-A41I-07 TCGA-W5-AA2R-11A-11R-A41I-07 TCGA-ZH-A8Y4-01A-11R-A41I-07
# A1BG 70.9581 29.9768 108409.2249 1485.0630
# A2M 23986.2548 8129.6961 98095.2358 7119.1570
# NAT1 72.4007 52.8682 160.2275 76.5504
# NAT2 8.7099 0.0000 1472.3868 23.2558
# Sample (clinical) information
dim(colData(data))
#[1] 45 205
colData(data)[1:4,1:4]
# DataFrame with 4 rows and 4 columns
# barcode patient sample shortLetterCode
# <character> <character> <character> <character>
# TCGA-3X-AAV9-01A-72R-A41I-07 TCGA-3X-AAV9-01A-72R.. TCGA-3X-AAV9 TCGA-3X-AAV9-01A TP
# TCGA-3X-AAVC-01A-21R-A41I-07 TCGA-3X-AAVC-01A-21R.. TCGA-3X-AAVC TCGA-3X-AAVC-01A TP
# TCGA-W5-AA2R-11A-11R-A41I-07 TCGA-W5-AA2R-11A-11R.. TCGA-W5-AA2R TCGA-W5-AA2R-11A NT
# TCGA-ZH-A8Y4-01A-11R-A41I-07 TCGA-ZH-A8Y4-01A-11R.. TCGA-ZH-A8Y4 TCGA-ZH-A8Y4-01A TP
# Different gene ID types
dim(rowData(data))
#[1] 19947 3
rowData(data)[1:6,1:3]
# DataFrame with 6 rows and 3 columns
# gene_id entrezgene ensembl_gene_id
# <character> <integer> <character>
# A1BG A1BG 1 ENSG00000121410
# A2M A2M 2 ENSG00000175899
# NAT1 NAT1 9 ENSG00000171428
# NAT2 NAT2 10 ENSG00000156006
# RP11-986E7.7 RP11-986E7.7 12 ENSG00000273259
# AADAC AADAC 13 ENSG00000114771
# Genomic coordinates of the genes
rowRanges(data)
# GRanges object with 19947 ranges and 3 metadata columns:
# seqnames ranges strand | gene_id entrezgene ensembl_gene_id
# <Rle> <IRanges> <Rle> | <character> <integer> <character>
# A1BG chr19 58856544-58864865 - | A1BG 1 ENSG00000121410
# A2M chr12 9220260-9268825 - | A2M 2 ENSG00000175899
# NAT1 chr8 18027986-18081198 + | NAT1 9 ENSG00000171428
# NAT2 chr8 18248755-18258728 + | NAT2 10 ENSG00000156006
# RP11-986E7.7 chr14 95058395-95090983 + | RP11-986E7.7 12 ENSG00000273259
# ... ... ... ... . ... ... ...
# RASAL2-AS1 chr1 178060643-178063119 - | RASAL2-AS1 100302401 ENSG00000224687
# LINC00882 chr3 106555658-106959488 - | LINC00882 100302640 ENSG00000242759
# FTX chrX 73183790-73513409 - | FTX 100302692 ENSG00000230590
# TICAM2 chr5 114914339-114961876 - | TICAM2 100302736 ENSG00000243414
# SLC25A5-AS1 chrX 118599997-118603061 - | SLC25A5-AS1 100303728 ENSG00000224281
# -------
# seqinfo: 24 sequences from an unspecified genome; no seqlengths
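Once the data is loaded, the sample annotation is what splits the expression matrix into groups. The sketch below mimics this with plain base-R objects built from the four samples printed above, so it runs without any download. With the real object, the equivalent subsetting would be along the lines of data[, data$shortLetterCode == "TP"].

```r
# Toy stand-in for assay(data): the 4 x 4 corner printed above.
samples <- c("TCGA-3X-AAV9-01A-72R-A41I-07", "TCGA-3X-AAVC-01A-21R-A41I-07",
             "TCGA-W5-AA2R-11A-11R-A41I-07", "TCGA-ZH-A8Y4-01A-11R-A41I-07")
expr <- matrix(
  c(70.9581,     23986.2548, 72.4007,  8.7099,     # column 1
    29.9768,     8129.6961,  52.8682,  0.0000,     # column 2
    108409.2249, 98095.2358, 160.2275, 1472.3868,  # column 3
    1485.0630,   7119.1570,  76.5504,  23.2558),   # column 4
  nrow = 4,
  dimnames = list(c("A1BG", "A2M", "NAT1", "NAT2"), samples)
)

# Toy stand-in for colData(data): one row per sample.
sample_info <- data.frame(
  shortLetterCode = c("TP", "TP", "NT", "TP"),
  row.names = samples
)

# Split the columns of the matrix by sample type.
tumor  <- expr[, sample_info$shortLetterCode == "TP", drop = FALSE]
normal <- expr[, sample_info$shortLetterCode == "NT", drop = FALSE]

# A common first transformation before plotting or clustering.
log_expr <- log2(expr + 1)

print(dim(tumor))
print(dim(normal))
```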
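That is the complete workflow for finding, downloading, and reading the data; the analysis itself can start from here.
Supplement: patient clinical data and tumor subtypes
1. Getting patient clinical data
- As shown above, GDCprepare() automatically annotates the samples with clinical information.
- We can also download each patient's clinical data separately in advance, for reference.
Method 1: GDCquery() pipeline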
query <- GDCquery(project = "TCGA-BRCA",
data.category = "Clinical",
data.type = "Clinical Supplement",
data.format = "BCR Biotab")
GDCdownload(query, files.per.chunk = 20)
clinical.BCRtab.all <- GDCprepare(query)
grep("clinical_", names(clinical.BCRtab.all), value = T)
# [1] "clinical_drug_brca" "clinical_omf_v4.0_brca"
# [3] "clinical_follow_up_v4.0_brca" "clinical_follow_up_v1.5_brca"
# [5] "clinical_follow_up_v4.0_nte_brca" "clinical_patient_brca"
# [7] "clinical_radiation_brca" "clinical_nte_brca"
# [9] "clinical_follow_up_v2.1_brca"
clinical_patient_brca = as.data.frame(clinical.BCRtab.all$clinical_patient_brca)
clinical_patient_brca[1:4,1:4]
# bcr_patient_uuid bcr_patient_barcode form_completion_date prospective_collection
# 1 bcr_patient_uuid bcr_patient_barcode form_completion_date tissue_prospective_collection_indicator
# 2 CDE_ID: CDE_ID:2003301 CDE_ID: CDE_ID:3088492
# 3 6E7D5EC6-A469-467C-B748-237353C23416 TCGA-3C-AAAU 2014-1-13 NO
# 4 55262FCB-1B01-4480-B322-36570430C917 TCGA-3C-AALI 2014-7-28 NO
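Note that the first two rows of the Biotab table printed above are not patients: row 1 repeats the column names and row 2 holds CDE_ID codes. A minimal sketch of dropping them, using a toy data frame copied from that output (three of its columns) rather than a real download:

```r
# Toy copy of the first rows and columns printed above.
clin <- data.frame(
  bcr_patient_uuid     = c("bcr_patient_uuid", "CDE_ID:",
                           "6E7D5EC6-A469-467C-B748-237353C23416",
                           "55262FCB-1B01-4480-B322-36570430C917"),
  bcr_patient_barcode  = c("bcr_patient_barcode", "CDE_ID:2003301",
                           "TCGA-3C-AAAU", "TCGA-3C-AALI"),
  form_completion_date = c("form_completion_date", "CDE_ID:",
                           "2014-1-13", "2014-7-28"),
  stringsAsFactors = FALSE
)

# Keep only rows whose barcode looks like a real TCGA patient ID.
is_patient <- grepl("^TCGA-", clin$bcr_patient_barcode)
clin_clean <- clin[is_patient, ]
rownames(clin_clean) <- NULL

print(clin_clean$bcr_patient_barcode)
```

Filtering on the barcode pattern, instead of dropping rows 1 and 2 by position, keeps the step safe if a table happens to have a different number of header rows.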
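Method 2: GDCquery_clinic()
- According to the official documentation, this function downloads the indexed clinical data: a refined clinical data set created from the XML files (the source behind Method 1).
- This method downloads faster and is the recommended first choice. If it lacks the information you need, fall back to Method 1.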
clinical <- GDCquery_clinic(project = "TCGA-BRCA", type = "clinical")
clinical[1:4,1:4]
# submitter_id synchronous_malignancy ajcc_pathologic_stage tumor_stage
# 1 TCGA-E2-A14U No Stage I stage i
# 2 TCGA-E9-A1RC No Stage IIIC stage iiic
# 3 TCGA-D8-A1J9 No Stage IA stage ia
# 4 TCGA-E2-A14P No Stage IIIC stage iiic
2. Getting patient tumor subtypes
- PanCancerAtlas_subtypes(): the column "Subtype_Selected" was chosen as the most prominent subtype classification (from the other columns).
subtypes <- PanCancerAtlas_subtypes()
dim(subtypes)
#[1] 7734 10
table(subtypes$cancer.type)
# ACC AML BLCA BRCA COAD ESCA GBM HNSC KICH KIRC KIRP LGG LIHC LUAD LUSC OVCA PCPG
# 91 187 129 1218 341 169 606 279 66 442 161 516 196 230 178 489 178
# PRAD READ SKCM STAD THCA UCEC UCS
# 333 118 333 383 496 538 57
head(as.data.frame(subtypes))
# pan.samplesID cancer.type Subtype_mRNA Subtype_DNAmeth Subtype_protein Subtype_miRNA Subtype_CNA Subtype_Integrative Subtype_other Subtype_Selected
# 1 TCGA-OR-A5J1 ACC steroid-phenotype-high+proliferation CIMP-high NA miRNA_1 Quiet COC3 C1A ACC.CIMP-high
# 2 TCGA-OR-A5J2 ACC steroid-phenotype-high+proliferation CIMP-low 1 miRNA_1 Noisy COC3 C1A ACC.CIMP-low
# 3 TCGA-OR-A5J3 ACC steroid-phenotype-high CIMP-intermediate 3 miRNA_6 Chromosomal COC2 C1A ACC.CIMP-intermediate
# 4 TCGA-OR-A5J4 ACC <NA> CIMP-high NA miRNA_6 Chromosomal <NA> <NA> ACC.CIMP-high
# 5 TCGA-OR-A5J5 ACC steroid-phenotype-high CIMP-intermediate NA miRNA_2 Chromosomal COC2 C1A ACC.CIMP-intermediate
# 6 TCGA-OR-A5J6 ACC steroid-phenotype-low CIMP-low 2 miRNA_1 Noisy COC1 C1B ACC.CIMP-low
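Since pan.samplesID in the rows above is a 12-character patient ID, subtypes can be attached to sample barcodes by matching on the first 12 characters. A minimal base-R sketch with a toy subtype table copied from the output above; the sample barcodes themselves are hypothetical.

```r
# Toy subset of the subtype table printed above.
subtypes_toy <- data.frame(
  pan.samplesID    = c("TCGA-OR-A5J1", "TCGA-OR-A5J2", "TCGA-OR-A5J3"),
  Subtype_Selected = c("ACC.CIMP-high", "ACC.CIMP-low",
                       "ACC.CIMP-intermediate"),
  stringsAsFactors = FALSE
)

# Hypothetical sample barcodes; the last patient is not in the table.
sample_barcodes <- c("TCGA-OR-A5J2-01A", "TCGA-OR-A5J1-01A",
                     "TCGA-XX-0000-01A")

# Match on the 12-character patient ID; unmatched samples get NA.
patient_id <- substr(sample_barcodes, 1, 12)
annot <- data.frame(
  barcode = sample_barcodes,
  subtype = subtypes_toy$Subtype_Selected[
    match(patient_id, subtypes_toy$pan.samplesID)],
  stringsAsFactors = FALSE
)

print(annot)
```

Using match() keeps the result in the same order as the sample barcodes, which matters when the annotation is later lined up with the columns of an expression matrix.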
- TCGAquery_subtype(): these subtypes are added automatically to the SummarizedExperiment object by GDCprepare, but you can also use the TCGAquery_subtype function to retrieve this information.
brca.subtype <- TCGAquery_subtype(tumor = "brca")
t(brca.subtype[1,])
# [,1]
# patient "TCGA-3C-AAAU"
# Tumor.Type "BRCA"
# Included_in_previous_marker_papers "NO"
# vital_status "Alive"
# days_to_birth "-20211"
# days_to_death "NA"
# days_to_last_followup "4047"
# age_at_initial_pathologic_diagnosis "55"
# pathologic_stage "NA"
# Tumor_Grade "NA"
# BRCA_Pathology "NA"
# BRCA_Subtype_PAM50 "LumA"
# MSI_status "NA"
# HPV_Status "NA"
# tobacco_smoking_history "NA"
# CNV Clusters "C6"
# Mutation Clusters "C7"
# DNA.Methylation Clusters "C1"
# mRNA Clusters "C1"
# miRNA Clusters "C3"
# lncRNA Clusters "NA"
# Protein Clusters "NA"
# PARADIGM Clusters "C5"
# Pan-Gyn Clusters "NA"
The GDCquery_Maf() function supports downloading mutation data; it is not covered here and is left for a later note.