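[R] The TCGAbiolinks package: data preparation with query, download, and prepare

TCGAbiolinks is a one-stop R package for analyzing TCGA data; it covers the whole workflow of downloading, analyzing, and visualizing TCGA data. This series of notes follows the TCGAbiolinks documentation to relearn the TCGA data-mining workflow.

Official documentation: https://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

Reference: TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data https://pubmed.ncbi.nlm.nih.gov/26704973/

I. Finding the TCGA data of interest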

GDCquery()

GDCquery(
  project,
  data.category,
  data.type,
  workflow.type,
  legacy = FALSE,
  access,
  platform,
  file.type,
  barcode,
  data.format,
  experimental.strategy,
  sample.type
)

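1. Available parameters

1.1 By tumor type

The project parameter specifies one or more TCGA project names of interest.
As the code below shows, there are 33 TCGA cancer types in total.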

projects = TCGAbiolinks:::getGDCprojects()$project_id
TCGAs = grep("TCGA", projects, value = T)
sort(TCGAs)
# [1] "TCGA-ACC"  "TCGA-BLCA" "TCGA-BRCA" "TCGA-CESC" "TCGA-CHOL" "TCGA-COAD"
# [7] "TCGA-DLBC" "TCGA-ESCA" "TCGA-GBM"  "TCGA-HNSC" "TCGA-KICH" "TCGA-KIRC"
# [13] "TCGA-KIRP" "TCGA-LAML" "TCGA-LGG"  "TCGA-LIHC" "TCGA-LUAD" "TCGA-LUSC"
# [19] "TCGA-MESO" "TCGA-OV"   "TCGA-PAAD" "TCGA-PCPG" "TCGA-PRAD" "TCGA-READ"
# [25] "TCGA-SARC" "TCGA-SKCM" "TCGA-STAD" "TCGA-TGCT" "TCGA-THCA" "TCGA-THYM"
# [31] "TCGA-UCEC" "TCGA-UCS"  "TCGA-UVM"

Study Abbreviation	Study Name	Chinese Name
ACC	Adrenocortical carcinoma	腎上腺皮質癌
BLCA	Bladder Urothelial Carcinoma	膀胱尿路上皮癌
BRCA	Breast invasive carcinoma	浸潤性乳腺癌
CESC	Cervical squamous cell carcinoma and endocervical adenocarcinoma	宮頸鱗狀細胞癌和宮頸內腺癌
CHOL	Cholangiocarcinoma	膽管癌
COAD	Colon adenocarcinoma	結腸腺癌
DLBC	Lymphoid Neoplasm Diffuse Large B-cell Lymphoma	淋巴樣腫瘤彌漫大b細胞淋巴瘤
ESCA	Esophageal carcinoma	食管癌
GBM	Glioblastoma multiforme	多形性成膠質細胞瘤
HNSC	Head and Neck squamous cell carcinoma	頭頸部鱗狀細胞癌
KICH	Kidney Chromophobe	腎嫌色細胞癌
KIRC	Kidney renal clear cell carcinoma	腎透明細胞癌
KIRP	Kidney renal papillary cell carcinoma	腎乳頭狀細胞癌
LAML	Acute Myeloid Leukemia	急性髓系白血病
LGG	Brain Lower Grade Glioma	腦低級別膠質瘤
LIHC	Liver hepatocellular carcinoma	肝臟肝細胞癌
LUAD	Lung adenocarcinoma	肺腺癌
LUSC	Lung squamous cell carcinoma	肺鱗癌
MESO	Mesothelioma	間皮瘤
OV	Ovarian serous cystadenocarcinoma	卵巢漿液性囊腺癌
PAAD	Pancreatic adenocarcinoma	胰腺腺癌
PCPG	Pheochromocytoma and Paraganglioma	嗜鉻細胞瘤和副神經節瘤
PRAD	Prostate adenocarcinoma	前列腺腺癌
READ	Rectum adenocarcinoma	直腸腺癌
SARC	Sarcoma	肉瘤
SKCM	Skin Cutaneous Melanoma	皮膚黑色素瘤
STAD	Stomach adenocarcinoma	胃腺癌
TGCT	Testicular Germ Cell Tumors	睪丸生殖細胞腫瘤
THCA	Thyroid carcinoma	甲狀腺癌
THYM	Thymoma	胸腺瘤
UCEC	Uterine Corpus Endometrial Carcinoma	子宮內膜癌
UCS	Uterine Carcinosarcoma	子宮癌肉瘤
UVM	Uveal Melanoma	葡萄膜黑色素瘤

1.2 hg19/hg38

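The data come in two sets that differ mainly in reference genome: the GDC Legacy Archive (mostly GRCh37/hg19) and the GDC harmonized database (GRCh38/hg38).
The legacy parameter selects between them. The default, FALSE, queries the harmonized database (hg38); TRUE queries the Legacy Archive (hg19).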

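1.3 Data type to download

On top of the parameters above, the following parameters describe the target data type.

data.category = which category of data to download, e.g. omics data, clinical data, ...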

# List the data categories available for a given tumor type
TCGAbiolinks:::getProjectSummary("TCGA-BRCA")$data_categories
#   file_count case_count               data_category
# 1       4679       1098            Sequencing Reads
# 2       1183       1098                    Clinical
# 3       6627       1098       Copy Number Variation
# 4       5315       1098                 Biospecimen
# 5       1234       1095             DNA Methylation
# 6       6080       1097     Transcriptome Profiling
# 7       8648       1044 Simple Nucleotide Variation

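data.type = a more specific data type (optional)
workflow.type = the same sequencing data may be processed by different pipelines (optional, for harmonized data)
platform = sequencing platform (optional)
file.type = the specific data file (optional, for legacy data)
If you do not know these values for your target data, refer to the overview below.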

GDC harmonized database

Data.category	Data.type	Workflow.Type	Platform
Transcriptome Profiling	Gene Expression Quantification	HTSeq - Counts
Transcriptome Profiling	Gene Expression Quantification	HTSeq - FPKM
Transcriptome Profiling	Gene Expression Quantification	HTSeq - FPKM-UQ
Transcriptome Profiling	Gene Expression Quantification	STAR - Counts
Transcriptome Profiling	Isoform Expression Quantification	-
Transcriptome Profiling	miRNA Expression Quantification	-
Transcriptome Profiling	Splice Junction Quantification
Copy number variation	Copy Number Segment
Copy number variation	Masked Copy Number Segment
Copy number variation	Gene Level Copy Number Scores
Simple Nucleotide Variation	Masked Somatic Mutation	MuSE Variant Aggregation and Masking
Simple Nucleotide Variation	Masked Somatic Mutation	MuTect2 Variant Aggregation and Masking
Simple Nucleotide Variation	Masked Somatic Mutation	SomaticSniper Variant Aggregation and Masking
Simple Nucleotide Variation	Masked Somatic Mutation	VarScan2 Variant Aggregation and Masking
Raw Sequencing Data	-
Biospecimen	Slide Image
Biospecimen	Biospecimen Supplement
Clinical	-
DNA Methylation	Methylation Beta Value		Illumina Human Methylation 450
DNA Methylation	Methylation Beta Value		Illumina Human Methylation 27

GDC Legacy Archive

Data.category	Data.type	Platform	file.type
Copy number variation	-	Affymetrix SNP Array 6.0	nocnv_hg18.seg
Copy number variation	-	Affymetrix SNP Array 6.0	hg18.seg
Copy number variation	-	Affymetrix SNP Array 6.0	nocnv_hg19.seg
Copy number variation	-	Affymetrix SNP Array 6.0	hg19.seg
Copy number variation	-	Illumina HiSeq	-
Simple nucleotide variation	Simple somatic mutation
Raw sequencing data
Biospecimen
Clinical
Protein expression		MDA RPPA Core	-
Gene expression	Gene expression quantification	Illumina HiSeq	normalized_results
Gene expression	Gene expression quantification	Illumina HiSeq	results
Gene expression	Gene expression quantification	HT_HG-U133A	-
Gene expression	Gene expression quantification	AgilentG4502A_07_2	-
Gene expression	Gene expression quantification	AgilentG4502A_07_1	-
Gene expression	Gene expression quantification	HuEx-1_0-st-v2	FIRMA.txt
Gene expression	Gene expression quantification		gene.txt
Gene expression	Isoform expression quantification	-	-
Gene expression	miRNA gene quantification	-	hg19.mirna
Gene expression	miRNA gene quantification		hg19.mirbase20
Gene expression	miRNA gene quantification		mirna
Gene expression	Exon junction quantification	-	-
Gene expression	Exon quantification	-	-
Gene expression	miRNA isoform quantification	-	hg19.isoform
Gene expression	miRNA isoform quantification	-	isoform
DNA methylation		Illumina Human Methylation 450	Not used
DNA methylation		Illumina Human Methylation 27	Not used
DNA methylation		Illumina DNA Methylation OMA003 CPI	Not used
DNA methylation		Illumina DNA Methylation OMA002 CPI	Not used
DNA methylation		Illumina Hi Seq
DNA methylation	Bisulfite sequence alignment
DNA methylation	Methylation percentage
DNA methylation	Aligned reads
Raw microarray data	Raw intensities	Illumina Human Methylation 450	idat
Raw Microarray Data	Raw intensities	Illumina Human Methylation 27	idat
Structural Rearrangement
Other

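1.4 Sample barcode

Full barcode: e.g. TCGA-G4-6317-02A-11D-2064-05. This label encodes everything from the patient of origin through to sequencing and analysis. The most important fields are Participant, Sample, and Portion/Analyte, which give the patient number, the sample type, and the analyte type (for example D for DNA, R for RNA), respectively.
Patient ID: e.g. TCGA-G4-6317
Sample ID: e.g. TCGA-G4-6317-02

The two-digit Sample code, which gives the sample type, matters most because it defines the groups for later differential analysis. For example, 01 denotes a primary solid tumor from the patient and 11 denotes solid normal tissue from the patient.
With this in mind, the sample.type parameter can restrict the download to sample types of interest, e.g. sample.type = "Primary Tumor".
Given a set of TCGA barcodes, TCGAquery_SampleTypes() extracts the samples of a target group, and TCGAquery_MatchedCoupledSampleTypes() extracts paired samples that come from the same patient.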
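
The barcode layout can be made concrete with a small base-R sketch. The helper name parse_tcga_barcode is hypothetical (it is not part of TCGAbiolinks), the field names follow the standard TCGA barcode layout (tissue source site, participant, sample, vial, portion, analyte, plate, center), and the code assumes a full seven-part barcode.

```r
# Split a full TCGA barcode into named fields (base R only).
# parse_tcga_barcode is a hypothetical helper, not a TCGAbiolinks function.
parse_tcga_barcode <- function(barcode) {
  parts <- strsplit(barcode, "-", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 7)            # assumes a complete barcode
  list(
    project     = parts[1],                # "TCGA"
    tss         = parts[2],                # tissue source site
    participant = parts[3],                # patient number
    sample_type = substr(parts[4], 1, 2),  # two-digit sample type code
    vial        = substr(parts[4], 3, 3),
    portion     = substr(parts[5], 1, 2),
    analyte     = substr(parts[5], 3, 3),  # e.g. D = DNA, R = RNA
    plate       = parts[6],
    center      = parts[7],
    patient_id  = paste(parts[1:3], collapse = "-"),
    sample_id   = paste(c(parts[1:3], substr(parts[4], 1, 2)), collapse = "-")
  )
}

fields <- parse_tcga_barcode("TCGA-G4-6317-02A-11D-2064-05")
fields$patient_id   # "TCGA-G4-6317"
fields$sample_id    # "TCGA-G4-6317-02"
fields$sample_type  # "02"
fields$analyte      # "D"
```

Because the fields sit at fixed positions, the same information can also be pulled from a vector of barcodes with substr(), e.g. substr(x, 1, 12) for the patient ID and substr(x, 14, 15) for the sample type code.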

query <- GDCquery(project = c("TCGA-BRCA"),
                  legacy = FALSE, #default(GDC harmonized database)
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts")
dim(getResults(query))
#[1] 1222   29
query_info = getResults(query)
TP = TCGAquery_SampleTypes(query_info$sample.submitter_id,"TP")
NT = TCGAquery_SampleTypes(query_info$sample.submitter_id,"NT")
query <- GDCquery(project = c("TCGA-BRCA"),
                  legacy = FALSE, #default(GDC harmonized database)
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts",
                  barcode = c(TP, NT))
dim(getResults(query))
#[1] 1215   29

Pair_sample = TCGAquery_MatchedCoupledSampleTypes(query_info$sample.submitter_id,c("NT","TP"))
query <- GDCquery(project = c("TCGA-BRCA"),
                  legacy = FALSE, #default(GDC harmonized database)
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts",
                  barcode = Pair_sample)
dim(getResults(query))
#[1] 229  29
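
To make the idea of paired samples concrete, here is a minimal base-R sketch of selecting patients that have both a primary tumor (01) and a solid normal (11) sample. It is a stand-in for illustration, not the implementation of TCGAquery_MatchedCoupledSampleTypes(), and the sample IDs are made up.

```r
# Toy sample IDs (hypothetical, for illustration only).
sample_ids <- c("TCGA-AA-0001-01A", "TCGA-AA-0001-11A",
                "TCGA-AA-0002-01A", "TCGA-AA-0003-11A")

patient   <- substr(sample_ids, 1, 12)   # patient ID
type_code <- substr(sample_ids, 14, 15)  # two-digit sample type code

# Patients that contribute at least one tumor and one normal sample.
has_tp <- unique(patient[type_code == "01"])
has_nt <- unique(patient[type_code == "11"])
paired_patients <- intersect(has_tp, has_nt)

# Keep only the tumor/normal samples of those patients.
paired_samples <- sample_ids[patient %in% paired_patients &
                               type_code %in% c("01", "11")]
paired_samples
```

Only TCGA-AA-0001 has both sample types here, so only its two samples are kept.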

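These are the common criteria for querying TCGA target data. A few parameters are not covered here; see the function's help page. Set the parameters above flexibly according to your goal.

2. Query examples

2.1 Cholangiocarcinoma transcriptome data | hg19 | all samples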

TCGAbiolinks:::getProjectSummary("TCGA-CHOL",legacy = TRUE)$data_categories
#   file_count case_count               data_category
# 1         30         30          Protein expression
# 2        680         36       Copy number variation
# 3         51         51                 Biospecimen
# 4        444         36 Simple nucleotide variation
# 5        450         36             Gene expression
# 6        686         36         Raw microarray data
# 7         45         36             DNA methylation
# 8        193         51                    Clinical
# 9        365         51         Raw sequencing data
query <- GDCquery(project = "TCGA-CHOL",
                  legacy = TRUE,
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  platform = "Illumina HiSeq", 
                  file.type  = "normalized_results")
dim(getResults(query))
#[1] 45 32
t(getResults(query)[1,])
#                       1                                                                                   
# id                    "34216957-50e3-434c-8c38-72f0f2ddcf16"                                              
# data_format           "TXT"                                                                               
# access                "open"                                                                              
# cases                 "TCGA-3X-AAV9-01A-72R-A41I-07"                                                      
# file_name             "unc.edu.59012a78-0e8f-4b99-af97-0dbb1d3d0513.2538862.rsem.genes.normalized_results"
# submitter_id          NA                                                                                  
# data_category         "Gene expression"                                                                   
# type                  "file"                                                                              
# file_size             437196                                                                              
# platform              "Illumina HiSeq"                                                                    
# state_comment         NA                                                                                  
# tags                  character,3                                                                         
# updated_datetime      "2017-03-05T10:11:44.298823-06:00"                                                  
# md5sum                "23836c9f9bdb053c567d91a67b62159d"                                                  
# file_id               "34216957-50e3-434c-8c38-72f0f2ddcf16"                                              
# data_type             "Gene expression quantification"                                                    
# state                 "live"                                                                              
# experimental_strategy "RNA-Seq"                                                                           
# file_state            "submitted"                                                                         
# version               "1"                                                                                 
# data_release          "0.0 - 29.0"                                                                        
# project               "TCGA-CHOL"                                                                         
# center_id             "ee7a85b3-8177-5d60-a10c-51180eb9009c"                                              
# center_center_type    "CGCC"                                                                              
# center_code           "07"                                                                                
# center_name           "University of North Carolina"                                                      
# center_namespace      "unc.edu"                                                                           
# center_short_name     "UNC"                                                                               
# sample_type           "Primary Tumor"                                                                     
# is_ffpe               FALSE                                                                               
# cases.submitter_id    "TCGA-3X-AAV9"                                                                      
# sample.submitter_id   "TCGA-3X-AAV9-01A"

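2.2 Lung adenocarcinoma transcriptome data | hg38 | primary tumor + normal tissue

TCGAbiolinks:::getProjectSummary("TCGA-LUAD",legacy = FALSE)$data_categories
# 4       2916        519     Transcriptome Profiling
# Note: no sample.type filter is set below, so every sample type is returned.
# Add sample.type = c("Primary Tumor", "Solid Tissue Normal") to keep only those two.
query <- GDCquery(project = "TCGA-LUAD",
                  legacy = FALSE,
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts")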
dim(getResults(query))
#[1] 594  29

2.3 Breast cancer methylation data | hg19 | Illumina Human Methylation 450 platform

TCGAbiolinks:::getProjectSummary("TCGA-BRCA",legacy = TRUE)$data_categories
#7       1250       1097             DNA methylation
query <- GDCquery(project = "TCGA-BRCA",
                  legacy = TRUE,
                  data.category = "DNA methylation",
                  platform = "Illumina Human Methylation 450")
dim(getResults(query))
#[1] 895  32

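II. Downloading the data selected by the query

GDCdownload() is simple to use: pass it the query obtained in the previous step.
It offers two download methods, api and client. The former is faster but sometimes unstable; the latter is slower. The api method (the default) is recommended. For large downloads, set files.per.chunk = n to download in batches of n files each, so that an error partway through does not waste the work already done.
directory sets the destination folder; by default a GDCdata folder is created and used.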

GDCdownload(
  query,
  token.file,
  method = "api",
  directory = "GDCdata",
  files.per.chunk = NULL
)

Example

query <- GDCquery(project = "TCGA-CHOL",
                  legacy = TRUE,
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  platform = "Illumina HiSeq", 
                  file.type  = "normalized_results")
GDCdownload(query, files.per.chunk = 10)
# Downloading data for project TCGA-CHOL
# GDCdownload will download 45 files. A total of 19.580796 MB
# Downloading chunk 1 of 5 (10 files, size = 4.351703 MB) as Wed_Aug_18_21_52_08_2021_0.tar.gz
# Downloading: 1.9 MB     Downloading chunk 2 of 5 (10 files, size = 4.350318 MB) as Wed_Aug_18_21_52_08_2021_1.tar.gz
# Downloading: 1.8 MB     Downloading chunk 3 of 5 (10 files, size = 4.351067 MB) as Wed_Aug_18_21_52_08_2021_2.tar.gz
# Downloading: 1.8 MB     Downloading chunk 4 of 5 (10 files, size = 4.353528 MB) as Wed_Aug_18_21_52_08_2021_3.tar.gz
# Downloading: 1.9 MB     Downloading chunk 5 of 5 (5 files, size = 2.17418 MB) as Wed_Aug_18_21_52_08_2021_4.tar.gz
# Downloading: 900 kB
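
The chunk counts in the log above follow from simple arithmetic, which this base-R sketch reproduces. It only illustrates how 45 files fall into batches of 10; it is not the internal code of GDCdownload().

```r
# How 45 files split into chunks of at most 10 files each.
n_files         <- 45
files_per_chunk <- 10

file_index <- seq_len(n_files)
chunks <- split(file_index, ceiling(file_index / files_per_chunk))

length(chunks)           # number of chunks: 5
unname(lengths(chunks))  # files per chunk: 10 10 10 10 5
```

A smaller files.per.chunk means more, smaller requests, so a failed batch costs less to repeat.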

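III. Reading the downloaded files into the current session

GDCprepare() takes the query object and the directory where the data were stored (GDCdata by default), reads the data, and returns them as a SummarizedExperiment object.
You can also set save = TRUE together with save.filename so that, after reading, the SummarizedExperiment object is automatically saved as an RData file for later use (save defaults to FALSE).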

query <- GDCquery(project = "TCGA-CHOL",
                  legacy = TRUE,
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  platform = "Illumina HiSeq", 
                  file.type  = "normalized_results")
GDCdownload(query, files.per.chunk = 10)
data <- GDCprepare(query, save = T, save.filename = "CHOL_RNAseq.rda")
# -------------------
#   oo Reading 45 files
# -------------------
#   |=================================================|100%                      Completed after 0 s 
# -------------------
#   oo Merging 45 files
# -------------------
#   Starting to add information to samples
# => Add clinical information to samples
# => Adding TCGA molecular information from marker papers
# => Information will have prefix 'paper_' 
# chol subtype information from:doi:10.1016/j.celrep.2017.02.033
# => Saving file: CHOL_RNAseq.rda
# => File saved

While reading the data, GDCprepare() automatically annotates sample and gene information, although this is not yet supported for every data type.

library(SummarizedExperiment)
# Expression matrix
dim(assay(data))
#[1] 19947    45
assays(data)
# List of length 1
# names(1): normalized_count
assay(data, "normalized_count")[1:4,1:4]
#       TCGA-3X-AAV9-01A-72R-A41I-07 TCGA-3X-AAVC-01A-21R-A41I-07 TCGA-W5-AA2R-11A-11R-A41I-07 TCGA-ZH-A8Y4-01A-11R-A41I-07
# A1BG                      70.9581                      29.9768                  108409.2249                    1485.0630
# A2M                    23986.2548                    8129.6961                   98095.2358                    7119.1570
# NAT1                      72.4007                      52.8682                     160.2275                      76.5504
# NAT2                       8.7099                       0.0000                    1472.3868                      23.2558

# Sample (clinical) information
dim(colData(data))
#[1]  45 205
colData(data)[1:4,1:4]
# DataFrame with 4 rows and 4 columns
#                                         barcode      patient           sample shortLetterCode
#                                         <character>  <character>      <character>     <character>
# TCGA-3X-AAV9-01A-72R-A41I-07 TCGA-3X-AAV9-01A-72R.. TCGA-3X-AAV9 TCGA-3X-AAV9-01A              TP
# TCGA-3X-AAVC-01A-21R-A41I-07 TCGA-3X-AAVC-01A-21R.. TCGA-3X-AAVC TCGA-3X-AAVC-01A              TP
# TCGA-W5-AA2R-11A-11R-A41I-07 TCGA-W5-AA2R-11A-11R.. TCGA-W5-AA2R TCGA-W5-AA2R-11A              NT
# TCGA-ZH-A8Y4-01A-11R-A41I-07 TCGA-ZH-A8Y4-01A-11R.. TCGA-ZH-A8Y4 TCGA-ZH-A8Y4-01A              TP

# Gene ID types
dim(rowData(data))
#[1] 19947     3
rowData(data)[1:6,1:3]
# DataFrame with 6 rows and 3 columns
#                   gene_id entrezgene ensembl_gene_id
#                   <character>  <integer>     <character>
# A1BG                 A1BG          1 ENSG00000121410
# A2M                   A2M          2 ENSG00000175899
# NAT1                 NAT1          9 ENSG00000171428
# NAT2                 NAT2         10 ENSG00000156006
# RP11-986E7.7 RP11-986E7.7         12 ENSG00000273259
# AADAC               AADAC         13 ENSG00000114771


# Genomic coordinates of the genes
rowRanges(data)
# GRanges object with 19947 ranges and 3 metadata columns:
#           seqnames              ranges strand |      gene_id entrezgene ensembl_gene_id
#         <Rle>           <IRanges>  <Rle> |  <character>  <integer>     <character>
# A1BG    chr19   58856544-58864865      - |         A1BG          1 ENSG00000121410
# A2M    chr12     9220260-9268825      - |          A2M          2 ENSG00000175899
# NAT1     chr8   18027986-18081198      + |         NAT1          9 ENSG00000171428
# NAT2     chr8   18248755-18258728      + |         NAT2         10 ENSG00000156006
# RP11-986E7.7    chr14   95058395-95090983      + | RP11-986E7.7         12 ENSG00000273259
# ...      ...                 ...    ... .          ...        ...             ...
# RASAL2-AS1     chr1 178060643-178063119      - |   RASAL2-AS1  100302401 ENSG00000224687
# LINC00882     chr3 106555658-106959488      - |    LINC00882  100302640 ENSG00000242759
# FTX     chrX   73183790-73513409      - |          FTX  100302692 ENSG00000230590
# TICAM2     chr5 114914339-114961876      - |       TICAM2  100302736 ENSG00000243414
# SLC25A5-AS1     chrX 118599997-118603061      - |  SLC25A5-AS1  100303728 ENSG00000224281
# -------
# seqinfo: 24 sequences from an unspecified genome; no seqlengths
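
As a hedged illustration of what usually comes next, the sketch below splits an expression matrix into tumor and normal columns using the sample type code in the barcode. It uses a plain base-R matrix rebuilt from the four samples and two genes printed above, so it runs without Bioconductor; the object names expr, tumor, and normal are illustrative.

```r
# Toy matrix rebuilt from the values printed above (2 genes x 4 samples).
expr <- matrix(
  c(70.9581,     23986.2548,   # TCGA-3X-AAV9-01A...
    29.9768,     8129.6961,    # TCGA-3X-AAVC-01A...
    108409.2249, 98095.2358,   # TCGA-W5-AA2R-11A...
    1485.0630,   7119.1570),   # TCGA-ZH-A8Y4-01A...
  nrow = 2,
  dimnames = list(
    c("A1BG", "A2M"),
    c("TCGA-3X-AAV9-01A-72R-A41I-07", "TCGA-3X-AAVC-01A-21R-A41I-07",
      "TCGA-W5-AA2R-11A-11R-A41I-07", "TCGA-ZH-A8Y4-01A-11R-A41I-07")
  )
)

# Characters 14-15 of the barcode hold the sample type code.
type_code <- substr(colnames(expr), 14, 15)
group <- ifelse(type_code == "01", "TP",
                ifelse(type_code == "11", "NT", NA))

tumor  <- expr[, group %in% "TP", drop = FALSE]
normal <- expr[, group %in% "NT", drop = FALSE]

table(group)
```

With the real SummarizedExperiment object, the shortLetterCode column of colData(data) already carries this grouping, so subsetting such as data[, data$shortLetterCode == "TP"] should give the same split.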

That is the whole workflow of finding, downloading, and reading the data; the analysis can start from here.

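Supplement: patient clinical data and tumor subtypes

1. Getting patient clinical data

As noted above, GDCprepare() automatically annotates samples with clinical information.
The clinical data for each patient can also be downloaded separately beforehand for reference.

Method 1: GDCquery() pipeline

query <- GDCquery(project = "TCGA-BRCA", 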
                  data.category = "Clinical",
                  data.type = "Clinical Supplement", 
                  data.format = "BCR Biotab")
GDCdownload(query, files.per.chunk = 20)
clinical.BCRtab.all <- GDCprepare(query)


grep("clinical_", names(clinical.BCRtab.all), value = T)
# [1] "clinical_drug_brca"               "clinical_omf_v4.0_brca"          
# [3] "clinical_follow_up_v4.0_brca"     "clinical_follow_up_v1.5_brca"    
# [5] "clinical_follow_up_v4.0_nte_brca" "clinical_patient_brca"           
# [7] "clinical_radiation_brca"          "clinical_nte_brca"               
# [9] "clinical_follow_up_v2.1_brca" 
clinical_patient_brca = as.data.frame(clinical.BCRtab.all$clinical_patient_brca)
clinical_patient_brca[1:4,1:4]
#                       bcr_patient_uuid bcr_patient_barcode form_completion_date                  prospective_collection
# 1                     bcr_patient_uuid bcr_patient_barcode form_completion_date tissue_prospective_collection_indicator
# 2                              CDE_ID:      CDE_ID:2003301              CDE_ID:                          CDE_ID:3088492
# 3 6E7D5EC6-A469-467C-B748-237353C23416        TCGA-3C-AAAU            2014-1-13                                      NO
# 4 55262FCB-1B01-4480-B322-36570430C917        TCGA-3C-AALI            2014-7-28                                      NO
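
As the output above shows, the first two rows of a BCR Biotab table repeat the column names and list CDE IDs rather than patients. A minimal base-R sketch for dropping them follows; the toy data frame copies three columns of the printed output, and the object names raw and clean are illustrative.

```r
# Toy copy of the first rows/columns printed above.
raw <- data.frame(
  bcr_patient_uuid = c("bcr_patient_uuid", "CDE_ID:",
                       "6E7D5EC6-A469-467C-B748-237353C23416",
                       "55262FCB-1B01-4480-B322-36570430C917"),
  bcr_patient_barcode = c("bcr_patient_barcode", "CDE_ID:2003301",
                          "TCGA-3C-AAAU", "TCGA-3C-AALI"),
  form_completion_date = c("form_completion_date", "CDE_ID:",
                           "2014-1-13", "2014-7-28"),
  stringsAsFactors = FALSE
)

# Keep only rows whose barcode column holds a real TCGA patient barcode.
is_patient_row <- grepl("^TCGA-", raw$bcr_patient_barcode)
clean <- raw[is_patient_row, , drop = FALSE]
rownames(clean) <- NULL
clean
```

Filtering on the barcode pattern rather than dropping rows 1 and 2 by position keeps the step safe if a table happens to have a different number of header rows.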

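Method 2: GDCquery_clinic()

According to the official documentation, this function downloads the indexed clinical data: "a refined clinical data that is created using the XML files". The BCR Biotab files used in Method 1 are likewise parsed from those XML files.
This method downloads faster, so try it first and fall back to Method 1 if the information you need is missing.

clinical <- GDCquery_clinic(project = "TCGA-BRCA", type = "clinical")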
clinical[1:4,1:4]
#   submitter_id synchronous_malignancy ajcc_pathologic_stage tumor_stage
# 1 TCGA-E2-A14U                     No               Stage I     stage i
# 2 TCGA-E9-A1RC                     No            Stage IIIC  stage iiic
# 3 TCGA-D8-A1J9                     No              Stage IA    stage ia
# 4 TCGA-E2-A14P                     No            Stage IIIC  stage iiic

2. Getting patient tumor subtypes

PanCancerAtlas_subtypes()
The column "Subtype_Selected" was selected as the most prominent subtype classification (from the other columns).

subtypes <- PanCancerAtlas_subtypes()
dim(subtypes)
#[1] 7734   10
table(subtypes$cancer.type)
# ACC  AML BLCA BRCA COAD ESCA  GBM HNSC KICH KIRC KIRP  LGG LIHC LUAD LUSC OVCA PCPG 
# 91  187  129 1218  341  169  606  279   66  442  161  516  196  230  178  489  178 
# PRAD READ SKCM STAD THCA UCEC  UCS 
# 333  118  333  383  496  538   57
head(as.data.frame(subtypes))
#   pan.samplesID cancer.type                         Subtype_mRNA   Subtype_DNAmeth Subtype_protein Subtype_miRNA Subtype_CNA Subtype_Integrative Subtype_other      Subtype_Selected
# 1  TCGA-OR-A5J1         ACC steroid-phenotype-high+proliferation         CIMP-high              NA       miRNA_1       Quiet                COC3           C1A         ACC.CIMP-high
# 2  TCGA-OR-A5J2         ACC steroid-phenotype-high+proliferation          CIMP-low               1       miRNA_1       Noisy                COC3           C1A          ACC.CIMP-low
# 3  TCGA-OR-A5J3         ACC               steroid-phenotype-high CIMP-intermediate               3       miRNA_6 Chromosomal                COC2           C1A ACC.CIMP-intermediate
# 4  TCGA-OR-A5J4         ACC                                 <NA>         CIMP-high              NA       miRNA_6 Chromosomal                <NA>          <NA>         ACC.CIMP-high
# 5  TCGA-OR-A5J5         ACC               steroid-phenotype-high CIMP-intermediate              NA       miRNA_2 Chromosomal                COC2           C1A ACC.CIMP-intermediate
# 6  TCGA-OR-A5J6         ACC                steroid-phenotype-low          CIMP-low               2       miRNA_1       Noisy                COC1           C1B          ACC.CIMP-low
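
A subtype table like this is typically attached to samples by patient ID. The base-R sketch below shows one way to do that under stated assumptions: the subtype rows are copied from the output above, the sample barcodes are hypothetical, and both sides are truncated to the 12-character patient ID in case pan.samplesID holds longer barcodes for some cancer types.

```r
# Three subtype rows copied from the output above.
subtypes <- data.frame(
  pan.samplesID    = c("TCGA-OR-A5J1", "TCGA-OR-A5J2", "TCGA-OR-A5J3"),
  Subtype_Selected = c("ACC.CIMP-high", "ACC.CIMP-low", "ACC.CIMP-intermediate"),
  stringsAsFactors = FALSE
)

# Hypothetical sample barcodes; the last patient has no subtype entry.
samples <- data.frame(
  barcode = c("TCGA-OR-A5J1-01A-11R-A29S-07",
              "TCGA-OR-A5J3-01A-11R-A29S-07",
              "TCGA-OR-ZZZZ-01A-11R-A29S-07"),
  stringsAsFactors = FALSE
)

# Match on the 12-character patient ID on both sides.
samples$patient <- substr(samples$barcode, 1, 12)
idx <- match(samples$patient, substr(subtypes$pan.samplesID, 1, 12))
samples$subtype <- subtypes$Subtype_Selected[idx]
samples
```

match() keeps the sample order and leaves NA where no subtype is available, which makes unannotated samples easy to spot.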

TCGAquery_subtype()
These subtypes will be automatically added in the summarizedExperiment object through GDCprepare. But you can also use the TCGAquery_subtype function to retrieve this information.

brca.subtype <- TCGAquery_subtype(tumor = "brca")
t(brca.subtype[1,])
#                                     [,1]          
# patient                             "TCGA-3C-AAAU"
# Tumor.Type                          "BRCA"        
# Included_in_previous_marker_papers  "NO"          
# vital_status                        "Alive"       
# days_to_birth                       "-20211"      
# days_to_death                       "NA"          
# days_to_last_followup               "4047"        
# age_at_initial_pathologic_diagnosis "55"          
# pathologic_stage                    "NA"          
# Tumor_Grade                         "NA"          
# BRCA_Pathology                      "NA"          
# BRCA_Subtype_PAM50                  "LumA"        
# MSI_status                          "NA"          
# HPV_Status                          "NA"          
# tobacco_smoking_history             "NA"          
# CNV Clusters                        "C6"          
# Mutation Clusters                   "C7"          
# DNA.Methylation Clusters            "C1"          
# mRNA Clusters                       "C1"          
# miRNA Clusters                      "C3"          
# lncRNA Clusters                     "NA"          
# Protein Clusters                    "NA"          
# PARADIGM Clusters                   "C5"          
# Pan-Gyn Clusters                    "NA"

GDCquery_Maf() can download mutation data; it is not covered here and is left for a later note.

Last edited: 2021.12.28 01:21:03
