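Working through the examples in the topGO manual
Manual: http://bioconductor.uib.no/2.7/bioc/vignettes/topGO/inst/doc/topGO.pdf
For a quick-start walkthrough, see: https://rpubs.com/aemoore62/TopGo_colMap_Func_Troubleshoot
1. Importing genes and annotation data
Loading topGO with library(topGO) automatically creates three environments: GOBPTerm, GOCCTerm and GOMFTerm. They are built from the GOTERM environment of the GO.db package and make the GO terms of each ontology easy to load.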
library(topGO)
BPterms <- ls(GOBPTerm)
head(BPterms)
## [1] "GO:0000001" "GO:0000002" "GO:0000003" "GO:0000011" "GO:0000012" "GO:0000017"
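The genefilter package
The genefilter package filters genes. Its first argument is a matrix or an ExpressionSet object. Its second argument, flist, takes a set of functions combined with filterfun; each function must accept a vector (one row of the matrix) and return a logical value. pOverA is a filter constructor built into genefilter: p is a proportion and A is a threshold, and the resulting filter keeps a row only if at least a proportion p of its values exceed A.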
library(genefilter)
ff<- pOverA(p=.1, 10)
flist <- filterfun(ff)
set.seed(2018-4-24)
exprA <- matrix(rnorm(1000, 7), ncol = 10)
ans <- genefilter(exprA,
flist=filterfun(ff,
function(x) any(x<6)))
exprB <- exprA[ans,]
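To make the rule behind pOverA concrete, here is a minimal base-R sketch of the same idea. The helper name p_over_a and the toy matrix are illustrative assumptions and are not part of genefilter; the sketch only mirrors the "at least a proportion p of the values above A" rule described above.

```r
# Hypothetical stand-in for genefilter::pOverA, written in base R for illustration.
p_over_a <- function(p, A) {
  function(x) mean(x > A) >= p
}

# Toy expression matrix: 3 genes (rows) x 4 samples (columns).
toy <- rbind(g1 = c(12, 3, 4, 5),    # 1 of 4 values above 10
             g2 = c(1, 2, 3, 4),     # no value above 10
             g3 = c(11, 15, 2, 20))  # 3 of 4 values above 10

# Keep a gene only if at least half of its values exceed 10.
keep <- apply(toy, 1, p_over_a(p = 0.5, A = 10))
keep
toy[keep, , drop = FALSE]
```

Only g3 passes here, because it is the only row where half or more of the values exceed the threshold.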
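2. The topGOdata object
Three pieces of data are needed to build a topGOdata object:
- A list of gene identifiers, optionally with a score for each gene (such as a p-value, a t-statistic or a differential-expression value)
- A mapping between these identifiers and GO terms. When the identifiers are microarray probe IDs, a Bioconductor chip annotation package such as hgu95av2.db can be used directly
- The GO hierarchy, which is taken from the GO.db package; topGO only supports the hierarchy defined by GO.db
Arguments for building a topGOdata object:
- ontology: character string naming the ontology of interest, one of "BP", "MF" or "CC"
- description: character string briefly describing the study
- allGenes: a named vector, either numeric or a factor; the names are the gene identifiers, and the vector as a whole defines the gene universe
- geneSelectionFun: a function that selects the significant genes from the values in allGenes. It is required when allGenes is a numeric vector and can be omitted when allGenes is a factor with levels 0 and 1
- nodeSize: prunes the GO terms that have fewer than nodeSize annotated genes
- annotationFun (passed as annot in the calls below): the annotation function that maps gene identifiers to GO terms. The options are:
  - annFUN.db: takes the annotation from an installed chip package such as hgu95av2.db
    - with annFUN.db, also pass affyLib, set to the name of the chip annotation package
  - annFUN.org: takes the annotation from an installed org.XXX package; it currently supports Entrez, GenBank, Alias, Ensembl, Gene Symbol, GeneName and UniGene identifiers
    - with annFUN.org, also pass arguments such as mapping="org.Hs.eg.db", ID="Ensembl"
  - annFUN.gene2GO: used when the user supplies gene-to-GOs annotation data
    - with annFUN.gene2GO, also pass gene2GO, set to the list that was read in
  - annFUN.GO2genes: used when the user supplies GO-to-genes annotation data
    - with annFUN.GO2genes, also pass GO2genes, set to the list that was read in
  - annFUN.file: reads the annotation from a file, either a gene2GO or a GO2genes text file
    - with annFUN.file, also pass file, the path to a text file of IDs and GO terms (multiple GO terms separated by commas)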
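The two accepted forms of allGenes can be sketched in base R as follows. The gene IDs, scores and the helper name select_sig are made up for illustration; only the shape of the objects (a named 0/1 factor, or a named numeric vector plus a selection function) follows the description above.

```r
# Illustrative gene universe; the IDs and scores below are invented.
genes <- c("geneA", "geneB", "geneC", "geneD", "geneE")

# Form 1: factor with levels 0/1, where 1 marks a gene of interest.
interesting <- c("geneB", "geneE")
all_genes_factor <- factor(as.integer(genes %in% interesting), levels = c(0, 1))
names(all_genes_factor) <- genes

# Form 2: numeric scores (here p-values) plus a selection function.
all_genes_score <- c(geneA = 0.20, geneB = 0.001, geneC = 0.70,
                     geneD = 0.04, geneE = 0.03)
select_sig <- function(score) score < 0.05  # plays the role of geneSelectionFun

table(all_genes_factor)
names(all_genes_score)[select_sig(all_genes_score)]
```

With the factor form the significant genes are fixed in advance; with the numeric form the cutoff lives in the selection function and can be changed without rebuilding the scores.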
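2.1 Building the annotation (annotations can be restricted to chosen evidence codes)
2.1.1 Downloading the GO annotation from NCBI and converting it into a list that topGO understands
Download and decompress: https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz
The file maps Entrez Gene IDs to GO terms for all species; the taxonomy ID for human is "9606".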
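library(data.table)
# Keep only the human rows (tax_id 9606); fread runs the shell command given in cmd
geneID2go <- fread(cmd = 'grep "^9606\t" NCBI_data/gene2go')
# Take the column names from the header line of the file
colnames(geneID2go) <- unlist(read.delim("NCBI_data/gene2go",
header=FALSE, nrows=1,
stringsAsFactors=FALSE))
# One character vector of GO IDs per Entrez Gene ID
geneID2go_list <- by(geneID2go$GO_ID, geneID2go$GeneID,
function(x) as.character(x))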
head(geneID2go_list)
## $`1`
## [1] "GO:0002576" "GO:0003674" "GO:0005576" "GO:0005576" "GO:0005615"
## [6] "GO:0008150" "GO:0031012" "GO:0031093" "GO:0034774" "GO:0043312"
## [11] "GO:0070062" "GO:0072562" "GO:1904813"
library(topGO)
goID2gene_list <- inverseList(geneID2go_list)
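The output above lists GO:0005576 twice for the same gene, which suggests that duplicates survive the grouping step. As a minimal base-R sketch (with an invented toy table, not the NCBI file), the gene-to-GO list can be built with split and unique, and then inverted by hand in the way inverseList is used above:

```r
# Toy gene-to-GO table; the rows are invented for illustration (note the duplicate).
tab <- data.frame(GeneID = c("1", "1", "1", "2", "2"),
                  GO_ID  = c("GO:0005576", "GO:0005576", "GO:0003674",
                             "GO:0003674", "GO:0008150"),
                  stringsAsFactors = FALSE)

# Gene -> GO list, dropping GO IDs repeated for the same gene.
gene2go <- lapply(split(tab$GO_ID, tab$GeneID), unique)

# GO -> gene list: repeat each gene name once per GO ID, then regroup by GO ID.
go2gene <- split(rep(names(gene2go), lengths(gene2go)),
                 unlist(gene2go, use.names = FALSE))

gene2go
go2gene
```

Both results are plain named lists of character vectors, which is the shape described above for the gene2GO and GO2genes arguments.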
2.1.2 Downloading the GO annotation from the Gene Ontology website
Download and decompress: http://geneontology.org/gene-associations/goa_human.gaf.gz
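library(data.table)
# awk splits on whitespace, so $4 is the GO ID only when the Qualifier column is empty;
# grep "GO:" then drops the rows whose Qualifier is filled (e.g. NOT)
geneSymbol2go <- fread(
cmd = 'awk \'{print $3 "," $4}\' gene_ontology_data/goa_human.gaf | grep "GO:"',
header=FALSE, sep=",")
geneSymbol2go_list <- by(geneSymbol2go$V2, geneSymbol2go$V1,
function(x) as.character(x))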
head(geneSymbol2go_list)
## $A0A075B6Q4
## [1] "GO:0000056" "GO:0005634" "GO:0030688" "GO:0031902" "GO:0034448" "GO:0042274"
library(topGO)
goSymbol2gene_list <- inverseList(geneSymbol2go_list)
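The awk command above relies on the Qualifier column being empty, so it would return nothing for a GAF file whose Qualifier column is always populated. A more explicit alternative is to read the file as tab-separated text and pick the symbol and GO ID columns by position. The sketch below uses three invented rows in GAF column order (only the first five columns are shown) rather than the real goa_human.gaf; the column names are labels chosen here for readability.

```r
# Three made-up rows in GAF column order; lines starting with "!" are header comments.
gaf_text <- paste(
  "!illustrative header line",
  "UniProtKB\tP00001\tGENE1\t\tGO:0005634",
  "UniProtKB\tP00001\tGENE1\tNOT\tGO:0005737",
  "UniProtKB\tP00002\tGENE2\t\tGO:0005634",
  sep = "\n")

gaf <- read.delim(text = gaf_text, header = FALSE, comment.char = "!",
                  quote = "", stringsAsFactors = FALSE,
                  col.names = c("DB", "ID", "Symbol", "Qualifier", "GO_ID"))

# Drop negated annotations, then build the symbol -> GO list.
gaf <- gaf[!grepl("NOT", gaf$Qualifier), ]
symbol2go <- lapply(split(gaf$GO_ID, gaf$Symbol), unique)
symbol2go
```

For the real file, the same call would take the file path instead of text = and would need column names for all columns or none.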
2.1.3 Using the example data with the readMappings function
readMappings is a function in the topGO package. It reads a file in the following format:
068724 GO:0005488, GO:0003774, GO:0001539, GO:0006935, GO:0009288
119608 GO:0005634, GO:0030528, GO:0006355, GO:0045449, GO:0003677, GO:0007275
133103 GO:0015031, GO:0005794, GO:0016020, GO:0017119, GO:0000139
121005 GO:0005576
155158 GO:0005488
Note: the geneid2go example data shipped with topGO contains only 100 genes
file = system.file("examples/geneid2go.map",
package="topGO")
gene2GO_data <- readMappings(file)
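For readers who want to see what such a mapping file turns into, here is a minimal base-R sketch that parses two lines in the layout shown above. It assumes the gene ID and the GO list are separated by a tab and the GO IDs by commas; the variable name gene2GO_sketch is made up, and this is an illustration of the format, not the readMappings implementation.

```r
# Lines in the "gene<TAB>GO, GO, ..." layout shown above.
lines <- c("068724\tGO:0005488, GO:0003774, GO:0001539",
           "121005\tGO:0005576")

# Split each line into the gene ID and the comma-separated GO list.
parts   <- strsplit(lines, "\t", fixed = TRUE)
gene_id <- vapply(parts, `[`, character(1), 1)
go_ids  <- lapply(parts, function(p) trimws(strsplit(p[2], ",", fixed = TRUE)[[1]]))

# Named list: one character vector of GO IDs per gene.
gene2GO_sketch <- setNames(go_ids, gene_id)
gene2GO_sketch
```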
2.2 Reading in all genes (from a sequencing or microarray study), without scores
geneNames <- names(geneSymbol2go_list)
tail(geneNames)
## [1] "ZXDC" "ZYG11A" "ZYG11B" "ZYX" "ZZEF1" "ZZZ3"
Suppose the genes on our array are exactly those in the geneNames vector, and 200 of them turned out to be differentially expressed:
set.seed(2018-04-25)
myInterestingGenes <- sample(geneNames, 200)
tail(myInterestingGenes)
## [1] "PARG" "MIOX" "ERGIC2" "CHST10" "FOXR2" "FAM89B"
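The geneList_nscore variable is built as follows. It covers all genes, with value 1 for differentially expressed genes and 0 otherwise, and its names are the gene IDs: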
geneList_nscore <- factor(as.integer(geneNames %in% myInterestingGenes))
names(geneList_nscore) <- geneNames
tail(geneList_nscore)
## ZXDC ZYG11A ZYG11B ZYX ZZEF1 ZZZ3
## 0 0 0 0 0 0
## Levels: 0 1
2.4 Reading in all genes with scores
library(ALL)
data(ALL)
library(genefilter)
selectedProbes <- genefilter(ALL, filterfun(pOverA(0.2, log2(100)),
function(x) (IQR(x) > 0.25)))
eset <- ALL[selectedProbes, ]
y <- as.integer((sapply(eset$BT,
function(x) return(substr(x, 1, 1)=="T"))))
y
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [45] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [89] 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## The multtest package must be installed for this function to work; it returns FDR-adjusted p-values
geneList <- getPvalues(exprs(eset), classlabel=y, alternative="two.sided")
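The values of geneList are p-values and its names are probe IDs. With geneList in place, define the function that selects the differentially expressed genes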
diffGenesFUN <- function(allScore) {
return(allScore < 0.05)
}
x <- diffGenesFUN(geneList)
sum(x)
## [1] 1018
1018 of the genes in geneList are differentially expressed
2.3 Building the topGOdata object
GOdata_nscore_MF <- new("topGOdata",
ontology="MF",
allGenes=geneList_nscore,
annot=annFUN.gene2GO, gene2GO=geneSymbol2go_list,
nodeSize=5)
GOdata_nscore_MF
##
## ------------------------- topGOdata object -------------------------
##
## Description:
## -
##
## Ontology:
## - MF
##
## 20057 available genes (all genes from the array):
## - symbol: A0A075B6Q4 A0A087WUJ7 A0A087WUU8 A0A087WUV0 A0A087WV48 ...
## - 200 significant genes.
##
## 17630 feasible genes (genes that can be used in the analysis):
## - symbol: A0A087WUJ7 A0A087WUU8 A0A087WUV0 A0A087WV48 A0A087WW49 ...
## - 171 significant genes.
##
## GO graph (nodes with at least 5 genes):
## - a graph with directed edges
## - number of nodes = 1867
## - number of edges = 2501
##
## ------------------------- topGOdata object -------------------------
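This shows that the topGOdata object contains 20057 genes in total, 200 of them differentially expressed; 17630 genes have a usable annotation, including 171 of the differentially expressed ones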
GOdata_score_MF <- new("topGOdata",
ontology = "MF",
allGenes = geneList,
geneSelectionFun = diffGenesFUN,
annot = annFUN.db, affyLib = "hgu95av2.db",
nodeSize = 5)
GOdata_score_MF
##
## ------------------------- topGOdata object -------------------------
##
## Description:
## -
##
## Ontology:
## - MF
##
## 4101 available genes (all genes from the array):
## - symbol: 1000_at 1005_at 1007_s_at 1008_f_at 1009_at ...
## - score : 0.0068656 0.53047 2.148059e-09 1 0.00022923 ...
## - 1018 significant genes.
##
## 3875 feasible genes (genes that can be used in the analysis):
## - symbol: 1000_at 1005_at 1007_s_at 1008_f_at 1009_at ...
## - score : 0.0068656 0.53047 2.148059e-09 1 0.00022923 ...
## - 974 significant genes.
##
## GO graph (nodes with at least 5 genes):
## - a graph with directed edges
## - number of nodes = 993
## - number of edges = 1289
##
## ------------------------- topGOdata object -------------------------
##
Visualizing the numbers of annotated and unannotated genes
allGenes <- featureNames(ALL)
group <- integer(length(allGenes))+1
group[allGenes %in% genes(GOdata_score_MF)] <- 0
group[!selectedProbes] <- 2
group <- factor(group,
labels=c("Used", "Not annotated", "Filtered"))
table(group)
## group
## Used Not annotated Filtered
## 3875 226 8524
pValues <- getPvalues(exprs(ALL), classlabel=y, alternative="two.sided")
geneVar <- apply(exprs(ALL), 1, var)
dd <- data.frame(x = geneVar[allGenes], y=log10(pValues[allGenes]), groups=group)
lattice::xyplot(y~x|group, data=dd, groups=group)
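In an ideal plot, the points in the Used panel lie mostly in the lower right and the rest in the upper left. The plot shows that many of the filtered-out genes are in fact significantly differentially expressed, so in practice the filtering step should be more conservative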
2.4 Operations on the topGOdata object
2.4.1 Description
description(GOdata_score_MF)
## [1] "ALL data analysis Object modified on: 18-0425"
description(GOdata_score_MF) <- paste("ALL data analysis.",
"Object modified on:",
format(Sys.time(), "%y-%m%d"))
description(GOdata_score_MF)
## [1] "ALL data analysis. Object modified on: 18-0425"
2.4.2 Getting the annotated genes
head(genes(GOdata_score_MF))
## [1] "1000_at" "1005_at" "1007_s_at" "1008_f_at" "1009_at" "100_g_at"
numGenes(GOdata_score_MF)
## [1] 3875
2.4.3 Getting the scores (p-values) of the genes that have a GO annotation
a <- geneScore(GOdata_score_MF,
whichGenes = names(geneList),
use.names=FALSE)
length(a)
## [1] 3875
length(geneList)
## [1] 4101
2.4.4 Updating the topGOdata object
.geneList <- geneScore(GOdata_score_MF, use.names = TRUE)
GOdata_score_MF
##
## ------------------------- topGOdata object -------------------------
##
## Description:
## - ALL data analysis Object modified on: 18-0425
##
## Ontology:
## - MF
##
## 4101 available genes (all genes from the array):
## - symbol: 1000_at 1005_at 1007_s_at 1008_f_at 1009_at ...
## - score : 0.0068656 0.53047 2.148059e-09 1 0.00022923 ...
## - 1018 significant genes.
##
## 3875 feasible genes (genes that can be used in the analysis):
## - symbol: 1000_at 1005_at 1007_s_at 1008_f_at 1009_at ...
## - score : 0.0068656 0.53047 2.148059e-09 1 0.00022923 ...
## - 974 significant genes.
##
## GO graph (nodes with at least 5 genes):
## - a graph with directed edges
## - number of nodes = 993
## - number of edges = 1289
##
## ------------------------- topGOdata object -------------------------
##
updateGenes(GOdata_score_MF, .geneList, diffGenesFUN)
##
## ------------------------- topGOdata object -------------------------
##
## Description:
## - ALL data analysis Object modified on: 18-0425
##
## Ontology:
## - MF
##
## 3875 available genes (all genes from the array):
## - symbol: 1000_at 1005_at 1007_s_at 1008_f_at 1009_at ...
## - score : 0.0068656 0.53047 2.148059e-09 1 0.00022923 ...
## - 974 significant genes.
##
## 3875 feasible genes (genes that can be used in the analysis):
## - symbol: 1000_at 1005_at 1007_s_at 1008_f_at 1009_at ...
## - score : 0.0068656 0.53047 2.148059e-09 1 0.00022923 ...
## - 974 significant genes.
##
## GO graph (nodes with at least 5 genes):
## - a graph with directed edges
## - number of nodes = 993
## - number of edges = 1289
##
## ------------------------- topGOdata object -------------------------
##
2.4.5 Working with GO terms
- GO graph information
graph(GOdata_score_MF)
## A graphNEL graph with directed edges
## Number of Nodes = 993
## Number of Edges = 1289
- Extracting all GO terms in use
allRelatedGO <- usedGO(GOdata_score_MF)
head(allRelatedGO)
## [1] "GO:0000049" "GO:0000149" "GO:0000166" "GO:0000175" "GO:0000217" "GO:0000287"
- Extracting the genes annotated to given GO IDs
selected.terms <- sample(usedGO(GOdata_score_MF), 10)
num.ann.genes <- countGenesInTerm(GOdata_score_MF, selected.terms)
num.ann.genes
## GO:0019209 GO:0000983 GO:0032357 GO:0019213 GO:0004198 GO:0042301 GO:0036459 GO:0001846
## 35 15 5 20 5 5 33 7
## GO:0005125 GO:0032451
## 27 18
ann.genes <- genesInTerm(GOdata_score_MF, selected.terms)
ann.genes[1:2]
## $`GO:0098973`
## [1] "32318_s_at" "34160_at" "AFFX-HSAC07/X00351_3_at"
## [4] "AFFX-HSAC07/X00351_3_st" "AFFX-HSAC07/X00351_5_at" "AFFX-HSAC07/X00351_M_at"
## $`GO:0099106`
## [1] "1158_s_at" "1336_s_at" "155_s_at" "160029_at" "1675_at" "31694_at"
## [7] "31900_at" "31901_at" "32498_at" "32558_at" "32715_at" "32749_s_at"
## [13] "33458_r_at" "34608_at" "34609_g_at" "34759_at" "34981_at" "36900_at"
## [19] "36935_at" "37184_at" "38516_at" "38604_at" "38774_at" "38831_f_at"
## [25] "39010_at" "39011_at" "41143_at" "41288_at" "457_s_at" "755_at"
## [31] "911_s_at" "955_at"
##
scoresInTerm(GOdata_score_MF, selected.terms)[1]
## $`GO:0098973`
## [1] 1.0000000 0.1708372 1.0000000 1.0000000 1.0000000 1.0000000
##
scoresInTerm(GOdata_score_MF, selected.terms, use.names = TRUE)[1]
## $`GO:0098973`
## 32318_s_at 34160_at AFFX-HSAC07/X00351_3_at
## 1.0000000 0.1708372 1.0000000
## AFFX-HSAC07/X00351_3_st AFFX-HSAC07/X00351_5_at AFFX-HSAC07/X00351_M_at
## 1.0000000 1.0000000 1.0000000
##
- Counting the genes and the significant genes for the selected terms
termStat(GOdata_score_MF, selected.terms)
## Annotated Significant Expected
## GO:0098973 6 0 1.51
## GO:0099106 32 12 8.04
## GO:0009055 46 11 11.56
## GO:0001104 45 13 11.31
## GO:0005161 5 2 1.26
## GO:0017160 7 3 1.76
## GO:0051536 16 2 4.02
## GO:0016866 6 3 1.51
## GO:0005509 136 45 34.18
## GO:0098772 510 146 128.19
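The figures in the Expected column are consistent with a simple proportion: the number of genes annotated to the term multiplied by the overall fraction of significant genes (974 of 3875 here). The short check below reproduces four of the rows above under that assumption; the variable names are made up for the illustration.

```r
annotated_total   <- 3875  # feasible genes in the object
significant_total <- 974   # significant genes among them

# Annotated counts for four of the terms listed above.
annotated <- c("GO:0098973" = 6, "GO:0099106" = 32,
               "GO:0005509" = 136, "GO:0098772" = 510)

# Expected number of significant genes per term if significance were unrelated to the term.
expected <- round(annotated * significant_total / annotated_total, 2)
expected
```

A term is a candidate for enrichment when its Significant count is clearly above this expected value, as for GO:0005509 (45 observed against 34.18 expected).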
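3. Enrichment tests
topGO supports three families of tests:
- Tests based on gene counts, which only need a list of gene names; Fisher's exact test, the hypergeometric test and the binomial test belong to this family
- Tests based on gene scores or ranks, including Kolmogorov-Smirnov-like tests (also known as GSEA), Gentleman's Category and the t-test
- Tests based on gene expression data, such as Goeman's globaltest
There are two ways to run a test on a topGOdata object. The first lets users define the testing procedure themselves (for advanced R users); the second is easier but leaves less room for customization.
Note: weight01 is a hybrid of the weight and elim algorithms, and it is the default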
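3.1 Approach 1: defining and running the test
The main function for enrichment analysis is getSigGroups(). It takes two arguments: a topGOdata object, and an object of class groupStats or one of its subclasses. The relationships between groupStats and its subclasses are shown in the following diagram:
3.1.1 The groupStats class
To compute the enrichment of the MF term GO:0046961 (proton-transporting ATPase activity, rotational mechanism) with Fisher's exact test, first define the gene universe, the genes annotated to the term and the significant genes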
goID <- "GO:0046961"
gene.universe <- genes(GOdata_score_MF)
go.genes <- genesInTerm(GOdata_score_MF, goID)[[1]]
go.genes
## [1] "32444_at" "33875_at" "34889_at" "35770_at" "36028_at" "36167_at" "36994_at" "37367_at"
## [9] "37395_at" "38686_at" "39326_at"
sig.genes <- sigGenes(GOdata_score_MF)
Now the groupStats object can be built; classicCount is a subclass of groupStats
my.group <- new("classicCount", testStatistic = GOFisherTest, name = "fisher", allMembers = gene.universe,
groupMembers = go.genes, sigMembers = sig.genes)
contTable(my.group)
## sig notSig
## anno 4 7
## notAnno 970 2894
# runTest can be applied directly to a groupStats object
runTest(my.group)
## [1] 0.2902583
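The testStatistic argument is the function that computes the test statistic. GOFisherTest is defined by topGO and performs Fisher's exact test; users can supply their own function instead. name is a descriptive label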
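The p-value for the classicCount group can be cross-checked in base R from the contingency table printed by contTable(my.group). The sketch below assumes a one-sided test for over-representation, and computes the same tail probability twice, once with fisher.test and once with phyper; the result should be close to the 0.29 reported above.

```r
# Contingency table printed by contTable(my.group).
tab <- matrix(c(4, 970, 7, 2894), nrow = 2,
              dimnames = list(c("anno", "notAnno"), c("sig", "notSig")))

# One-sided Fisher test: are significant genes over-represented in the term?
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# The same upper-tail probability from the hypergeometric distribution.
p_hyper <- phyper(tab["anno", "sig"] - 1,
                  m = sum(tab[, "sig"]),     # significant genes overall
                  n = sum(tab[, "notSig"]),  # non-significant genes overall
                  k = sum(tab["anno", ]),    # genes annotated to the term
                  lower.tail = FALSE)

c(fisher = p_fisher, hyper = p_hyper)
```

Because only the four counts are needed, the same check works for any gene set, not only GO terms.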
Besides classicCount, elimCount is another subclass of groupStats. Here 25% of the genes annotated to the term are removed at random
set.seed(2018 - 4 - 25)
elim.genes <- sample(go.genes, length(go.genes)/4)
elim.group <- new("elimCount", testStatistic = GOFisherTest, name = "fisher_elim", allMembers = gene.universe,
groupMembers = go.genes, sigMembers = sig.genes, elim = elim.genes)
contTable(elim.group)
## sig notSig
## anno 3 6
## notAnno 970 2894
# runTest can be applied directly to a groupStats object
runTest(elim.group)
## [1] 0.1682705
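Note: the groupStats classes do not depend on GO. As the arguments passed in show, they work for any kind of gene set, whether it comes from GO, KEGG or elsewhere
3.1.2 Running the hypothesis tests
For tests that use only the gene list, build a classicCount or elimCount object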
test.stat <- new("classicCount", testStatistic=GOFisherTest,
name="Fisher test")
resultFisher <- getSigGroups(GOdata_score_MF, test.stat)
resultFisher
##
## Description: ALL data analysis Object modified on: 18-0425
## Ontology: MF
## 'classic' algorithm with the 'Fisher test' test
## 993 GO terms scored: 50 terms with p < 0.01
## Annotation data:
## Annotated genes: 3875
## Significant genes: 974
## Min. no. of genes annotated to a GO: 5
## Nontrivial nodes: 905
For tests that also use the scores, build a classicScore or elimScore object (see the class diagram above)
test.stat <- new("classicScore", testStatistic = GOKSTest, name = "KS test")
resultKS <- getSigGroups(GOdata_score_MF, test.stat)
resultKS
##
## Description: ALL data analysis Object modified on: 18-0425
## Ontology: MF
## 'classic' algorithm with the 'KS test' test
## 993 GO terms scored: 65 terms with p < 0.01
## Annotation data:
## Annotated genes: 3875
## Significant genes: 974
## Min. no. of genes annotated to a GO: 5
## Nontrivial nodes: 993
test.stat <- new("elimScore", testStatistic = GOKSTest, name = "KS test elim")
resultKSElim <- getSigGroups(GOdata_score_MF, test.stat)
resultKSElim
##
## Description: ALL data analysis Object modified on: 18-0425
## Ontology: MF
## 'elim' algorithm with the 'KS test elim : 0.01' test
## 993 GO terms scored: 32 terms with p < 0.01
## Annotation data:
## Annotated genes: 3875
## Significant genes: 974
## Min. no. of genes annotated to a GO: 5
## Nontrivial nodes: 993
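The idea behind a score-based test can be sketched with base R's ks.test. The setting below is entirely made up: 100 genes are ranked by p-value and the genes of one term occupy the top 10 ranks. The one-sided test asks whether the term's ranks tend to be smaller than those of the remaining genes. This illustrates the rank-based principle only and is not presented as the exact computation behind GOKSTest.

```r
# Toy setting: 100 genes ranked by p-value (rank 1 = smallest p-value);
# the term's genes occupy the top 10 ranks. All numbers are invented.
n_genes     <- 100
term_ranks  <- 1:10
other_ranks <- setdiff(seq_len(n_genes), term_ranks)

# alternative = "greater": the CDF of the term's ranks lies above that of the rest,
# i.e. the term's ranks tend to be smaller.
ks_p <- ks.test(term_ranks, other_ranks, alternative = "greater")$p.value
ks_p
```

Unlike the count-based Fisher test, no significance cutoff is needed here: the whole ranking of the genes enters the test.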
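Make sure the chosen class is compatible with the test. As the class diagram shows, weight cannot be combined with score-based tests, but it is compatible with the Fisher test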
test.stat <- new("weightCount", testStatistic = GOFisherTest,
name = "Fisher test", sigRatio = "ratio")
resultFisherWeight <- getSigGroups(GOdata_score_MF, test.stat)
resultFisherWeight
##
## Description: ALL data analysis Object modified on: 18-0425
## Ontology: MF
## 'weight' algorithm with the 'Fisher test : ratio' test
## 993 GO terms scored: 20 terms with p < 0.01
## Annotation data:
## Annotated genes: 3875
## Significant genes: 974
## Min. no. of genes annotated to a GO: 5
## Nontrivial nodes: 905
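3.2 Whether to adjust the p-values
Note that the p-values returned by the tests are raw p-values without multiple-testing correction. The reasons for not adjusting them are:
- In many cases the distribution of the enrichment p-values is not very extreme. FDR or FWER corrections then produce conservative adjusted values, no term remains "significant", and important GO terms and related information are lost. In such cases researchers usually care about the ranking of the GO terms rather than whether each one passes an FDR threshold.
- Enrichment analysis consists of many steps and rests on many assumptions, of which the Fisher's exact test on each GO term is only one. Multiple-testing correction alone is far from enough to control the error rate.
- For the elim and weight methods, multiple-testing correction becomes even less reliable: the p-value computed for a GO term depends on its neighbouring terms, whereas multiple-testing correction assumes that the tests are independent.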
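The first point can be illustrated with base R's p.adjust on a handful of invented p-values. The sketch shows that the adjusted values are never smaller than the raw ones and that the ranking of the terms does not change; the term labels are placeholders, not real GO IDs.

```r
# Made-up raw p-values for five GO terms.
raw <- c("GO:A" = 0.001, "GO:B" = 0.012, "GO:C" = 0.030,
         "GO:D" = 0.040, "GO:E" = 0.600)

adj_bh   <- p.adjust(raw, method = "BH")          # controls the FDR
adj_bonf <- p.adjust(raw, method = "bonferroni")  # controls the FWER

round(cbind(raw, adj_bh, adj_bonf), 3)

# How many terms pass a 0.05 cutoff before and after the Bonferroni correction.
c(raw = sum(raw < 0.05), bonferroni = sum(adj_bonf < 0.05))

# The ordering of the terms is unchanged by the correction.
identical(order(raw), order(adj_bh))
```

Four terms pass the 0.05 cutoff on the raw scale and only one after the Bonferroni correction, while the order of the terms stays the same, which is why the ranking is often the more useful output.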
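3.3 Higher-level interface (user friendly)
runTest runs a test in a single call. The method is chosen through the algorithm argument (default weight01) and the statistic argument, for example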
resultFis <- runTest(GOdata_score_MF,
algorithm="classic",
statistic="fisher")
weight01.fisher <- runTest(GOdata_score_MF,
statistic = "fisher")
weight01.t <- runTest(GOdata_score_MF,
algorithm="weight01",
statistic="t")
elim.ks <- runTest(GOdata_score_MF,
algorithm="elim",
statistic="ks")
The available options are listed below (see also the class diagram above); note that some combinations are incompatible
whichTests()
## [1] "fisher" "ks" "t" "globaltest" "sum" "ks.ties"
whichAlgorithms()
## [1] "classic" "elim" "weight" "weight01" "lea" "parentchild"
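4. Enrichment results and visualization
4.1 The topGOresult object
A topGOresult object is very simple: it only holds the p-values or test statistics (collectively called scores). The score function returns them unsorted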
pvalFis <- score(resultFis)
head(pvalFis)
## GO:0000049 GO:0000149 GO:0000166 GO:0000175 GO:0000217 GO:0000287
## 0.9383324 0.9779475 0.3904532 0.8241708 0.3247581 0.3125296
The distribution of these scores can be inspected with a histogram
hist(pvalFis, 100, xlab="p-values")
The score function also has a whichGO argument for selecting specific GO IDs
pvalWeight <- score(resultFisherWeight, whichGO = names(pvalFis))
head(pvalWeight)
## GO:0000049 GO:0000149 GO:0000166 GO:0000175 GO:0000217 GO:0000287
## 0.9383324 0.9806822 0.7372081 0.8241708 0.3247581 0.3125296
The agreement between the results of different methods can be checked:
cor(pvalFis, pvalWeight)
## [1] 0.6590391
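# gSize is not defined earlier; assumed here to scale each point by the number of annotated genes
gstat <- termStat(GOdata_score_MF, names(pvalFis))
gSize <- gstat$Annotated / max(gstat$Annotated) * 4
plot(pvalFis, pvalWeight,
xlab = "p-value classic", ylab = "p-value weight",
pch = 19, cex = gSize, col = 1:2)
The result can also be summarized briefly (total annotated genes, annotated significant genes, minimum node size, and the number of GO terms that contain significant genes)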
geneData(resultFisherWeight)
## Annotated Significant NodeSize SigTerms
## 3875 974 5 905
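4.2 Summarizing the results
GenTable collects the results into a table. It takes the topGOdata object and one or more topGOresult objects, plus arguments that set the column to sort by and the number of entries to include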
allRes <- GenTable(GOdata_score_MF, classic = resultFis,
KS = resultKS, weight = resultFisherWeight,
orderBy = "weight",
ranksOf = "classic", topNodes = 20)
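What orderBy, ranksOf and topNodes do can be sketched in base R with two invented p-value vectors: the table is sorted by one method, cut to the top entries, and annotated with each term's rank under another method. The column names and term labels below are made up, and the sketch is not the GenTable implementation.

```r
# Made-up p-values for four GO terms from two methods.
p_classic <- c("GO:1" = 0.001, "GO:2" = 0.020, "GO:3" = 0.004, "GO:4" = 0.300)
p_weight  <- c("GO:1" = 0.050, "GO:2" = 0.002, "GO:3" = 0.004, "GO:4" = 0.400)

top_nodes <- 3
ord <- order(p_weight)[seq_len(top_nodes)]  # rows sorted by the "weight" p-values

tab <- data.frame(GO.ID           = names(p_weight)[ord],
                  rank_in_classic = rank(p_classic)[ord],  # position in the "classic" ordering
                  classic         = p_classic[ord],
                  weight          = p_weight[ord],
                  row.names       = NULL,
                  stringsAsFactors = FALSE)
tab
```

Putting the rank from a second method next to each row makes it easy to spot terms that one algorithm ranks highly and another does not.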
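4.3 Analysing a single GO term
The most direct way to see whether significant genes are enriched in a GO term is to look at the density of its scores
## Pick the first GO term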
goID <- allRes[1, "GO.ID"]
print(showGroupDensity(GOdata_score_MF, goID, ranks=TRUE))
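As the figure shows, the x-axis is the rank position derived from the scores and the y-axis is the density
The upper panel shows the density of all genes annotated to the chosen GO term, and the lower panel shows the density of the remaining genes. The genes in the term are concentrated in the region of low p-values, whereas the distribution in the lower panel is roughly uniform, which indicates a clear enrichment
Another convenient feature collects all genes in a GO term, together with their annotation and p-values, into a table. The whichTerms argument selects the GO terms; with several terms the function returns a list of data frames. A file argument can also be passed to write the output to a file (note: this function only works when an annotation package exists for the chip; it cannot be used with custom annotations)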
gt <- printGenes(GOdata_score_MF,
whichTerms = goID,
chip = "hgu95av2.db",
numChar = 40)
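4.4 Visualizing the GO hierarchy
Two functions are available. The first is showSigOfNodes: firstSigNodes sets the number of top significant nodes to show, and useInfo controls what is printed in each node, for example "def" (GO ID and definition) or "all" (GO ID, definition, score and annotation counts, that is, the number of the term's genes among all genes and among the significant genes). Significantly enriched GO terms are drawn as rectangles, black arrows denote is_a relations, red arrows denote part_of relations, and the node colour reflects the level of significance.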
showSigOfNodes(GOdata_score_MF,
score(resultFisherWeight),
firstSigNodes = 5,
useInfo = 'def')
The second, printGraph, writes the figure to a PDF file in the current directory; fn.prefix sets the prefix of the output file name
printGraph(GOdata_score_MF, resultFis,
firstSigNodes = 10,
fn.prefix = "tGO",
useInfo = "all",
pdfSW = TRUE)
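Limitations of this GO analysis:
- The analysis was run on probes rather than on gene names, which makes the gene counts of the GO terms inaccurate, for example when one probe maps to several genes
- GO level and evidence codes were not restricted; this could be done by building a GO annotation object limited to chosen levels and evidence codes and running the enrichment on it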