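Because the number of samples is fairly large, a WGCNA analysis is feasible. There is no need to feed every gene into WGCNA; a common choice is the set of genes with the highest variability, or the set of significantly differentially expressed genes.
For reference, see: <https://github.com/jmzeng1314/my_WGCNA>
WGCNA grouped the lncRNAs into 18 modules (3,635 lncRNAs). In the spatial modules, lncRNA expression shows clear brain-region specificity, for example CB (M1, 794 lncRNAs), DG/CA1 (M2, 443 lncRNAs), CA1 (M4, 369 lncRNAs), neocortex (M7, 123 lncRNAs) and OC (M10, 57 lncRNAs).
In the temporal modules, lncRNA expression is associated with age but shows no clear association with brain region; in the sex modules, lncRNA expression is associated with both sex and age.
Each module then needs pathway/GO annotation analysis.
Resources:
Searching for WGCNA on Google, or on 生信技能樹 and 生信菜鳥團, turns up many tutorials. Several Chinese and English tutorials are listed below; Chinese tutorial 1 and English tutorial 3 are strongly recommended.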
- WGCNA Background and glossary
- Data input and cleaning
- Network construction and module detection
- Relating modules to external information and identifying important genes
- Interfacing network analysis with other data such as functional annotation and gene ontology
- Network visualization using WGCNA functions
- Exporting a gene network to external visualization software
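Background
Basic concepts
WGCNA (weighted gene co-expression network analysis, also called weighted correlation network analysis) is used to extract gene modules associated with traits or clinical features, and to dissect biological processes such as basic metabolic pathways, transcriptional regulation and translational regulation.
WGCNA suits complex data designs; at least 5 groups are recommended, for example:
developmental regulation across different organs or tissue types;
developmental regulation of one tissue across different stages;
responses to abiotic stress across time points;
responses to pathogen infection across time points.
Basic steps:
WGCNA has two parts: expression-based clustering and trait association. The concrete steps are computing correlation coefficients between genes, constructing the co-expression network, screening for specific modules, relating modules to traits, and screening for core genes.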
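Terminology:
Co-expression weighted network: an undirected, weighted network. In an unweighted network, the connection between two genes can only be 0 or 1: 0 means the two genes are not connected and 1 means they are. In a weighted network, genes are not merely connected or unconnected; the strength of their correlation is recorded as a number, and that number is the weight of the connection between the genes.
Module: a group of genes with similar expression patterns that cluster together.
Connectivity: the degree to which a gene is correlated with the other genes in the network.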
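Eigengene (eigen- + gene): a single representative expression profile that summarizes a gene-by-sample expression matrix.
Module eigengene E: the first principal component of the expression matrix of a module.
Hub gene: a gene with high connectivity within a module.
Gene significance GS: the correlation between a gene's expression and a trait; the larger its absolute value, the more relevant the gene is to the trait.
Module significance: the average absolute gene significance of the genes in a module.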
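To make the weighted versus unweighted distinction concrete, here is a minimal base-R sketch on simulated data. The matrix `expr`, the hard cut-off of 0.5 and the power of 6 are illustrative assumptions, not values taken from this analysis.

```r
# Toy data: 20 samples (rows) x 6 genes (columns); purely simulated.
set.seed(1)
expr <- matrix(rnorm(20 * 6), nrow = 20,
               dimnames = list(NULL, paste0("gene", 1:6)))
expr[, 2] <- expr[, 1] + rnorm(20, sd = 0.3)  # make gene2 track gene1

similarity <- abs(cor(expr))                  # unsigned co-expression similarity

# Unweighted network: hard threshold, every edge is 0 or 1.
adj_unweighted <- (similarity >= 0.5) * 1

# Weighted network: soft threshold, every edge is a value in [0, 1].
beta <- 6                                     # hypothetical power
adj_weighted <- similarity^beta

# Connectivity of a gene = sum of its edge weights to all other genes.
diag(adj_weighted) <- 0
connectivity <- rowSums(adj_weighted)
round(connectivity, 3)
```

Raising the similarity to a power keeps strong correlations close to their original value and pushes weak ones towards zero, without the all-or-nothing decision of a hard cut-off.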
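Analysis workflow
WGCNA takes an expression matrix as input, preferably RPKM or another normalized expression measure; clinical or other phenotype information is needed as well.
STEP1: Preparing the input data
The expression matrices can be downloaded from the authors' GitHub repository, https://github.com/DChenABLife/RhesusLncRNA. Only the lncRNA expression matrix is used here (https://github.com/DChenABLife/RhesusLncRNA/blob/master/All_sample_LncRNA_exp_RPKM.xlsx). The file is in Excel format and needs to be converted to CSV for later processing in R; simply open the Excel file and save it as CSV.
Reading the raw expression data
The raw data contain 64 samples and expression values for 9,904 lncRNAs. The first column is lncRNA_ID and column 66 is Annotation.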
setwd("G:/My_exercise/WGCNA/")
lncRNAexpr <- read.csv("All_sample_LncRNA_exp_RPKM.csv",sep=",",header = T)
head(lncRNAexpr)
dim(lncRNAexpr)
#[1] 9904 66
Tidying the data table: row names and column names
## Drop the Annotation column
fpkm <- lncRNAexpr[,-66]
head(fpkm)
## Use the lncRNA IDs as row names, then drop the ID column
rownames(fpkm)=fpkm[,1]
fpkm=fpkm[,-1]
fpkm[1:4,1:4]
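Preparing the phenotype information
There are 64 samples covering different brain regions of the rhesus macaque, different developmental stages, and both sexes. Each sample carries all three kinds of information at once. When I tried using all of the phenotype information, the module-trait relationships downstream were impossible to read, so only the brain-region information is used here: CB, DG, PFC, PCC, CA1, OC, PC and TC.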
##Sample Info
subname=sapply(colnames(fpkm),function(x) strsplit(x,"_")[[1]][1])
datTraits = data.frame(gsm=names(fpkm),
subtype=subname)
rownames(datTraits)=datTraits[,1]
head(datTraits)
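Installing and loading the WGCNA package
# biocLite() has been retired; Bioconductor packages are now installed with BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
#BiocManager::install(c("AnnotationDbi", "impute", "GO.db", "preprocessCore")) ## skip if these are already installed
BiocManager::install("WGCNA")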
library(WGCNA)
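Transposing rows and columns
WGCNA clusters genes, whereas the usual hclust clustering is done on samples. The matrix therefore has to be transposed so that rows are samples and columns are genes.
RNAseq_voom <- fpkm ## variable name kept from the reference code; the values here are RPKM
WGCNA_matrix = t(RNAseq_voom[order(apply(RNAseq_voom,1,mad), decreasing = TRUE)[1:5000],])
datExpr <- WGCNA_matrix ## top 5000 genes ranked by MAD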
datExpr[1:4,1:4]
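The selection and transposition above can be checked on a small simulated matrix. This is only a sketch: `toy`, the gene and sample names, and the cut-off `n_keep` are made up for illustration.

```r
# Simulated genes-by-samples matrix (rows = genes), like fpkm before transposing.
set.seed(2)
toy <- matrix(rexp(8 * 5), nrow = 8,
              dimnames = list(paste0("lnc", 1:8), paste0("S", 1:5)))
toy["lnc3", ] <- 1                       # a flat gene: MAD = 0, should be dropped

gene_mad <- apply(toy, 1, mad)           # median absolute deviation per gene
n_keep <- 4                              # hypothetical cut-off (5000 in the text)
keep <- order(gene_mad, decreasing = TRUE)[seq_len(min(n_keep, nrow(toy)))]

toy_datExpr <- t(toy[keep, ])            # samples in rows, genes in columns
dim(toy_datExpr)
```

The `min(n_keep, nrow(toy))` guard is a defensive choice: indexing with `1:5000` on a matrix with fewer rows would silently introduce NA rows.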
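Matching the phenotype table to the sample names
sampleNames = rownames(datExpr);
traitRows = match(sampleNames, datTraits$gsm)
datTraits = datTraits[traitRows, ] ## reorder the trait rows to follow datExpr
rownames(datTraits) = datTraits$gsm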
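Because every later step assumes that the rows of the expression matrix and the trait table are in the same order, it can be worth verifying this explicitly. The sketch below uses made-up sample names and toy objects (`toy_expr`, `toy_traits`) rather than the real data.

```r
# Toy stand-ins for datExpr (samples x genes) and datTraits; names are made up.
toy_expr <- matrix(1:6, nrow = 3,
                   dimnames = list(c("CB_1", "DG_1", "OC_1"), c("g1", "g2")))
toy_traits <- data.frame(gsm = c("OC_1", "CB_1", "DG_1"),
                         subtype = c("OC", "CB", "DG"),
                         stringsAsFactors = FALSE)

# Reorder the trait rows to follow the expression matrix, then verify.
idx <- match(rownames(toy_expr), toy_traits$gsm)
stopifnot(!anyNA(idx))                       # every sample has trait data
toy_traits <- toy_traits[idx, ]
rownames(toy_traits) <- toy_traits$gsm
stopifnot(identical(rownames(toy_expr), rownames(toy_traits)))
toy_traits
```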
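With datExpr and datTraits ready, the next step is to build the gene network and identify modules. There are three ways to construct the network: 1) one-step construction; 2) step-by-step construction; 3) block-wise construction (mainly for large data sets). The steps below use the one-step method.
STEP2: Choosing the best soft-thresholding power
Choose a suitable soft-thresholding power, beta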
Constructing a weighted gene network entails the choice of the soft thresholding power to which co-expression similarity is raised to calculate adjacency.
The function used is pickSoftThreshold.
# Choose a set of soft-thresholding powers
powers = c(c(1:10), seq(from = 12, to=20, by=2))
# Call the network topology analysis function
sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
# Plot the results:
par(mfrow = c(1,2));
cex1 = 0.9;
# Scale-free topology fit index as a function of the soft-thresholding power
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
main = paste("Scale independence"));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
labels=powers,cex=cex1,col="red");
# this line corresponds to using an R^2 cut-off of h
abline(h=0.90,col="green")
# Mean connectivity as a function of the soft-thresholding power
plot(sft$fitIndices[,1], sft$fitIndices[,5],
xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")
# Pick the beta value
best_beta=sft$powerEstimate
#> best_beta
#[1] 3
The best beta value is 3.
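For intuition about what the scale-free fit measures, the sketch below fits a straight line to log10(p(k)) against log10(k) for a given power, where k is gene connectivity. It is a simplified base-R illustration on simulated data, not WGCNA's exact implementation; the function name `scale_free_fit` and the choice of 10 bins are assumptions made for this example.

```r
# Simplified scale-free fit for one power; not WGCNA's exact implementation.
scale_free_fit <- function(expr, beta, n_bins = 10) {
  adj <- abs(cor(expr, use = "pairwise.complete.obs"))^beta
  diag(adj) <- 0
  k <- rowSums(adj)                               # connectivity per gene
  bins <- cut(k, breaks = n_bins)                 # discretise connectivity
  mean_k <- as.numeric(tapply(k, bins, mean))     # mean k in each bin (NA if empty)
  p_k <- as.numeric(table(bins)) / length(k)      # fraction of genes per bin
  ok <- !is.na(mean_k) & p_k > 0 & mean_k > 0     # drop empty bins before log
  log_p <- log10(p_k[ok])
  log_k <- log10(mean_k[ok])
  fit <- lm(log_p ~ log_k)
  c(power = beta,
    slope = unname(coef(fit)[2]),
    r_squared = summary(fit)$r.squared,
    mean_connectivity = mean(k))
}

set.seed(3)
toy_expr <- matrix(rnorm(30 * 200), nrow = 30)    # 30 samples x 200 simulated genes
fits <- t(sapply(c(1, 3, 6), function(b) scale_free_fit(toy_expr, b)))
fits
```

The quantity plotted on the left panel above is -sign(slope) * R^2, so a power is attractive when the fit is good and the slope is negative, while the right panel shows the price paid: mean connectivity falls as the power rises.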
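STEP3: Building the co-expression network in one step
The network is built in one step with power = sft$powerEstimate = 3; mergeCutHeight is the threshold used when merging modules.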
net = blockwiseModules(datExpr, power = sft$powerEstimate,
maxBlockSize = 6000,TOMType = "unsigned",
minModuleSize = 30,reassignThreshold = 0, mergeCutHeight = 0.25,
numericLabels = TRUE, pamRespectsDendro = FALSE,
saveTOMs = TRUE,
saveTOMFileBase = "AS-green-FPKM-TOM",
verbose = 3)
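Behind the one-step call, genes are clustered on a topological overlap (TOM) dissimilarity. The base-R sketch below shows the unsigned TOM formula on simulated data and clusters the result; the helper `tom_from_adjacency`, the two simulated gene groups, and the fixed `k = 2` cut are assumptions for illustration, whereas blockwiseModules cuts the tree dynamically and then merges similar modules.

```r
# Unsigned topological overlap from an adjacency matrix (base R sketch).
tom_from_adjacency <- function(adj) {
  diag(adj) <- 0
  shared <- adj %*% adj                    # sum_u a_iu * a_uj (shared neighbours)
  k <- rowSums(adj)                        # connectivity
  tom <- (shared + adj) / (outer(k, k, pmin) + 1 - adj)
  diag(tom) <- 1
  tom
}

set.seed(4)
n_samples <- 25
driver_a <- rnorm(n_samples)
driver_b <- rnorm(n_samples)
# Two simulated groups of 10 co-expressed genes each.
toy_expr <- cbind(sapply(1:10, function(i) driver_a + rnorm(n_samples, sd = 0.4)),
                  sapply(1:10, function(i) driver_b + rnorm(n_samples, sd = 0.4)))
colnames(toy_expr) <- paste0("gene", 1:20)

adj <- abs(cor(toy_expr))^3                # power 3, as estimated in the text
tom <- tom_from_adjacency(adj)
tree <- hclust(as.dist(1 - tom), method = "average")
modules <- cutree(tree, k = 2)             # fixed k for the toy example only
table(modules)
```

Two genes get a high overlap when they are strongly connected to each other and also share strongly connected neighbours, which makes the measure less sensitive to a single noisy correlation.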
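STEP4: Module identification and visualization
Module identification
table(net$colors) shows how many modules there are and how large each one is. Here there are 9 modules; their sizes decrease from module 1 to module 9, from 2254 down to 115. Label 0 holds the genes that were not assigned to any module.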
table(net$colors)
Visualization
# Convert labels to colors for plotting
mergedColors = labels2colors(net$colors)
table(mergedColors)
# Plot the dendrogram and the module colors underneath
plotDendroAndColors(net$dendrograms[[1]], mergedColors[net$blockGenes[[1]]],
"Module colors",
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05)
## assign all of the gene to their corresponding module
## hclust for the genes.
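# Record the number of genes and samples
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)
# First build a hierarchical clustering tree of the samples
datExpr_tree<-hclust(dist(datExpr), method = "average")
par(mar = c(0,5,2,0))
plot(datExpr_tree, main = "Sample clustering", sub="", xlab="", cex.lab = 1,
cex.axis = 1, cex.main = 1)
## If the samples have traits or clinical phenotypes, add them here to check whether the clustering is sensible
# Map the sample annotation built earlier to colors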
sample_colors <- numbers2colors(as.numeric(factor(datTraits$subtype)),
colors = c("grey","blue","red","green"),signed = FALSE)
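## This color-assignment code has to be adapted to your own data and project.
# sample_colors <- numbers2colors( datTraits ,signed = FALSE)
## If the samples are classified in several ways and datTraits holds only such categorical information, the line above can be used directly,
## although the resulting colors are not very distinct and carry little meaning.
# Plot the sample dendrogram with the trait color bar underneath (64 samples here)
par(mar = c(1,4,3,1),cex=0.8)
plotDendroAndColors(datExpr_tree, sample_colors,
groupLabels = "subtype",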
cex.dendroLabels = 0.8,
marAll = c(1, 4, 3, 1),
cex.rowText = 0.01,
main = "Sample dendrogram and trait heatmap")
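STEP5: Relating modules to traits
design=model.matrix(~0+ datTraits$subtype)
colnames(design)=levels(factor(datTraits$subtype)) ## factor() is needed when subtype is a character column (the default in R >= 4.0)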
moduleColors <- labels2colors(net$colors)
# Recalculate MEs with color labels
MEs0 = moduleEigengenes(datExpr, moduleColors)$eigengenes
MEs = orderMEs(MEs0); ## matrix of module eigengene (ME) values, samples x modules, one column per module color
moduleTraitCor = cor(MEs, design , use = "p");
moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples)
sizeGrWindow(10,6)
# Will display correlations and their p-values
textMatrix = paste(signif(moduleTraitCor, 2), "\n(",
signif(moduleTraitPvalue, 1), ")", sep = "");
dim(textMatrix) = dim(moduleTraitCor)
par(mar = c(6, 8.5, 3, 3));
# Display the correlation values within a heatmap plot
labeledHeatmap(Matrix = moduleTraitCor,
xLabels = colnames(design),
yLabels = names(MEs),
ySymbols = names(MEs),
colorLabels = FALSE,
colors = greenWhiteRed(50),
textMatrix = textMatrix,
setStdMargins = FALSE,
cex.text = 0.5,
zlim = c(-1,1),
main = paste("Module-trait relationships"))
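In the figure, the cell in column 2, row 5 (CB/turquoise) shows a correlation of 0.97 with p-value = 1e-41, so this module is a good candidate for follow-up analysis.
The sample trait behind each column can be looked up in design.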
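The numbers in each heatmap cell can be reproduced by hand. The sketch below, on simulated data, computes a module eigengene as the first principal component, builds the 0/1 region indicators, and derives the correlation and its Student p-value; the region labels, sample sizes and effect size are invented for illustration.

```r
set.seed(5)
regions <- factor(rep(c("CB", "DG", "OC"), each = 6))   # made-up sample labels
n <- length(regions)
cb_signal <- as.numeric(regions == "CB")

# One simulated module: 15 genes that are higher in CB samples.
module_expr <- sapply(1:15, function(i) 2 * cb_signal + rnorm(n, sd = 0.5))

# Module eigengene = first principal component of the scaled module expression.
pc <- prcomp(module_expr, center = TRUE, scale. = TRUE)
eigengene <- pc$x[, 1]
# The sign of a principal component is arbitrary; align it with mean expression.
if (cor(eigengene, rowMeans(module_expr)) < 0) eigengene <- -eigengene

# One 0/1 indicator column per region, as model.matrix(~ 0 + ...) builds.
design <- model.matrix(~ 0 + regions)
colnames(design) <- levels(regions)

r <- cor(eigengene, design)[1, ]                         # one correlation per region
t_stat <- r * sqrt((n - 2) / (1 - r^2))
p <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE) # Student p-values
round(rbind(r = r, p = p), 4)
```

Because each region is encoded as a separate 0/1 column, a strong positive correlation for one region is automatically accompanied by negative correlations for the others.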
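STEP6: Gene-level analysis of the module for the trait of interest
The CB/turquoise module is analyzed in detail below:
First compute the correlation matrix between modules and genes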
# names (colors) of the modules
modNames = substring(names(MEs), 3)
geneModuleMembership = as.data.frame(cor(datExpr, MEs, use = "p"));
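## Pearson correlation between every gene and every module eigengene
## MEs holds the value of each module eigengene in each sample
## datExpr holds the expression of each gene in each sample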
MMPvalue = as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nSamples));
names(geneModuleMembership) = paste("MM", modNames, sep="");
names(MMPvalue) = paste("p.MM", modNames, sep="");
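Next compute the correlation matrix between the trait and the genes
## Only numeric traits can be used in this calculation
## Here membership of the CB group is encoded as a 0/1 variable.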
CB = as.data.frame(design[,2]);
names(CB) = "CB"
geneTraitSignificance = as.data.frame(cor(datExpr, CB, use = "p"));
GSPvalue = as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nSamples));
names(geneTraitSignificance) = paste("GS.", names(CB), sep="");
names(GSPvalue) = paste("p.GS.", names(CB), sep="")
Finally, combine the two correlation matrices and analyze the module of interest
module = "turquoise"
column = match(module, modNames);
moduleGenes = moduleColors==module;
sizeGrWindow(7, 7);
par(mfrow = c(1,1));
verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]),
abs(geneTraitSignificance[moduleGenes, 1]),
xlab = paste("Module Membership in", module, "module"),
ylab = "Gene significance for CB",
main = paste("Module membership vs. gene significance\n"),
cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)
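The plot above shows that the genes in this module are highly correlated with the corresponding trait. They can be exported for GO/KEGG annotation to examine their functions.
STEP7: Visualizing the network
# First, a heatmap over all genes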
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)
geneTree = net$dendrograms[[1]];
dissTOM = 1-TOMsimilarityFromExpr(datExpr, power = sft$powerEstimate); # power = sft$powerEstimate, the best beta value, here 3
plotTOM = dissTOM^7;
diag(plotTOM) = NA;
#TOMplot(plotTOM, geneTree, moduleColors, main = "Network heatmap plot, all genes")
# Then plot a random subset of genes
nSelect = 400
# For reproducibility, we set the random seed
set.seed(10);
select = sample(nGenes, size = nSelect);
selectTOM = dissTOM[select, select];
# There’s no simple way of restricting a clustering tree to a subset of genes, so we must re-cluster.
selectTree = hclust(as.dist(selectTOM), method = "average")
selectColors = moduleColors[select];
# Open a graphical window
sizeGrWindow(9,9)
# Taking the dissimilarity to a power, say 7, makes the plot more informative by effectively changing
# the color palette; setting the diagonal to NA also improves the clarity of the plot
plotDiss = selectTOM^7;
diag(plotDiss) = NA;
TOMplot(plotDiss, selectTree, selectColors, main = "Network heatmap plot, selected genes")
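# Finally, plot the relationships between the modules and the trait
# Recalculate module eigengenes
MEs = moduleEigengenes(datExpr, moduleColors)$eigengenes
## Only numeric traits can be used in this calculation
## Here membership of the CB group is encoded as a 0/1 variable.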
CB = as.data.frame(design[,2]);
names(CB) = "CB"
# Add the weight to existing module eigengenes
MET = orderMEs(cbind(MEs, CB))
# Plot the relationships among the eigengenes and the trait
sizeGrWindow(5,7.5);
par(cex = 0.9)
plotEigengeneNetworks(MET, "", marDendro = c(0,4,1,2), marHeatmap = c(3,4,1,2), cex.lab = 0.8, xLabelsAngle = 90)
# Plot the dendrogram
sizeGrWindow(6,6);
par(cex = 1.0)
## Dendrogram of the module eigengenes
plotEigengeneNetworks(MET, "Eigengene dendrogram", marDendro = c(0,4,2,0),
plotHeatmaps = FALSE)
# Plot the heatmap matrix (note: this plot will overwrite the dendrogram plot)
par(cex = 1.0)
## Heatmap of the trait and module eigengene adjacencies
plotEigengeneNetworks(MET, "Eigengene adjacency heatmap", marHeatmap = c(3,4,2,2),
plotDendrograms = FALSE, xLabelsAngle = 90)
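STEP8: Extracting the gene names of a given module
Extract the gene information for downstream analyses, including annotation against functional databases such as GO/KEGG.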
# Select module
module = "turquoise";
# Select module probes
probes = colnames(datExpr) ## in this example the probes are the gene names
inModule = (moduleColors==module);
modProbes = probes[inModule];
GO analysis
It is unclear what type of gene ID these are, so how should they be annotated?
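One possible starting point, offered as a sketch rather than an answer: the raw table read in STEP1 has an Annotation column (column 66) that was dropped before the analysis, and it can be looked up again for the IDs of a module. The toy table and its annotation values below are invented; with the real data, `lncRNAexpr` and `modProbes` would take the place of `toy_table` and `toy_modProbes`.

```r
# Toy stand-in for lncRNAexpr: an ID column plus the Annotation column.
toy_table <- data.frame(
  lncRNA_ID  = c("lnc_a", "lnc_b", "lnc_c", "lnc_d"),
  Annotation = c("intergenic", "antisense", "intergenic", "intronic"),
  stringsAsFactors = FALSE
)
toy_modProbes <- c("lnc_d", "lnc_a")      # hypothetical IDs from one module

# Pull the annotation rows for the module's IDs, keeping the module order.
hits <- toy_table[match(toy_modProbes, toy_table$lncRNA_ID), ]
hits

# Save the list so it can be passed to an external annotation tool.
out_file <- file.path(tempdir(), "module_ids.csv")
write.csv(hits, out_file, row.names = FALSE)
```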
STEP9: Exporting modules
The gene-gene interaction information within a module can be exported to network visualization software such as Cytoscape and VisANT.
# Recalculate topological overlap
TOM = TOMsimilarityFromExpr(datExpr, power = sft$powerEstimate);
# Select module
module = "turquoise";
# Select module probes
probes = colnames(datExpr) ## in this example the probes are the gene names
inModule = (moduleColors==module);
modProbes = probes[inModule];
## Same extraction of the module's gene names as in STEP8
# Select the corresponding Topological Overlap
modTOM = TOM[inModule, inModule];
dimnames(modTOM) = list(modProbes, modProbes)
This gives the gene-gene relationship matrix for the module
First, export to VisANT
vis = exportNetworkToVisANT(modTOM,
file = paste("VisANTInput-", module, ".txt", sep=""),
weighted = TRUE,
threshold = 0)
Then export to Cytoscape
cyt = exportNetworkToCytoscape(
modTOM,
edgeFile = paste("CytoscapeInput-edges-", paste(module, collapse="-"), ".txt", sep=""),
nodeFile = paste("CytoscapeInput-nodes-", paste(module, collapse="-"), ".txt", sep=""),
weighted = TRUE,
threshold = 0.02,
nodeNames = modProbes,
nodeAttr = moduleColors[inModule]
);
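Conceptually, the edge file is a thresholded list of gene pairs with their TOM weights. The base-R sketch below builds such a list from a tiny invented matrix; the column names and the strict `>` comparison are assumptions made for illustration and may differ from what exportNetworkToCytoscape writes.

```r
# Small symmetric matrix standing in for modTOM; values are invented.
toy_tom <- matrix(c(1.00, 0.30, 0.01,
                    0.30, 1.00, 0.05,
                    0.01, 0.05, 1.00), nrow = 3, byrow = TRUE,
                  dimnames = list(c("g1", "g2", "g3"), c("g1", "g2", "g3")))
threshold <- 0.02

# Keep each gene pair once (upper triangle) if its weight passes the threshold.
keep <- which(upper.tri(toy_tom) & toy_tom > threshold, arr.ind = TRUE)
edges <- data.frame(fromNode = rownames(toy_tom)[keep[, "row"]],
                    toNode   = colnames(toy_tom)[keep[, "col"]],
                    weight   = toy_tom[keep],
                    stringsAsFactors = FALSE)
edges
```

Raising the threshold is the simplest way to keep the exported network small enough to lay out in a visualization tool.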
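STEP10: Within-module analysis: extracting hub genes
- Hub genes are the genes with high connectivity within a module, for example the top 30 or the top 10%.
- Highly connected hub genes are usually regulators (relatively upstream in the regulatory network), whereas genes with low connectivity are usually further downstream (for example transporters and catalytic enzymes).
Hub gene: |kME| >= threshold (0.8)
Concepts related to module features, ME/kME/kIN:
(1) Computing connectivity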
### Intramodular connectivity, module membership, and screening for intramodular hub genes
# (1) Intramodular connectivity
# moduleColors <- labels2colors(net$colors)
connet=abs(cor(datExpr,use="p"))^sft$powerEstimate # use the power chosen in STEP2 rather than a hard-coded 6
Alldegrees1=intramodularConnectivity(connet, moduleColors)
head(Alldegrees1)
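What the connectivity columns mean can be seen in a few lines of base R. This is a sketch on simulated data with a made-up module assignment and a hypothetical power; it mirrors the idea of total, within-module and out-of-module connectivity rather than reproducing the package function.

```r
set.seed(6)
toy_expr <- matrix(rnorm(20 * 8), nrow = 20,
                   dimnames = list(NULL, paste0("gene", 1:8)))
toy_colors <- rep(c("blue", "brown"), each = 4)   # made-up module assignment

beta <- 6                                         # hypothetical power
adj <- abs(cor(toy_expr))^beta
diag(adj) <- 0                                    # ignore self-connections

same_module <- outer(toy_colors, toy_colors, "==")
k_total  <- rowSums(adj)                          # connectivity to every gene
k_within <- rowSums(adj * same_module)            # ... to genes in the same module
k_out    <- k_total - k_within                    # ... to genes in other modules

data.frame(module = toy_colors, kTotal = k_total,
           kWithin = k_within, kOut = k_out)
```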
(2) Relationship between intramodular connectivity and gene significance
# (2) Relationship between gene significance and intramodular connectivity
which.module="black"
CB= as.data.frame(design[,2]); # column 2 of design is the CB indicator; change this for another trait
names(CB) = "CB"
GS1 = as.numeric(cor(CB,datExpr, use="p"))
GeneSignificance=abs(GS1)
colorlevels=unique(moduleColors)
sizeGrWindow(9,6)
par(mfrow=c(2,as.integer(0.5+length(colorlevels)/2)))
par(mar = c(4,5,3,1))
for (i in c(1:length(colorlevels)))
{
whichmodule=colorlevels[[i]];
restrict1 = (moduleColors==whichmodule);
verboseScatterplot(Alldegrees1$kWithin[restrict1],
GeneSignificance[restrict1], col=moduleColors[restrict1],
main=whichmodule,
xlab = "Connectivity", ylab = "Gene Significance", abline = TRUE)
}
(3) Computing module membership for all genes and screening for hub genes
abs(GS1) > .9: this cut-off can be adjusted to the data at hand
abs(datKME$MM.black) > .8: should be at least 0.8
###(3) Generalizing intramodular connectivity for all genes on the array
datKME=signedKME(datExpr, MEs, outputColumnName="MM.")
# Display the first few rows of the data frame
head(datKME)
##Finding genes with high gene significance and high intramodular connectivity in
# interesting modules
# abs(GS1) > .9: this cut-off can be adjusted to the data at hand
# abs(datKME$MM.black) > .8: should be at least 0.8
FilterGenes= abs(GS1)> .9 & abs(datKME$MM.black)>.8
table(FilterGenes)
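The same screening logic can be tried end to end on simulated data. In this sketch the binary trait, the five co-expressed genes and the thresholds of 0.8 are all illustrative assumptions, and the eigengene is taken as the first principal component of the toy matrix.

```r
set.seed(7)
n <- 24
trait <- rep(c(0, 1), each = n / 2)                 # made-up binary trait
core <- 2 * trait + rnorm(n, sd = 0.3)              # shared module signal
toy_expr <- cbind(
  sapply(1:5, function(i) core + rnorm(n, sd = 0.2)),   # tightly co-expressed genes
  sapply(1:5, function(i) rnorm(n))                     # unrelated genes
)
colnames(toy_expr) <- paste0("gene", 1:10)

# Eigengene of the toy module (PC1 of all ten genes here), then kME and GS.
eigengene <- prcomp(toy_expr, scale. = TRUE)$x[, 1]
kME <- as.numeric(cor(toy_expr, eigengene))         # module membership
GS  <- as.numeric(cor(toy_expr, trait))             # gene significance

hub <- abs(kME) >= 0.8 & abs(GS) >= 0.8             # thresholds are adjustable
colnames(toy_expr)[hub]
```

Requiring both conditions keeps genes that are central to the module and also track the trait, which is usually a much shorter list than either filter alone.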
STEP11: Other analyses
(1) another plot for the relationship between module eigengenes
plotMEpairs(MEs,y=design[,2]) # datTraits has no cellType column; the 0/1 CB indicator is used as the trait
(2) Diagnostics: heatmap plots of module expression
sizeGrWindow(8,9)
#par(mfrow=c(3,1), mar=c(1, 2, 4, 1))
# for the blue module
which.module="blue";
plotMat(t(scale(datExpr[,moduleColors==which.module ]) ),nrgcols=30,rlabels=T,
clabels=T,rcols=which.module,
title=which.module )
(3) Diagnostics: displaying module heatmap and the eigengene
sizeGrWindow(8,7);
which.module="blue"
ME=MEs[, paste("ME",which.module, sep="")]
par(mfrow=c(2,1), mar=c(0.3, 5.5, 3, 2))
plotMat(t(scale(datExpr[,moduleColors==which.module ]) ),
nrgcols=30,rlabels=F,rcols=which.module,
main=which.module, cex.main=2)
par(mar=c(5, 4.2, 0, 0.7))
barplot(ME, col=which.module, main="", cex.main=2,
ylab="eigengene expression",xlab="sample")
Thanks to my seniors Jimmy and Aaron Li_bioinformatics for their guidance while I was learning this.