今天介紹和測試這款關聯ChIP-seq、RNA-seq分析軟件,BETA(Binding and Expression Target Analysis)。
該分析整合了ChIP-seq和轉錄表達水平來探究基因表達調控的機理。此前,一些研究用于轉錄因子靶基因預測,但是鮮有高效的工具供人們使用。
作者開發了名為BETA的軟件來研究結合位點與轉錄表達的關系,基于結合位點信息和差異表達信息可以進行如下三個主要分析:
(1)用于預測轉錄因子就有激活或抑制的功能;
(2)用于推斷轉錄因子的靶基因;
(3)用于鑒定轉錄因子的motif及其結合者。
該軟件具有三個主要的子命令:
第一部分:
BETA basic: 轉錄因子激活或抑制功能預測和直接靶基因檢測。
BETA basic \
–p peaks.bed \
–e expr.xls \
–k BSF \
–g hg19 \
--da 500 \
–n prefix \
-o outdir
涉及參數及其他 可選參數解釋如下:
-p peaks結合位點文件,BED格式,支持3列或5列格式,其格式為:
CHROM, START, END(NAME, SCORE)。
使用基本的3列信息即可。
示例如下(不包含表頭):
chrom start end name(可選) score(可選)
chr1 1208689 1209509 AR_LNCaP_2 51.58
chr1 1334246 1335348 AR_LNCaP_7 54.55
chr1 2179351 2180790 AR_LNCaP_9 257.72
注:macs2檢峰時刻通過p-value進行peak的過濾只保留10000個結果。
-e 差異表達文件,可以使用LIMMA和Cuffdiff的結果,也可使用BETA specific format(BSF)格式,其格式如下:
GeneID log2(FC) FDR/pval
NM_000015 1.28221 0.185744
NM_000016 0 1
NM_000017 0.0598207 0.844755
-k 差異表達文件的類型,即LIM, CUF, BSF, O,分別對應LIMMA、cuffdiff標準輸出格式、BSF格式或其他格式的文件需要使用--info指定。
-g 基因組,支持任何小鼠,hg38, hg19, hg18, mm10, mm9,也可使用-r指定其他基因組文件。
--gname2 給定次參數說明文件中的基因名稱或轉錄本名稱是官方的基因symbols,默認FALSE。
--info 指定表達數據中的geneID, up/down status和statistcal values列。
-r 參考基因組文件。
-o 輸出路徑。
--bl 是否使用CTCF邊界過濾基因周圍的peaks,默認FALSE。
--bf CTCF保守peakBED文件,需要同時指定--bl參數且基因組不能指定為hg19和mm9。
--pn 打算分析的peaks數目,默認10000。
--method 支持score和distance,指定TF/CR給你預測的方法,score:調控潛能的打分,disrance:鄰近的結合peak的距離。
-n 輸出文件前綴名稱。
-d 從TSS一定范圍內獲取peaks,默認100000(100kb)。
--df 輸入一個0~1范圍內的閾值來篩選最顯著的差異表達基因,默認1,全部輸入基因。
--da 通過比例或數目選取最顯著的差異表達基因,默認0.5。如果想使用diff_fdr請設置此參數為1。
-c 設置0~1閾值,通過p值(單尾KS檢驗)選擇最近的靶基因,默認0.001。
測試運行日志如下:
BETA basic \
-p peaks.bed \
-e beta_specific_expression.txt \
-k BSF \
-g hg19 \
--da 500 \
-n basic \
-o basic_out
[15:41:26] Argument List:
[15:41:26] Name = basic
[15:41:26] Peak File = ../peaks.bed
[15:41:26] Top Peaks Number = 10000
[15:41:26] Distance = 100000 bp
[15:41:26] Genome = hg19
[15:41:26] Expression File = ../beta_specific_expression.txt
[15:41:26] BETA specific Expression Type
[15:41:26] Number of differential expressed genes = 500.0
[15:41:26] Differential expressed gene FDR Threshold = 1
[15:41:26] Up/Down Prediction Cutoff = 0.001000
[15:41:26] Function prediction based on regulatory potential
[15:41:26] Check ../peaks.bed successfully!
[15:41:26] #ID Status Value
is the header of the expression file
[15:41:26] Checking the differential expression infomation...
[15:41:26] Take the first line with Differential Information as an example: NM_000014 -0.325878 0.618293
[15:41:26] Differential Expression file format successful passed
[15:41:26] You do not like filter peak by CFCT boundary, it will be filtered only by the distance
[15:41:26] Read file <../peaks.bed> OK! All <7031> peaks.
[15:41:41] Process <50801> genes
[15:41:56] Finished! Preliminary results saved into temporary file: <basic.txt>
[11815, 11643]
[15:41:56] Genes were seprated to two parts: up regulated and down regulated.
[15:41:56] Prepare file for the Up/Down Test
null device
1
[15:42:02] Finished, Find the result in basic_function_prediction.pdf
[15:42:02] Get the Rank Product of the "upregulate" genes
[15:42:46] Get the Rank Product of the "downregulate" genes
['"upregulate"', '"downregulate"']
[15:43:30] pick out the peaks 100000bp around the selected genes
[15:43:30] Finished: Find target gene associated peaks in basic_out
total time: 0:2:4
輸出文件存儲于basic_out文件夾:
$tree
.
├── basic_downtarget_associate_peaks.bed
├── basic_downtarget.txt
├── basic_function_prediction.pdf
├── basic_function_prediction.R
├── basic_uptarget_associate_peaks.bed
└── basic_uptarget.txt
- 激活/抑制功能預測打分
- 上調調控靶標/下調調控靶標文件
#文件截取示例
Chroms txStart txEnd refseqID rank product Strands GeneSymbol
chr1 27992571 27998724 NM_002038 4.592e-06 - IFI6
chr12 113376237 113411055 NM_006187 4.835e-06 + OAS3
chr12 113344738 113357712 NM_016816 5.523e-06 + OAS1
chr4 169277885 169401665 NM_001012967 5.616e-06 - DDX60L
- 靶基因相關的peak文件
hrom pStart pEnd Refseq Symbol Distance Score
chr1 27971570 27971739 NM_002038 IFI6 -27070 0.205399174599
chr1 27971776 27971885 NM_002038 IFI6 -26894 0.20685028671
chr12 113364362 113364438 NM_006187 OAS3 -11837 0.377766121894
chr12 113364362 113364438 NM_016816 OAS1 19662 0.276241443606
chr4 169422368 169422769 NM_001012967 DDX60L 20903 0.262863603291
第二部分:
BETA plus:進行激活/抑制功能預測, 直接靶基因預測和靶向區域motif分析,比基礎版本增加了motif分析
BETA plus \
–p peaks.bed \
–e expr.xls \
–k BSF \
–g hg19 \
--gs hg19.fa \
--bl
該模塊與basic相似,以下參數功能有差異:
--gs 基因序列文件。
--mn 指定0~1范圍內數值作為p-value或大于1作為數目來獲取motif的閾值, 默認10。
其他參數參考BETA Basic參數描述。
測試運行日志如下:
[16:49:30] Argument List:
[16:49:30] Name = NA
[16:49:30] Peak File = ../peaks.bed
[16:49:30] Top Peaks Number = 10000
[16:49:30] Distance = 100000 bp
[16:49:30] Genome = hg19
[16:49:30] Expression File = ../beta_specific_expression.txt
[16:49:30] Genome Sequence fasta formated data = hg19.fa
[16:49:30] BETA specific Expression Type
[16:49:30] Number of differential expressed genes = 0.5
[16:49:30] Differential expressed gene FDR Threshold = 1
[16:49:30] Up/Down Prediction Cutoff = 0.001000
[16:49:30] Function prediction based on regulatory potential
[16:49:30] Check ../peaks.bed successfully!
[16:49:30] #ID Status Value
is the header of the expression file
[16:49:30] Checking the differential expression infomation...
[16:49:30] Take the first line with Differential Information as an example: NM_000014 -0.325878 0.618293
[16:49:30] Differential Expression file format successful passed
[16:49:30] Read file <../peaks.bed> OK! All <7031> peaks.
[16:50:16] Process <50801> genes
[16:50:25] Finished! Preliminary results saved into temporary file: <NA.txt>
[11815, 11643]
[16:50:25] Genes were seprated to two parts: up regulated and down regulated.
cut: write error: Broken pipe
cut: write error: Broken pipe
[16:50:26] Prepare file for the Up/Down Test
null device
1
[16:51:03] Finished, Find the result in NA_function_prediction.pdf
[16:51:03] Get the Rank Product of the "upregulate" genes
[16:51:45] Get the Rank Product of the "downregulate" genes
['"upregulate"', '"downregulate"']
[16:52:26] pick out the peaks 100000bp around the selected genes
[16:52:27] Finished: Find target gene associated peaks in plus_out
[16:52:27] get three regions of every peak around the up genes
['up_middle.bed', 'up_left.bed', 'up_right.bed']
[16:53:54] get the fasta format sequence data of the three regions of up
[16:53:55] get three regions of every peak around the down genes
['down_middle.bed', 'down_left.bed', 'down_right.bed']
[16:55:14] get the fasta format sequence data of the three regions of down
[16:55:14] get three regions of every peak around the non genes
['non_middle.bed', 'non_left.bed', 'non_right.bed']
[16:56:37] get the fasta format sequence data of the three regions of non
[16:56:37] run mis to do the known motif scan with cistrome motif database
Calculating Background..
A or T: 93234
C or G: 100766
Others: 0
Calculating Background..
A or T: 97696
C or G: 95334
Others: 0
Calculating Background..
A or T: 97264
C or G: 95766
Others: 0
Calculating Background..
A or T: 118812
C or G: 142588
Others: 0
Calculating Background..
A or T: 125071
C or G: 135022
Others: 0
Calculating Background..
A or T: 123892
C or G: 136201
Others: 0
Calculating Background..
A or T: 181589
C or G: 169211
Others: 0
Calculating Background..
A or T: 191633
C or G: 157413
Others: 0
Calculating Background..
A or T: 191487
C or G: 157559
Others: 0
[16:59:58] T statistic test performed here to do the significance testing of every motif
[17:58:18] motif result will be in html format
Success parser from table.
Success parser from table.
Success parser from table.
Success parser from table.
Success parser from table.
Success parser from table.
[17:58:22] Done: find motif result in beatmotif.html file
[17:58:22] Done: find all BETA result in plus_out
total time: 1:8:52
基本一個小時,2G內存即可完成。
同樣,plus功能生成上/下調靶標文件和靶基因相關的peak文件,此外生成了靶基因相關motif的結果文件(存放于motifresult文件夾,包含一個*motif.html的motif分析文件)。
motif分析示例截圖如下:
同時生成上調、下調差異差異motif的文件:
表頭:MotifID Species Symbol DNA BindDom PSSM Tscore Pvalue
MS00266 Homo sapiens MLX Helix-Loop-Helix Family [[[0.534, 0.012, 0.331, 0.123], [0.01, 0.043, 0.152, 0.795], [0.011, 0.963, 0.011, 0.015], [0.955, 0.01, 0.025, 0.01], [0.01, 0.97, 0.01, 0.01], [0.01, 0.01, 0.97, 0.01], [0.01, 0.01, 0.01, 0.97], [0.01, 0.01, 0.97, 0.01], [0.8, 0.03, 0.117, 0.053], [0.066, 0.241, 0.017, 0.676]]] 5.08 4.30e-07
注: PSSM:位置特異性得分矩陣(position-specific scoring matrix),PSSM示例格式為: [[0.2,0.3,0.3,0.2],[0.1,0.8,0.05,0.05]]; Tscore:t-test的統計量數值。
另外,官網提供了motif分析結果的展示示例參考: http://cistrome.org/BETA/motifresult/betamotif.html。
這里面臨一個問題,有朋友會問一個轉錄因子靶向哪個或哪幾個基因:基于已知的知識可以檢索數據庫,數據庫這里推薦:http://cistrome.org/db/#/或進行預測,比如本工具提供一系列的結合區域(peak)對其靶基因進行預測,但該工具未給出具體的轉錄因子和哪些靶基因存在直接的關系,需要對motif的結合位置情況與peak進行關聯從而初步建立關系。
參考資料
官方使用手冊:http://cistrome.org/BETA/tutorial.html
BETA網頁版分析入口:http://cistrome.org/BETA/index.html#runweb
參考文獻
Wang, S., Sun, H., Ma, J., Zang, C., Wang, C., Wang, J., ... & Liu, X. S. (2013). Target analysis by integration of transcriptome and ChIP-seq data with BETA. Nature protocols, 8(12), 2502-2515.