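Learning objectives: the previous post downloaded the RNA-seq data SRR3589956.sra through SRR3589962.sra. This post unpacks them to FASTQ with sratoolkit 2.6.3, looks at the FASTQ format, checks data quality with FastQC, visualizes the data in IGV, and practices batch processing.
References: http://www.biotrainee.com/thread-1831-1-1.html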
http://fbb84b26.wiz03.com/share/s/3XK4IC0cm4CL22pU-r1HPcQQ2irG2836uQYm2iZAyh1Zwf3_
1. Using sratoolkit
Run fastq-dump -h to view the help text.
fastq-dump [options] <path> [<path>...] # basic usage
Common options:
INPUT
  -A|--accession <accession>   Replaces accession derived from <path> in filename(s) and deflines (only for single table dump)
  --table <table-name>         Table name within cSRA object, default is "SEQUENCE"
OUTPUT
  -O|--outdir <path>           Output directory, default is working directory ('.')
  -Z|--stdout                  Output to stdout, all split data become joined into single stream
  --gzip                       Compress output using gzip # FastQC can read gzip-compressed files directly
  --bzip2                      Compress output using bzip2 # better compression ratio than gzip, but slower
Multiple File Options (setting these options will produce more than 1 file, each of which will be suffixed according to splitting criteria)
  --split-files                Dump each read into separate file. Files will receive suffix corresponding to read number
  --split-3                    Legacy 3-file splitting for mate-pairs: first biological reads satisfying dumping conditions are placed in files *_1.fastq and *_2.fastq. If only one biological read is present it is placed in *.fastq. Biological reads 3 and above are ignored.
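Batch unpacking:
# Run from the directory that holds the .sra files. Each file is passed as the <path> argument; per the help text above, -A only replaces the accession name derived from that path.
for i in $(seq 56 62)
do
/opt/NfsDir/BioDir/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --gzip --split-3 -O /opt/NfsDir/UserDir/qin/qin/Data/RNAseq/ SRR35899${i}.sra
done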
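Before launching the real batch, it can help to preview what the loop will run. The following is a minimal sketch that is not part of the original notes: build_cmds is a hypothetical helper name, and the directories sra and fastq are placeholders. It only prints the fastq-dump command lines and does not execute them.

```shell
# Dry run: print one fastq-dump command per accession without executing it.
# build_cmds is a hypothetical helper; "sra" and "fastq" are placeholder paths.
build_cmds() {
    sra_dir=$1
    out_dir=$2
    for i in $(seq 56 62); do
        # "SRR35899" followed by 56..62 gives SRR3589956 .. SRR3589962
        printf 'fastq-dump --gzip --split-3 -O %s %s/SRR35899%s.sra\n' \
            "$out_dir" "$sra_dir" "$i"
    done
}

build_cmds sra fastq
```

Once the printed lines look right, the same loop body can call fastq-dump directly.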
The gzip companion commands zgrep, zcat, zless, and zdiff read compressed files directly, so there is no need to decompress first. Example: zcat SRR3589956_1.fastq.gz | head -n 4
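The same streaming idea can be used to count reads without writing an uncompressed copy. This is a minimal sketch, assuming every record occupies exactly four lines. count_reads is a hypothetical helper name, and the two-read demo file is made up so the example runs on its own.

```shell
# Count reads in a gzip-compressed FASTQ file by streaming it.
# count_reads is a hypothetical helper name.
count_reads() {
    # gzip -cd streams the decompressed data, like zcat; 4 lines per read
    lines=$(gzip -cd "$1" | wc -l)
    echo $((lines / 4))
}

# Build a tiny two-read demo file so the sketch is self-contained.
demo=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nJJJJ\n' | gzip > "$demo/demo.fastq.gz"

count_reads "$demo/demo.fastq.gz"
```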
2. Batch quality checks with FastQC
Reference: http://www.biotrainee.com/thread-324-1-1.html
Format: each record in a FASTQ file normally spans four lines:
@DJB775P1:248:D0MDGACXX:7:1202:12362:49613 1:Y:18:ATCACG # line 1: title line starting with @. Fields in order: instrument name, run id, flowcell id, flowcell lane, tile number within the lane, x-coordinate of the cluster within the tile, y-coordinate of the cluster within the tile, member of a pair (1 or 2), Y if the read is filtered and N otherwise, 0 when none of the control bits are on (otherwise an even number), index sequence
TGCTTACTCTGCGTTGATACCACTGCTTAGATCGGAAGAGCACACGTCTGAA # line 2: the sequence
+ # line 3: separator
JJJJJIIJJJJJJHIHHHGHFFFFFFCEEEEEDBD?DDDDDDBDDDABDDCA # line 4: base qualities, Phred+33 encoded
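To make the record layout concrete, the sketch below pulls two fields out of the title line above and decodes single quality characters. It is an illustration only: the variable names and the phred33_score helper are hypothetical, and it assumes the Phred+33 encoding stated above, where the score is the character's ASCII code minus 33.

```shell
# Split the title line into fields and decode Phred+33 quality characters.
# Variable and helper names are hypothetical.
header='@DJB775P1:248:D0MDGACXX:7:1202:12362:49613 1:Y:18:ATCACG'

# The part before the space locates the cluster; the part after describes the read.
location=${header%% *}
read_info=${header#* }

# The third colon-separated field of the location part is the flowcell id.
flowcell=$(printf '%s\n' "$location" | cut -d: -f3)
# The first field of the read part is the pair member (1 or 2).
pair_member=$(printf '%s\n' "$read_info" | cut -d: -f1)

# Phred+33: quality score = ASCII code of the character minus 33.
phred33_score() {
    code=$(printf '%d' "'$1")   # a leading quote makes printf emit the character code
    echo $((code - 33))
}

echo "flowcell=$flowcell pair_member=$pair_member"
phred33_score 'J'   # ASCII 74 -> 41
phred33_score '?'   # ASCII 63 -> 30
```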
fastqc usage:
fastqc SRR3589956_1.fastq.gz
fastqc seqfile1 seqfile2 .. seqfileN
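Common options:
-o: output directory
--extract: whether to unzip the output file automatically; the default is --noextract
-t: number of threads; depends on the machine, since each thread needs 250 MB of memory
-c: contaminants file used to screen the reads, since a library may be contaminated, for example with sequence from another species
-a: adapters file
-q: quiet mode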
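The options above can be combined as in the following sketch. It is not taken from the original notes: the output directory name fastqc_results and the thread count are placeholder assumptions, and the script skips the FastQC call when the program or the input files are missing.

```shell
# Run FastQC quietly on every gzip-compressed FASTQ file in the current
# directory. "fastqc_results" and the thread count are placeholder choices.
threads=4
outdir="fastqc_results"

# Each thread needs about 250 MB, so estimate the total before starting.
mem_mb=$((threads * 250))
echo "approximate memory needed: ${mem_mb} MB"

# FastQC expects the output directory to exist already.
mkdir -p "$outdir"

if ! command -v fastqc >/dev/null 2>&1; then
    echo "fastqc not found on PATH; skipping" >&2
elif ! ls ./*.fastq.gz >/dev/null 2>&1; then
    echo "no *.fastq.gz files here; skipping" >&2
else
    fastqc -q -t "$threads" -o "$outdir" ./*.fastq.gz
fi
```

Passing all files to a single fastqc call lets -t process several files at once, which a one-file-per-call loop cannot do.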
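Each input produces two result files: an HTML report and a zip archive (for example SRR3589956_1_fastqc.html and SRR3589956_1_fastqc.zip).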
Looking at the QC result for SRR3589956: why is a section missing in the middle?
[Figure: FastQC report for SRR3589956]
Viewing batch QC results with MultiQC
# first generate the QC results
for id in *gz; do /opt/NfsDir/BioDir/fastqc/FastQC/fastqc -t 4 "$id"; done
# then aggregate them with multiqc
multiqc *fastqc.zip --pdf
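Before aggregating, it may be worth checking that every input actually has a FastQC archive. The sketch below is an assumption-laden illustration: missing_reports is a hypothetical helper, and it assumes FastQC wrote its output next to the inputs, using the input name without .fastq.gz plus _fastqc.zip. The demo uses empty placeholder files.

```shell
# List FASTQ inputs that have no matching FastQC zip yet.
# missing_reports is a hypothetical helper; it assumes reports sit next to
# the inputs and are named <input without .fastq.gz>_fastqc.zip.
missing_reports() {
    for fq in "$1"/*.fastq.gz; do
        [ -e "$fq" ] || continue            # the glob matched nothing
        report="${fq%.fastq.gz}_fastqc.zip"
        [ -e "$report" ] || echo "${fq##*/}"
    done
}

# Demo with empty placeholder files so the sketch runs on its own.
demo=$(mktemp -d)
touch "$demo/a_1.fastq.gz" "$demo/a_2.fastq.gz" "$demo/a_1_fastqc.zip"
missing_reports "$demo"
```

In the demo only a_2.fastq.gz lacks a report, so that is the single name printed.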