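Third-generation sequencing is becoming increasingly common, so this post introduces blasr, a tool for aligning third-generation sequencing data. It is similar to BLAST but differs in emphasis: blasr focuses on aligning long sequences end to end and tolerates a certain error rate, whereas BLAST focuses on stricter local alignment. The history of how the various alignment tools evolved is shown in the figure below:
Without further ado, here is how to use it.
I. Main parameters
1. Input sequencing data file
reads.fasta: a plain FASTA file, the most commonly used input; the following formats are also accepted: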
reads.bax.h5|reads.plx.h5
2. Alignment parameters
-minMatch m (12)
Minimum seed length. Higher minMatch will speed up alignment, but decrease sensitivity.
-maxMatch l (inf)
Stop mapping a read to the genome when the lcp length reaches l. This is useful when the query is part of the reference, for example when constructing pairwise alignments for de novo assembly.
-maxLCPLength l (inf)
The same as -maxMatch.
-maxAnchorsPerPosition m (10000)
Do not add anchors from a position if it matches to more than 'm' locations in the target.
-advanceExactMatches E (0)
Another trick for speeding up alignments with match - E fewer anchors. Rather than finding anchors between the read and the genome at every position in the read, when an anchor is found at position i in a read of length L, the next position in a read to find an anchor is at i+L-E. Use this when aligning already assembled contigs.
-nCandidates n (10)
Keep up to 'n' candidates for the best alignment. A large value of n will slow mapping because the slower dynamic programming steps are applied to more clusters of anchors, which can be a rate-limiting step when reads are very long.
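To make the seed-length trade-off concrete, here is a toy sketch in Python. It is not blasr's actual algorithm or data structure; it simply looks up exact matches of length min_match between a read and a target, and skips seeds that hit more places than a cap, which is the idea behind -maxAnchorsPerPosition. The function name, the sequences, and the use of N for a sequencing error are all made up for illustration.

```python
from collections import defaultdict


def find_anchors(read, target, min_match=12, max_anchors_per_position=10000):
    """Toy seeding: list (read_pos, target_pos) pairs that share an exact min_match-mer."""
    # Index every substring of length min_match in the target by its start positions.
    index = defaultdict(list)
    for t_pos in range(len(target) - min_match + 1):
        index[target[t_pos:t_pos + min_match]].append(t_pos)

    anchors = []
    for q_pos in range(len(read) - min_match + 1):
        hits = index.get(read[q_pos:q_pos + min_match], [])
        # Same idea as -maxAnchorsPerPosition: ignore seeds that hit too many places.
        if len(hits) > max_anchors_per_position:
            continue
        anchors.extend((q_pos, t_pos) for t_pos in hits)
    return anchors


target = "ACGTTGCAAGCTTAGGCCAT"          # made-up 20-base reference
read = target[:10] + "N" + target[11:]   # same sequence with one erroneous base

short_seeds = find_anchors(read, target, min_match=4)
long_seeds = find_anchors(read, target, min_match=12)
print(len(short_seeds), len(long_seeds))
```

In this example every 12-base window of the read overlaps the erroneous base, so the long seed finds no anchors at all, while the 4-base seed still anchors both flanks. Longer seeds mean fewer lookups and fewer candidates, at the cost of missing reads with closely spaced errors.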
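3. Other parameters
-nproc N (1): number of CPUs to use
-minPctIdentity p (0): minimum percent identity
-minReadLength l (50): minimum read length required for alignment
-minSubreadLength l (0)
Do not align subreads of length less than l.
-bestn n (10): number of best alignments to output
-sam: write output in SAM format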
-clipping [none|hard|subread|soft] (none)
Use no/hard/subread/soft clipping for SAM output.
-out out (terminal): output file name
-unaligned file: write unaligned reads to this file
-m t: output format setting
If not printing SAM, modify the output of the alignment.
t=0 Print blast like output with |'s connecting matched nucleotides.
1 Print only a summary: score and pos.
2 Print in Compare.xml format.
3 Print in vulgar format (deprecated).
4 Print a longer tabular version of the alignment.
5 Print in a machine-parsable format that is read by compareSequences.py.
II. Usage
blasr reads genome.fasta [-options]
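As a minimal sketch of how this command line might be assembled from a script, the Python helper below builds the argument list from a few of the options described above. The function name build_blasr_command and the file names are hypothetical, and the single-dash option spellings are simply the ones used in this post; option spellings can differ between blasr versions, so check the help output of your own installation.

```python
import shlex


def build_blasr_command(reads, genome, out=None, sam=False, nproc=1,
                        min_match=None, min_pct_identity=None):
    """Return a blasr argument list using the option spellings from this post."""
    cmd = ["blasr", reads, genome]
    if sam:
        cmd.append("-sam")                      # SAM output instead of the -m formats
    if out is not None:
        cmd += ["-out", out]                    # otherwise blasr prints to the terminal
    cmd += ["-nproc", str(nproc)]
    if min_match is not None:
        cmd += ["-minMatch", str(min_match)]
    if min_pct_identity is not None:
        cmd += ["-minPctIdentity", str(min_pct_identity)]
    return cmd


cmd = build_blasr_command("reads.fasta", "genome.fasta",
                          out="aligned.sam", sam=True, nproc=4)
print(shlex.join(cmd))
# To actually run it (requires blasr on the PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building a list instead of a single string avoids shell quoting problems when file names contain spaces.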
III. Output formats
(a) blasr option: -m 0
BLAST-like human-readable output with |'s connecting matched nucleotides.
(b) blasr option: -m 1
Space-delimited summary of alignments containing 13 fields:
qName tName qStrand tStrand score percentSimilarity tStart tEnd tLength qStart qEnd qLength nCells
(c) blasr option: -m 2
XML format.
(d) blasr option: -m 3
Vulgar format (deprecated).
(e) blasr option: -m 4
Space-delimited summary of alignments containing 13 fields:
qName tName score percentSimilarity qStrand qStart qEnd qLength tStrand tStart tEnd tLength mapQV
(f) blasr option: -m 5
Space-delimited machine-parsable format containing 19 fields:
qName qLength qStart qEnd qStrand tName tLength tStart tEnd tStrand score numMatch numMismatch numIns numDel mapQV qAlignedSeq matchPattern tAlignedSeq
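Since the -m 5 format is meant to be machine-parsable, a short Python sketch of reading it may help. It assumes one alignment per line, exactly the 19 space-delimited fields listed above, and no spaces inside read names. The sample line is invented for illustration, and the identity formula (matches divided by all alignment columns) is one common definition, not necessarily the one blasr uses for its own percentages.

```python
M5_FIELDS = [
    "qName", "qLength", "qStart", "qEnd", "qStrand",
    "tName", "tLength", "tStart", "tEnd", "tStrand",
    "score", "numMatch", "numMismatch", "numIns", "numDel", "mapQV",
    "qAlignedSeq", "matchPattern", "tAlignedSeq",
]
# Fields that hold integers; strands and sequences stay as strings.
M5_INT_FIELDS = [
    "qLength", "qStart", "qEnd", "tLength", "tStart", "tEnd",
    "score", "numMatch", "numMismatch", "numIns", "numDel", "mapQV",
]


def parse_m5_line(line):
    """Turn one -m 5 line into a dict keyed by the field names above."""
    values = line.split()
    if len(values) != len(M5_FIELDS):
        raise ValueError(f"expected {len(M5_FIELDS)} fields, got {len(values)}")
    record = dict(zip(M5_FIELDS, values))
    for name in M5_INT_FIELDS:
        record[name] = int(record[name])
    return record


def alignment_identity(record):
    """Matches divided by all alignment columns (an assumed definition)."""
    columns = (record["numMatch"] + record["numMismatch"]
               + record["numIns"] + record["numDel"])
    return record["numMatch"] / columns if columns else 0.0


# Invented example line, not real blasr output.
sample = ("read1 10 0 10 + chr1 100 20 30 + -45 9 1 0 0 254 "
          "ACGTACGTAC |||||*|||| ACGTAGGTAC")
record = parse_m5_line(sample)
print(record["tName"], record["tStart"], record["tEnd"], alignment_identity(record))
```

The same pattern works for the -m 1 and -m 4 formats by swapping in their field lists.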
(g) blasr option: -sam
SAM format. The SAM tags are described below:
(1)"XS": 1 plus (first base of SEQ in 0 based coordinate of zmw unrolled polymerase read), inclusive, where SEQ is SAM mandatory field column 10.
(2)"XE": 1 plus (last base of SEQ in 0 based coordinate of zmw unrolled polymerase read), exclusive.
(3)"XL": number of aligned query bases
(4)"XQ": length of zmw unrolled polymerase read.
(5)"XT": number of continuous reads, always 1 for blasr.
(6)"YS": first base of query subread in 0 based coordinate of zmw unrolled polymerase read, inclusive. movie/zmw/YS_YE
(7)"YE": last base of query subread in 0 based coordinate of zmw unrolled polymerase read, exclusive.
(8)"ZM": zmw number.
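The tags above can be pulled out of a SAM record with a few lines of Python. This sketch relies only on the general SAM layout (11 tab-separated mandatory columns followed by optional TAG:TYPE:VALUE fields) and on the tag definitions given in this post. The helper names and the sample record are made up for illustration.

```python
def parse_sam_tags(sam_line):
    """Return the optional fields of one SAM alignment line as a dict."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for item in fields[11:]:                 # columns 1-11 are mandatory fields
        tag, value_type, value = item.split(":", 2)
        # Type "i" is an integer; other types are kept as strings here.
        tags[tag] = int(value) if value_type == "i" else value
    return tags


def seq_span_in_polymerase_read(tags):
    """0-based half-open span of SEQ, undoing the 'plus 1' in XS and XE."""
    return tags["XS"] - 1, tags["XE"] - 1


# Invented record for a read named movie/zmw/YS_YE, not real blasr output.
sample = "\t".join([
    "movie/42/0_10", "0", "chr1", "21", "254", "10M", "*", "0", "0",
    "ACGTACGTAC", "*",
    "XS:i:1", "XE:i:11", "XL:i:10", "XQ:i:10", "XT:i:1",
    "YS:i:0", "YE:i:10", "ZM:i:42",
])
tags = parse_sam_tags(sample)
print(tags["ZM"], seq_span_in_polymerase_read(tags), (tags["YS"], tags["YE"]))
```

For real data a dedicated SAM/BAM library is usually the better choice; the point here is only to show where the blasr-specific tags sit in each record.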