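In prokaryotes, genes usually have specific and easily recognized promoter sequences (signals), such as the Pribnow box and transcription factor binding sites. In addition, the protein-coding sequence forms one contiguous open reading frame (content), typically several hundred to several thousand base pairs long; because stop codons occur frequently in non-coding sequence, an open reading frame of this length is in itself a fairly informative sign of a gene. Protein-coding regions in prokaryotes also have other statistical features that are easy to detect. Together, these make gene prediction in prokaryotes relatively accurate. Ab initio methods now reach an average accuracy above 90%, and their accuracy is mainly affected by the following: genomic islands of differing GC content, pseudogenes, and genes with programmed or artificial frameshifts.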
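To make the "content" signal concrete, here is a minimal sketch of an open reading frame scan over all six reading frames. It is only an illustration and not how any of the tools below work internally. The function names, the start codon set (ATG, GTG, TTG) and the 300 bp default threshold are assumptions chosen for this example.

```python
from typing import Iterator, List, Tuple

# Assumed codon sets for illustration (common bacterial starts, standard stops).
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def _scan_strand(seq: str, min_len: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) of ORFs on one strand, 0-based, end exclusive."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i  # keep the earliest start in this frame
            elif start is not None and codon in STOP_CODONS:
                end = i + 3  # include the stop codon
                if end - start >= min_len:
                    yield start, end
                start = None


def find_orfs(seq: str, min_len: int = 300) -> List[Tuple[int, int, str]]:
    """Return sorted (start, end, strand) tuples in forward-strand coordinates."""
    seq = seq.upper()
    n = len(seq)
    orfs = [(s, e, "+") for s, e in _scan_strand(seq, min_len)]
    # Map reverse-strand hits back onto forward-strand coordinates.
    orfs += [(n - e, n - s, "-")
             for s, e in _scan_strand(reverse_complement(seq), min_len)]
    return sorted(orfs)
```

A scan like this only reports candidate ORFs that have both a start and a stop codon. Real gene finders additionally score codon usage and upstream signals, which is what lets them handle the difficult cases listed above.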

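MetaGeneAnnotator is intended mainly for prokaryotes (bacteria and archaea) and can be used for both genome-based and metagenomic gene prediction. The web version accepts input with a total length of at most 10 M. It is recommended to download the software, unpack it, and run the following command in a terminal:
/path/to/mga/mga /path/to/sequences/[multi-fasta] <-m/-s>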
-m: multiple species (sequences are individually treated)
-s: single species (sequences are treated as a unit)
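MetaGeneMark predicts genes in bacteria and archaea (web version: http://exon.gatech.edu/metagenome/Prediction/). For usage, refer to the MetaGeneAnnotator instructions above and to the README in the unpacked archive.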
Example 1:
gmhmmp -m MetaGeneMark_v1.mod sequence.mfa
Predictions will be in file "sequence.mfa.lst" in default GeneMark.hmm format
Example 2:
gmhmmp -a -d -f G -m MetaGeneMark_v1.mod -o sequence.gff sequence.mfa
Predictions will be in file "sequence.gff" in GFF format with nucleotide and protein sequences for each predicted gene.
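Here sequence.gff is the output file and sequence.mfa is the input file.
These are two commonly used tools for gene prediction in metagenomes. Both are easy to use, but after predicting ORFs we still need some scripts to compute statistics and classify the results for downstream analysis.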
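As one possible starting point for such a script, the sketch below summarizes gene counts and lengths from a GFF file. It assumes the standard nine tab-separated GFF columns and that predicted genes are reported with the feature type "CDS"; the function name gff_gene_stats is hypothetical. Lines starting with "#", including any sequence blocks embedded as comments, are skipped.

```python
from typing import Dict, Iterable


def gff_gene_stats(lines: Iterable[str], feature: str = "CDS") -> Dict[str, float]:
    """Summarize predicted genes from GFF lines (assumes 9 tab-separated columns)."""
    lengths = []
    contigs = set()
    plus = minus = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip comments, headers and embedded sequence blocks
        cols = line.split("\t")
        if len(cols) < 9 or cols[2] != feature:
            continue  # not a gene record of the requested type
        start, end = int(cols[3]), int(cols[4])
        lengths.append(end - start + 1)  # GFF coordinates are 1-based, inclusive
        contigs.add(cols[0])
        if cols[6] == "+":
            plus += 1
        elif cols[6] == "-":
            minus += 1
    total = sum(lengths)
    return {
        "genes": len(lengths),
        "contigs": len(contigs),
        "total_bp": total,
        "mean_len": total / len(lengths) if lengths else 0.0,
        "min_len": min(lengths) if lengths else 0,
        "max_len": max(lengths) if lengths else 0,
        "plus_strand": plus,
        "minus_strand": minus,
    }
```

It could be applied to the output of Example 2 by opening sequence.gff and passing the file handle directly, for example `with open("sequence.gff") as fh: print(gff_gene_stats(fh))`.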
Prodigal (Prokaryotic Gene Prediction Program) (http://code.google.com/p/prodigal/downloads/list)
prodigal -a output_file -i input_file -m -o tmp.txt -p meta
Usage: prodigal [-a trans_file] [-c] [-d nuc_file] [-f output_type]
[-g tr_table] [-h] [-i input_file] [-m] [-n] [-o output_file]
[-p mode] [-q] [-s start_file] [-t training_file] [-v]
-a: Write protein translations to the selected file.
-c: Closed ends. Do not allow genes to run off edges.
-d: Write nucleotide sequences of genes to the selected file.
-f: Select output format (gbk, gff, or sco). Default is gbk.
-g: Specify a translation table to use (default 11).
-h: Print help menu and exit.
-i: Specify input file (default reads from stdin).
-m: Treat runs of n's as masked sequence and do not build genes across them.
-n: Bypass the Shine-Dalgarno trainer and force the program to scan for motifs.
-o: Specify output file (default writes to stdout).
-p: Select procedure (single or meta). Default is single.
-q: Run quietly (suppress normal stderr output).
-s: Write all potential genes (with scores) to the selected file.
-t: Write a training file (if none exists); otherwise, read and use the specified training file.
-v: Print version number and exit.
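The protein file written with -a can be post-processed with a short script. The sketch below assumes the FASTA header layout that Prodigal normally writes, with fields separated by " # " (sequence ID, start, end, strand as 1 or -1, then semicolon-separated key=value attributes such as partial=00). If your version writes headers differently, the parsing needs adjusting. The function name parse_prodigal_headers is hypothetical.

```python
from typing import Dict, Iterable, List


def parse_prodigal_headers(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Extract gene coordinates from Prodigal-style FASTA headers (assumed layout)."""
    genes = []
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence lines are ignored
        fields = line[1:].strip().split(" # ")
        if len(fields) < 5:
            continue  # header does not match the assumed layout
        gene_id, start, end, strand, attr_text = fields[:5]
        attrs = dict(item.split("=", 1)
                     for item in attr_text.split(";") if "=" in item)
        partial = attrs.get("partial", "")
        genes.append({
            "id": gene_id,
            "start": int(start),
            "end": int(end),
            "strand": "+" if strand == "1" else "-",
            "length": int(end) - int(start) + 1,
            "partial": partial,
            # "00" means both gene ends were found inside the sequence
            "complete": partial == "00",
            "attrs": attrs,
        })
    return genes
```

Separating complete genes from those that run off a contig edge is often useful on metagenomic assemblies, where many contigs are short.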
Predictions from prokaryotic gene annotation tools fall into categories that include: short genes, long genes, unique genes, dubious genes, broken genes, interrupted genes, and putative missed genes.