An Introduction to DNA Sequence Formats

admin 2024-12-18


1. Plain format

A sequence in plain format may contain only IUPAC characters and spaces (no numbers!).

Note: A file in plain sequence format may contain only one sequence, while most other formats accept several sequences in one file.

An example sequence in plain format is:

ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGACTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATTTTAATTACAGACCTGAA

In short: a plain-format sequence contains only IUPAC characters and spaces, never digits, and a plain-format file can hold only one sequence.
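The rule above can be checked mechanically. The following is a minimal sketch, not a reference implementation: the function name `read_plain` is made up for illustration, the allowed letters are taken from the IUPAC list in section 7, and it assumes that line breaks are tolerated in the same way as spaces.

```python
# IUPAC nucleotide codes from section 7 (upper case).
IUPAC_CODES = set("ACGTURYKMSWBDHVN")


def read_plain(text):
    """Return the sequence held in plain-format text, or raise ValueError.

    Hypothetical helper: whitespace (spaces and, by assumption, line
    breaks) is treated as layout and removed before validation.
    """
    seq = "".join(text.split())
    # Compare case-insensitively, but return the sequence as written.
    bad = sorted(set(seq.upper()) - IUPAC_CODES)
    if bad:
        raise ValueError(f"non-IUPAC characters in plain sequence: {bad}")
    return seq


print(read_plain("ACAAG ATGCC"))  # prints ACAAGATGCC
```

A digit anywhere in the input, for example a position counter copied over from another format, makes `read_plain` raise `ValueError`.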

2. EMBL format

A sequence file in EMBL format can contain several sequences.

One sequence entry starts with an identifier line (“ID”), followed by further annotation lines. The start of the sequence is marked by a line starting with “SQ” and the end of the sequence is marked by two slashes (“//”).

An example sequence in EMBL format is:

ID AB000263 standard; RNA; PRI; 368 BP.
XX
AC AB000263;
XX
DE Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ Sequence 368 BP;
acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg 60
ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg 120
caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc 180
aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag 240
gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga 300
agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca 360
gacctgaa 368
//

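In short: an EMBL-format file can contain several sequences. Each entry begins with an “ID” line followed by annotation lines; the sequence starts after the “SQ” line and ends at “//”.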
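The three markers (“ID”, “SQ”, “//”) are enough to pull the sequences out of such a file. The sketch below is illustrative only: `parse_embl` is a hypothetical name, the identifier is assumed to be the second token on the ID line, and all annotation lines other than ID and SQ are ignored.

```python
def parse_embl(text):
    """Return a list of (identifier, sequence) pairs from EMBL-format text."""
    records = []
    identifier, chunks, in_sequence = None, [], False
    for line in text.splitlines():
        if line.startswith("//"):
            # Two slashes close the current entry.
            records.append((identifier, "".join(chunks)))
            identifier, chunks, in_sequence = None, [], False
        elif in_sequence:
            # Keep the letters; drop spaces and the trailing position count.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ID"):
            # Assumption: the identifier is the second token on the ID line.
            identifier = line.split()[1].rstrip(";")
        elif line.startswith("SQ"):
            in_sequence = True
    return records


example = """ID AB000263 standard; RNA; PRI; 18 BP.
XX
SQ Sequence 18 BP;
acaagatgcc attgtccc 18
//
"""
print(parse_embl(example))  # prints [('AB000263', 'acaagatgccattgtccc')]
```

Because the state is reset at every “//”, the same loop handles files that hold several entries.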

3. FASTA format

A sequence file in FASTA format can contain several sequences.

Each sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line must begin with a greater-than (“>”) symbol in the first column.

An example sequence in FASTA format is:

>AB000263 |acc=AB000263|descr=Homo sapiens mRNA
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCCCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAACTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGTTTAATTACAGACCTGAA

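In short: a FASTA-format file can contain several sequences, and each sequence is preceded by one line that starts with “>” and carries its description.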
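Since every record is introduced by a “>” line, a reader only needs to watch for that character in the first column. A minimal sketch follows; `parse_fasta` is a name chosen here for illustration, and blank lines are skipped by assumption.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.rstrip()
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line and header is not None:
            # Sequence data may span several lines; join them later.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = ">AB000263 |acc=AB000263|descr=Homo sapiens mRNA\nACAAGATGCC\nATTGTCCC\n"
for description, sequence in parse_fasta(example):
    print(description.split()[0], len(sequence))  # prints AB000263 18
```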

4. GCG format

A sequence file in GCG format contains exactly one sequence. It begins with annotation lines, and the start of the sequence is marked by a line ending with two dots (“..”). This line also contains the sequence identifier, the sequence length and a checksum. This format should only be used if the file was created with the GCG package.

An example sequence in GCG format is:

ID AB000263 standard; RNA; PRI; 368 BP.
XX
AC AB000263;
XX
DE Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ Sequence 368 BP;
AB000263 Length: 368 Check: 4514 ..
1 acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
61 ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
121 caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
181 aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag
241 gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga
301 agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca
361 gacctgaa

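In short: a GCG-format file contains exactly one sequence. It opens with annotation lines, and the sequence starts after a line ending in “..”, which also carries the sequence identifier, the length and the checksum.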
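Reading such a file comes down to finding the line that ends in “..” and collecting the letters after it. The sketch below makes several assumptions: `parse_gcg` is a hypothetical name, the identifier is taken to be the first token of the “..” line (as in the example above), the first line ending in “..” is treated as the divider, and the checksum is not verified.

```python
def parse_gcg(text):
    """Return (identifier, sequence) from GCG-format text."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            # Assumption: the identifier is the first token on this line.
            identifier = line.split()[0]
            body = lines[index + 1:]
            break
    else:
        raise ValueError("no line ending in '..' found")
    # Sequence lines start with a position number; keep letters only.
    sequence = "".join(ch for row in body for ch in row if ch.isalpha())
    return identifier, sequence


example = """ID AB000263 standard; RNA; PRI; 18 BP.
AB000263 Length: 18 Check: 4514 ..
1 acaagatgcc attgtccc
"""
print(parse_gcg(example))  # prints ('AB000263', 'acaagatgccattgtccc')
```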

5. GenBank format

A sequence file in GenBank format can contain several sequences.

One sequence in GenBank format starts with a line containing the word LOCUS and a number of annotation lines. The start of the sequence is marked by a line containing “ORIGIN” and the end of the sequence is marked by two slashes (“//”).

An example sequence in GenBank format is:

LOCUS AB000263 368 bp mRNA linear PRI 05-FEB-1999
DEFINITION Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
ACCESSION AB000263
ORIGIN
1 acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
61 ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
121 caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
181 aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag
241 gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga
301 agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca
361 gacctgaa
//

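In short: a GenBank-format file can contain several sequences. Each entry begins with a “LOCUS” line followed by several annotation lines; the sequence starts after “ORIGIN” and ends at “//”.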
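One practical use of these markers is a consistency check: the LOCUS line declares a length, and the bases between ORIGIN and “//” can be counted against it. The following is only a sketch under stated assumptions: `genbank_lengths` is a made-up name, and the entry name and length are assumed to be the second and third tokens of the LOCUS line, as in the example above.

```python
def genbank_lengths(text):
    """Return (name, declared_bp, counted_bp) for each GenBank entry."""
    results = []
    name, declared, counted, in_origin = None, None, 0, False
    for line in text.splitlines():
        if line.startswith("//"):
            # Two slashes close the current entry.
            results.append((name, declared, counted))
            name, declared, counted, in_origin = None, None, 0, False
        elif in_origin:
            # Count letters only; leading position numbers are skipped.
            counted += sum(ch.isalpha() for ch in line)
        elif line.startswith("LOCUS"):
            # Assumption: "LOCUS <name> <length> bp ..." token order.
            fields = line.split()
            name, declared = fields[1], int(fields[2])
        elif line.startswith("ORIGIN"):
            in_origin = True
    return results


example = """LOCUS AB000263 18 bp mRNA linear PRI 05-FEB-1999
ACCESSION AB000263
ORIGIN
1 acaagatgcc attgtccc
//
"""
print(genbank_lengths(example))  # prints [('AB000263', 18, 18)]
```

An entry whose declared and counted lengths differ is a hint that the sequence block was truncated or mistyped.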

6. IG format

A sequence file in IG format can contain several sequences. Each consists of a number of comment lines that must begin with a semicolon (“;”), a line with the sequence name (it may not contain spaces!) and the sequence itself, terminated with the termination character “1” for linear or “2” for circular sequences.

An example sequence in IG format is:

; comment
; comment
AB000263
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGCTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGTTTAATTACAGACCTGAA1

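In short: an IG-format file can contain several sequences. Each entry begins with comment lines that start with “;”, followed by a line holding the sequence name and then the sequence itself. The sequence ends with the digit 1 if it is linear or 2 if it is circular; the digit marks the topology, not the position of the sequence in the file.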
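The structure described above (comments, name, sequence, terminator) maps onto a small state machine. This is a rough sketch rather than a full IG reader: `parse_ig` is a hypothetical name, records are returned as plain dictionaries, and the terminator is assumed to be the last character of the final sequence line.

```python
def parse_ig(text):
    """Return a list of records (dicts) parsed from IG-format text."""
    records = []
    comments, name, chunks = [], None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if name is None:
            # Before the name line: collect comments, then the name itself.
            if line.startswith(";"):
                comments.append(line[1:].strip())
            else:
                name = line
        elif line[-1] in "12":
            # Terminator reached: '1' means linear, '2' means circular.
            chunks.append(line[:-1])
            records.append({
                "name": name,
                "comments": comments,
                "topology": "linear" if line[-1] == "1" else "circular",
                "sequence": "".join(chunks),
            })
            comments, name, chunks = [], None, []
        else:
            chunks.append(line)
    return records


example = "; comment\n; comment\nAB000263\nACAAG\nATGCC1\n"
record = parse_ig(example)[0]
print(record["name"], record["topology"], record["sequence"])
# prints AB000263 linear ACAAGATGCC
```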

7. IUPAC characters

To represent ambiguity in DNA sequences the following letters can be used (following the rules of the International Union of Pure and Applied Chemistry (IUPAC)):

A = adenine
C = cytosine
G = guanine
T = thymine
U = uracil
R = G A (purine)
Y = T C (pyrimidine)
K = G T (keto)
M = A C (amino)
S = G C
W = A T
B = G T C
D = G A T
H = A C T
V = G C A
N = A G C T (any)
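The list above translates directly into a lookup table, which makes it easy to test whether a concrete sequence is compatible with an ambiguous one. The sketch below is illustrative: the names `IUPAC` and `matches` are chosen here, and comparison is position by position with no gaps.

```python
# Each IUPAC letter mapped to the bases it can stand for (from the list above).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}


def matches(pattern, sequence):
    """Return True if every base of sequence is allowed by pattern.

    pattern may contain ambiguity codes; sequence is expected to contain
    concrete bases. A letter in pattern that is not an IUPAC code raises
    KeyError.
    """
    if len(pattern) != len(sequence):
        return False
    return all(
        base in IUPAC[code]
        for code, base in zip(pattern.upper(), sequence.upper())
    )


print(matches("RYN", "ACG"))  # prints True  (A is a purine, C a pyrimidine)
print(matches("RY", "CA"))    # prints False (C is not a purine)
```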
