Bacterial genome annotation script using BLASTN

Ana Mariya Anhel, Lorea Alejaldre, Ángel Goñi-Moreno

Published: 2022-10-27 DOI: 10.17504/protocols.io.dm6gpjrb1gzp/v1

Abstract

This protocol uses a Python-based script and command-line blastn to annotate Sanger sequencing results from genome amplifications. Its main use in our lab (https://biocomputationlab.com) is to identify the location and gene locus of transposon insertions in the genome of Pseudomonas putida KT2440. However, the script can be used for any other bacterial genome whose sequence and annotation are available.

The script was developed in Python 3.9 with blastn version 2.2.18.

Before start

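To run this script, command-line blastn and Python 3 with the packages sys, pandas and os must be installed. Of these, sys and os are part of the Python standard library, so only pandas needs to be installed separately.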
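A quick way to confirm these prerequisites before running the script is sketched below. It is only an illustration: the helper name `check_prerequisites` is hypothetical and is not part of the published script.

```python
import importlib.util
import shutil


def check_prerequisites(executables=("blastn",), modules=("pandas",)):
    """Return a list of problems found; an empty list means everything is available."""
    problems = []
    for exe in executables:
        # shutil.which returns None when the executable is not on PATH
        if shutil.which(exe) is None:
            problems.append(f"executable not found on PATH: {exe}")
    for mod in modules:
        # find_spec returns None when a top-level module cannot be located
        # (sys and os are standard-library modules, so only pandas is checked by default)
        if importlib.util.find_spec(mod) is None:
            problems.append(f"Python package not installed: {mod}")
    return problems


if __name__ == "__main__":
    for line in check_prerequisites() or ["all prerequisites found"]:
        print(line)
```

If the list is empty, blastn is reachable from the shell and pandas can be imported by the interpreter that runs the check.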

Steps

Annotation of sequencing reads

1.

Download the genome file in FASTA format and the annotation file in .csv format for the microbial organism to be used as reference.

Note
Pseudomonas genome and annotation files can be found at https://www.pseudomonas.com.

2.

Run the following Python-based script with the required arguments:

#Command to run blastn annotation script 
python alignment_and_annotation_blastn.py [directory of sequencing reads] [type of file] [genome file in fasta format] [annotation file in csv format]
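For readers who want a feel for what such a script has to do internally, the sketch below assembles a database-building call and a blastn search as argument lists. It is not the authors' implementation. The function names, the file names and the idea of a single combined reads file are assumptions made for illustration. The options shown (`makeblastdb -in/-dbtype/-out`, `blastn -query/-db/-outfmt/-out`) are standard in recent BLAST+ releases, and the column list is inferred from the headers of the example output table in this protocol; whether the published script uses these exact options is not stated.

```python
import subprocess

# Tabular output columns; an assumption inferred from the example output table headers
OUTFMT = "6 qacc sstart pident length mismatch gapopen evalue bitscore sstrand"


def build_commands(reads_fasta, genome_fasta, db_name, hits_file):
    """Return the makeblastdb and blastn argument lists without running them."""
    make_db = ["makeblastdb", "-in", genome_fasta, "-dbtype", "nucl", "-out", db_name]
    search = ["blastn", "-query", reads_fasta, "-db", db_name,
              "-outfmt", OUTFMT, "-out", hits_file]
    return make_db, search


def run_commands(commands):
    """Run each command in order; check=True raises if a command fails."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file names, printed only so the commands can be inspected
    make_db, search = build_commands("reads.fasta", "genome.fna", "genome_db", "hits.tsv")
    print(" ".join(make_db))
    print(" ".join(search))
```

Keeping the commands as lists rather than a single shell string avoids quoting problems when paths contain spaces.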

Note
Updated versions of this script can be found in the Biocomp GitHub folder.

3.

Output is a folder named results_script_blast which contains three files:

  • all_seq_aligned.sam
  • all_seq_aligned.txt
  • table_reads_genes_description.csv
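After a run, the presence of these three files can be verified with a short check such as the one below. This is a sketch: the helper name `missing_outputs` is hypothetical, and it assumes the results folder is created in the directory from which the check is run.

```python
from pathlib import Path

# File names as listed in this step
EXPECTED_OUTPUTS = (
    "all_seq_aligned.sam",
    "all_seq_aligned.txt",
    "table_reads_genes_description.csv",
)


def missing_outputs(results_dir="results_script_blast"):
    """Return the expected output file names that are absent from results_dir."""
    folder = Path(results_dir)
    return [name for name in EXPECTED_OUTPUTS if not (folder / name).is_file()]


if __name__ == "__main__":
    missing = missing_outputs()
    print("all outputs present" if not missing else f"missing: {missing}")
```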

Example: Annotation of sequencing results from P. putida KT2440

4.

Input files

  1. Directory of sequencing reads: HC00517465.zip (attached here as a zip archive, but the script expects a directory, so it must be extracted first). In this case the type of file (extension) is txt.
  2. Genome in FASTA format: Pseudomonas_putida_KT2440_110.fna
  3. Annotation file of that genome: Pseudomonas_putida_KT2440_110.csv
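Since the reads for this example are provided as a zip archive while the script takes a directory, the archive needs to be unpacked before the run. The sketch below shows one way to do that and to assemble the call in the argument order given in step 2. The helper names `extract_reads` and `build_command` are hypothetical, and the layout inside the archive is not described in the protocol, so the extracted path should be checked before use.

```python
import zipfile
from pathlib import Path


def extract_reads(zip_path, target_dir=None):
    """Extract zip_path into a directory and return that directory as a Path."""
    zip_path = Path(zip_path)
    # Default target: same location and name as the archive, minus the .zip suffix
    target = Path(target_dir) if target_dir else zip_path.with_suffix("")
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target


def build_command(reads_dir, extension, genome_fasta, annotation_csv):
    """Assemble the script call: reads directory, file type, genome, annotation."""
    return ["python", "alignment_and_annotation_blastn.py",
            str(reads_dir), extension, genome_fasta, annotation_csv]
```

With the example inputs, `build_command("HC00517465", "txt", "Pseudomonas_putida_KT2440_110.fna", "Pseudomonas_putida_KT2440_110.csv")` yields the argument list for the run shown in the next step.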
5.

Command-line

Bash window where the command is executed (the database had already been created and the output directory already existed)
Bash window where the command is executed without a previously created database
6.

Output files

A new folder named results_script_blast (output files attached in the following zip file) contains a table with information about the alignment and genomic context of each sequencing read.

results_script_blast.zip

| query acc. | s. start | % identity | alignment length | mismatches | gap opens | evalue | bit score | subject strand | Locus Tag | Feature Type | Start | End | Strand | Gene Name | Product Name | Subcellular Localization [Confidence Class] | Multiple Allignments | Rest of Locus Tag Associated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| H220707-054_B23_219DZAA034_premix.ab1 | 6170239 | 99.043 | 209 | 1 | 1 | 1.86e-103 | 374 | plus | PP_5408 | CDS | 61691130 | 61702940 | - | - | hypothetical protein | Cytoplasmic [Class 3] | FALSE | - |
| H220707-054_P21_219DZAA035_premix.ab1 | 6170230 | 98.618 | 217 | 1 | 2 | 2.98e-106 | 383 | plus | PP_5408 | CDS | 61691130 | 61702940 | - | - | hypothetical protein | Cytoplasmic [Class 3] | FALSE | - |
| H220707-054_L21_219DZAA036_premix.ab1 | 6170230 | 99.539 | 217 | 0 | 1 | 1.49e-109 | 394 | plus | PP_5408 | CDS | 61691130 | 61702940 | - | - | hypothetical protein | Cytoplasmic [Class 3] | FALSE | - |
| H220707-054_F19_219DZAA037_premix.ab1 | 6170230 | 99.083 | 218 | 1 | 1 | 1.97e-108 | 390 | plus | PP_5408 | CDS | 61691130 | 61702940 | - | - | hypothetical protein | Cytoplasmic [Class 3] | FALSE | - |
| H220707-054_H21_219DZAA038_premix.ab1 | 6170546 | 97.222 | 108 | 1 | 2 | 1.3e-45 | 182 | minus | PP_5409 | CDS | 61704660 | 61723010 | - | glmS | L-glutamine/D-fructose-6-phosphate aminotransferase | - | FALSE | - |
| H220707-054_P19_219DZAA039_premix.ab1 | 6170546 | 98.148 | 108 | 0 | 2 | 2.8e-47 | 187 | minus | PP_5409 | CDS | 61704660 | 61723010 | - | glmS | L-glutamine/D-fructose-6-phosphate aminotransferase | - | FALSE | - |
| H220707-054_L19_219DZAA040_premix.ab1 | 6170547 | 96.33 | 109 | 1 | 3 | 6.39e-44 | 176 | minus | PP_5409 | CDS | 61704660 | 61723010 | - | glmS | L-glutamine/D-fructose-6-phosphate aminotransferase | - | FALSE | - |
| H220707-054_N21_219DZAA041_premix.ab1 | 6170546 | 99.074 | 108 | 0 | 1 | 5.68e-49 | 193 | minus | PP_5409 | CDS | 61704660 | 61723010 | - | glmS | L-glutamine/D-fructose-6-phosphate aminotransferase | - | FALSE | - |
| H220707-054_J19_219DZAA046_premix.ab1 | 6170533 | 96.842 | 95 | 0 | 3 | 9.22e-38 | 156 | minus | PP_5409 | CDS | 61704660 | 61723010 | - | glmS | L-glutamine/D-fructose-6-phosphate aminotransferase | - | FALSE | - |
| H220707-054_D21_219DZAA047_premix.ab1 | 6170239 | 99.517 | 207 | 1 | 0 | 1.49e-104 | 377 | plus | PP_5408 | CDS | 61691130 | 61702940 | - | - | hypothetical protein | Cytoplasmic [Class 3] | FALSE | - |

Final table of the alignment with the corresponding gene or locus of insertion
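A table like this is often easier to interpret once the reads are grouped by locus. The sketch below counts reads per locus tag using only the standard library. The function name `reads_per_locus` is hypothetical, and it assumes the output file is comma-separated with a column named `Locus Tag`, as in the table header above.

```python
import csv
import io
from collections import Counter


def reads_per_locus(handle, locus_column="Locus Tag"):
    """Count how many reads were assigned to each locus tag in an open CSV handle."""
    reader = csv.DictReader(handle)
    return Counter(row[locus_column] for row in reader)


if __name__ == "__main__":
    # Three-row excerpt taken from the example table (only two of its columns)
    sample = (
        "query acc.,Locus Tag\n"
        "H220707-054_B23_219DZAA034_premix.ab1,PP_5408\n"
        "H220707-054_P21_219DZAA035_premix.ab1,PP_5408\n"
        "H220707-054_H21_219DZAA038_premix.ab1,PP_5409\n"
    )
    print(reads_per_locus(io.StringIO(sample)).most_common())
    # [('PP_5408', 2), ('PP_5409', 1)]
```

For the real output, the same function can be called on the file opened with `open("results_script_blast/table_reads_genes_description.csv", newline="")`.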
