Roadmap to the bioinformatic study of gene and protein evolution

florian.jacques, Paulina Bolivar, Kristian Pietras

Published: 2022-02-24 DOI: 10.17504/protocols.io.b4s6qwhe

Abstract

We present a compilation of nucleic acid and protein databases and bioinformatic tools for phylogenetic reconstructions and a wide range of studies on molecular evolution and population genetics. We provide a protocol for information extraction from biological databases and simple phylogenetic reconstruction.

Steps

Sequence selection and comparisons

1.

State of the art on genes and proteins

Evolutionary analyses on molecular data, (genes, genomes or proteins), generally require retrieving information from public biological databases. Several dozens of databases store information about the state of the art on genes and proteins. In particular, many of them provide the sequences in fasta format, which is necessary for evolutionary studies. Other information and annotations, including structure, activity, biological function, tissue expression, subcellular location and polymorphism can also prove relevant.

Here is a non-exhaustive list of nucleic acid databases, and a list of the main protein databases, with their main features.

Retrieve the sequence from one of the following databases (e.g. NCBI or Uniprot) and paste the sequence in a fasta file using fasta format, using .fasta as filename extension.

Fasta format includes a headline starting with ">", and the nucleic acid or amino acid sequence.

For example:

sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4 MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP DEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAK SVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHE RCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNS SCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELP PGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPG GSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD

ABC
DatabaseFeaturesLink
BARDatabase of plant genes and proteinshttp://bar.utoronto.ca/
BgeeGene expression patternshttps://bgee.org/
EnsemblGenome browser of vertebrates, includes tools for identification of homologyhttps://www.ensembl.org/index.html
EntrezGene sequences and structureshttps://www.ncbi.nlm.nih.gov/Web/Search/entrezfs.html
FigTreeVizualization tool for phylogenetic treeshttps://github.com/rambaut/figtree/releases
FlyBaseGenome and proteins of the model insect D. melanogasterhttps://flybase.org/
GenBankAnnotated DNA sequenceshttps://www.ncbi.nlm.nih.gov/genbank/
PomBaseGenes and proteins of the model yeast S. pombehttps://www.pombase.org/
TAIRGenome and proteins of the model plant A. thalianahttps://www.arabidopsis.org/
WormBaseGenome and proteins of the model nematode C. eleganshttps://wormbase.org//#012-34-5
XenbaseGenome and proteins of the model amphibian X. laevishttp://www.xenbase.org/entry/

List of the main nucleic acid databases

ABC
DatabaseFeaturesLink
Gene ontologyUnified annotation of molecular function, biological processes, and cellular components of proteinshttp://geneontology.org/
InterproClassification of proteins domains and functional siteshttps://www.ebi.ac.uk/interpro/
KEGGProtein function and biological pathwayshttps://www.genome.jp/kegg/
PDB3-dimensional structures of proteinshttps://www.rcsb.org/
PfamInformation about protein families and domains, includes tools for identification of homologyhttp://pfam.xfam.org/
PharosCentralizes literature for human proteinshttps://pharos.nih.gov/
PrintsProtein fingerprints classification databasehttp://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
PrositeProtein family databasehttps://prosite.expasy.org/
ScopClassification of protein based on their structurehttps://scop.mrc-lmb.cam.ac.uk/
SuperfamilyProtein structure and functionshttps://supfam.org/
The human protein atlasInformation on human protein and their link with diseaseshttps://www.proteinatlas.org/
UniprotGeneral information on proteinshttps://www.uniprot.org/uniprot/P04637

List of the main protein databases

1.1.

Protein classification

Using protein classification can also be relevant for evolutionary studies. Proteins are classified into different categories based on structural and functional similarity, evolutionary relationship, or both. Retrieving the classification of a protein of interest, and identifying the main protein domains, often provides valuable insight on its biodiversity and evolutionary origin. Several classification systems are published and listed in the table above.

Use a classification system (e.g. Pfam or Interpro) to identify the main domains of a protein. Pfam presents also their occurrence in living organisms as a sunburst plot.

This figure displays the diversity of the domain P53 that can be retrieved from Pfam. It shows that this domain is present in virtually all animals, and some of their close relatives, such as choanoflagellates, and suggests that this domain appeared before the divergence between animals and these protists.

Sunburst plot of the distribution of the p53 protein domain (PF00870) in living organisms according to Pfam.
Sunburst plot of the distribution of the p53 protein domain (PF00870) in living organisms according to Pfam.
2.

Identification of homologues

Studying the evolution of a family of genes or proteins requires the identification of homologues, i.e. , genes or protein with shared ancestry. Homologues include orthologues, that are present in different species, and paralogues, that are present in the same genome. Bioinformatic tools can be used to identify gene or protein homology based on sequence similarity, in the genomes of any species (see the list below).

Use the sequence of your gene of interest under fasta format to identify homologues in the genomes of other species using BLAST, for example. Paste the sequences of the homologues in a single fasta file and select the species or groups of species of interest (e.g. all animals and choanoflagellates). Download the sequences of the homologues covering the diversity of animals and paste them under fasta format in another fasta file.

Here is a list of tools that can be used to identify sequence homologues:

ABC
DatabaseFeaturesLink
BLASTProtein or DNA homology searchhttps://blast.ncbi.nlm.nih.gov/Blast.cgi
BlatSequence searchhttps://genome.ucsc.edu/cgi-bin/hgBlat
EnsemblDNA homology searchhttps://m.ensembl.org/index.html
FASTASequence search and alignmenthttps://www.ebi.ac.uk/Tools/sss/fasta/
HMMERProtein homology searchhttp://hmmer.org/
ShahaSequence search and alignmenthttps://www.sanger.ac.uk/tool/ssaha/

List of bioinformatic tools for identification of gene and protein homologues

3.

Sequence alignment

Phylogenetic analysis requires identification of homologue bases or amino acid residues of the different sequences. Homology is inferred by a sequence alignment. Sequences of different genes or proteins are put in every row one after the other to arrange every homologous base or amino acid in the same column.

Here is a list of tools for nucleic acid or amino acid sequence alignment. Use an appropriate tool (nucleic acid or amino acid) to generate an alignment of the sequences. Insertions and deletions are indicated by gaps "-" added to the sequences.

It is sometimes recommended to check the alignment and, when necessary, to improve it manually or using GBlocks. Then, export the alignment using fasta format.

ABC
SoftwareFeaturesLink
BAli-PhyBayesian alignment and phylogenetic analysishttp://www.bali-phy.org
ClustalOmegaDNA or amino acid sequence alignmenthttps://www.ebi.ac.uk/Tools/msa/clustalo/
ClustalWDNA or amino acid sequence alignmenthttps://www.genome.jp/tools-bin/clustalw
GBlocksDNA or amino acid alignmenthttp://molevol.cmima.csic.es/castresana/Gblocks_server.html
MAFFTDNA or amino acid sequence alignmenthttps://mafft.cbrc.jp/alignment/server/
MuscleDNA or amino acid sequence alignmenthttps://www.ebi.ac.uk/Tools/msa/muscle/
PrankDNA or amino acid sequence alignmenthttp://wasabiapp.org/software/prank/
ProbconsAmino acid sequence alignmenthttp://probcons.stanford.edu/
T-CoffeeSequence alignmenthttp://tcoffee.crg.cat/

List of programs for sequence alignment

Phylogenetic analysis

4.

Phylogenetic inference

The evolutionary history of gene, protein or species is generally presented as a phylogenetic tree, a graphical illustration of the evolutionary relationships between the studied taxa. Several methods for phylogenetic inference exist: Maximum Parsimony, the distance-methods and the probabilistic methods. Choose an appropriate method to reconstruct the phylogenetic tree of the genes/proteins of interest. The probabilis| A | | --- | | |

e most widely used for molecular data, but many studies use other methods, distance methods in complement.

Choose one or several of these methods to reconstruct the evolutionary history of your gene, protein or species of interest.

He| A | | --- | | |

reconstruction using diverse methods, and visualization of phylogenetic trees that can be used in complement. For beginners, we recommend using SeaView or Mega with ML analyses, which are very user-friendly and include several tools for sequence alignment, phylogenetic inference including maximum parsimony, distance methods and probabilistic methods, and a tree visualization tool. Advanced users generally prefer the more complex PhyML or RaxML.

ABC
SoftwareFeaturesLink
ETEToolkitVisualization and analysis of phylogenetic treeshttp://etetoolkit.org/
FigTreeGraphic software for phylogenetic treeshttp://tree.bio.ed.ac.uk/software/figtree/
ITOLVisualization and annotation of phylogenetic treeshttps://itol.embl.de/
MegaSequence alignment, model selection, phylogenetic analysis (parsimony, distance methods)https://www.megasoftware.net/
MrBayesBayesian phylogenetic inference, ancestral states reconstruction, phylogenetic calibration…http://nbisweden.github.io/MrBayes/
PAMLMaximum likelihood phylogenetic inference, estimation of selection strength, ancestral states reconstruction and other analyseshttp://abacus.gene.ucl.ac.uk/software/paml.html
PAUPPhylogenetic analyseshttp://paup.phylosolutions.com/
PhyMLMaximum likelihood phylogenetic inference, ancestral states reconstruction and other analyseshttp://atgc.lirmm.fr/phyml/
RaxMLMaximum likelihood phylogenetic inferencehttps://cme.h-its.org/exelixis/web/software/raxml/
SeaViewSequence alignment and phylogenetic analysis (parsimony, NJ, ML)http://doua.prabi.fr/software/seaview

List of programs and databases for phylogenetic analysis using diverse methods

A
4.1.

Option 1: Maximum parsimony

Maximum parsimony is a classic and simple method, that calculates the minimum number of evolutionary steps, including nucleotide insertions, deletions or substitutions, between species.

However, this method ignores hidden mutations and does not take into account branch lengths, potentially leading to long branch attraction, which is an incorrect clustering of unrelated taxa. Furthermore, it does not consider the possibility of hidden mutations, making it not relevant for distant taxa. While maximum parsimony is still used for morphological data, it is no longer considered relevant for molecular data.

4.2.

Option 2: Distance methods

The Unweighted Pair Group Method with Arithmetic mean (UPGMA), Neighbor Joining (NJ), and Minimum Evolution (ME) are all based on the overall molecular distance, defined by the number of differences between the sequences.

However, this method also ignores hidden mutations and is also subject to long branch attraction. These methods, in particular NJ and ME, are still often used in complement to probabilistic methods in scientific studies.

4.3.

Option 3: Probabilistic methods

The strength of probabilistic methods is the use of specified models of molecular evolution. Probabilistic methods take into account different mutation rates between sites to avoid mutation saturation. Nowadays, almost all studies of phylogenetic reconstruction use probabilistic methods.

They include Maximum Likelihood (ML) and Bayesian inference (BI). ML calculates the probability of observing the data (in this case, the sequence alignment) under different explicit models of molecular evolution. ML aims to identify the best fit model by exploring multiple combinations of model parameters. BI evaluates the probability of each model of molecular evolution given the data. We recommend using different methods of phylogenetic reconstruction, including ML and BI.

In our example, we used Maximum likelihood to build the phylogenetic tree of the P53 family (including the p73, p63 and p53 proteins of different animal species), and FigTree for a graphical representation of the phylogenetic tree.

Phylogenetic tree of p53 domain-containing proteins of metazoans using maximum likelihood.
Phylogenetic tree of p53 domain-containing proteins of metazoans using maximum likelihood.
4.4.

Model prediction

Probabilistic methods require a selection of the best model of molecular evolution for the data. Several models of nucleotide evolution exist, differing in their number of parameters like mutation rates between sites. Models of protein evolution also exist. A few programs can be used to calculate a criterion (AIC or BIC) to evaluate the likelihood of different published model for the alignment. Use one of them to select the best fit model for your alignment, which is the model with the lowest AIC or BIC criterion.

If you are using probabilistic methods, use the aligned sequences and a model prediction tool to determine the best model of nucleotide or aminoacid substitution.

Here is a list of tools for selection of models of molecular evolution:

Model selectors are also implemented in software for evolutionary analysis such as Mega.

ABC
SoftwareFeaturesLink
ModelTest / JModelTestNucleotide substitution model selectionhttp://evomics.org/resources/software/molecular-evolution-software/modeltest/
ProtTestAminoacid substitution model selectionhttps://github.com/ddarriba/prottest3
SMSNucleotide or aminoacid model selection included in PhyMLhttp://www.atgc-montpellier.fr/sms/

List of programs for molecular evolution model selection

4.5.

Tree rooting

Phylogenetic trees can be unrooted or rooted. Rooting a tree consists in identifying ancestral and derived states. This is useful to study the direction of the evolution and interpret the evolution of the studied taxa.

Diverse methods have been developed to root phylogenetic trees:

-Including outgroups (taxa that do not belong to the studied ingroup but are closely related) in the analysis. It is recommended to choose two outgroups, one being more closely related to the ingroup than to the other outgroup. If outgroups are available, this method can be preferred.

-Midpoint rooting, which corresponds to setting the root at the mid-point of the longest branches.

-Molecular clock rooting, a method that assumes that evolution speed is constant between the sequences. Since evolution speed can vary lot between lineages, this method should be restricted to close species.

Working with phylogenies

5.

Reconstruct the evolution of the gene or protein

Sequence alignments and phylogenetic trees can be used to reconstruct diverse aspects of the evolutionary history of genes, proteins and species, as well as the study the genetic structures within populations. In this last section, we provide a brief and non-exhaustive overview of the main evolutionary studies that can be performed using bioinformatic tools.

5.1.

Reconstitution of ancestral states

Retracing the functional evolution of genes, proteins, or biological traits often requires the reconstitution of ancestral states. Ancestral states can be inferred from a phylogenetic tree using maximum of parsimony, maximum likelihood, or Bayesian inference; and requires the aligned sequences and the model of molecular evolution that has been used for the phylogenetic analysis (when using probabilistic methods only).

ABC
SoftwareFeaturesLink
BayesTraitsEvolutionary analyses using Bayesian inferencehttp://www.evolution.rdg.ac.uk/BayesTraitsV3.0.5/BayesTraitsV3.0.5.html
BEASTPhylogenetic calibrationhttp://www.beast.community
MegaSequence alignment, model selection, phylogenetic analysis (parsimony, distance methods)https://www.megasoftware.net/
MesquiteComparative analyses and statisticshttp://www.mesquiteproject.org/
MrBayesBayesian phylogenetic inference, ancestral states reconstruction, phylogenetic calibration…http://nbisweden.github.io/MrBayes/
PAMLMaximum likelihood phylogenetic inference, estimation of selection strength, ancestral states reconstruction and other analyseshttp://abacus.gene.ucl.ac.uk/software/paml.html
RASPAncestral states reconstructionhttp://mnh.scu.edu.cn/soft/blog/RASP/index.html

List of programs and databases for ancestral states reconstruction

A
5.2.

Measure of selection strength

The strength of selection on protein coding genes can be calculated by evaluating the ration of the number of non-synonymous mutations (mutations changing the protein sequence) per non-synonymous site (dN) and the number of synonymous mutations (mutations with no effect on the protein sequence due to the redundancy of the genetic code) per synonymous site (dS). The ratio dN/dS reveals, if >1, that non-synonymous mutations are higher than expected and the gene is under positive selection. If dN/dS<1, the gene is under purifying selection and if dN/dS=1, the selection is neutral. Selection strength can be calculated by sequence alignment programs such as Mega.

5.3.

Phylogenetic calibration

Phylogenetic calibration consists in estimating the age of speciation events with events with a known age, i.e., using fossil and other geological data (that can only give minimal ages) as calibration points. Alternatively, mutation rates can be used to calculate the divergence time between two sequences.

Databases such as Timetree, also implemented in Mega, compute the estimated divergence time between species. Mesquite also provides tools to calibrate phylogenetic trees in geological times using fossil data. Ohnologs can also be used to estimate the divergence time between homologues resulting from whole genome duplications in vertebrates.

ABC
SoftwareFeaturesLink
BayesTraitsEvolutionary analyses using Bayesian inferencehttp://www.evolution.rdg.ac.uk/BayesTraitsV3.0.5/BayesTraitsV3.0.5.html
BEASTPhylogenetic calibrationhttp://www.beast.community
MegaSequence alignment, model selection, phylogenetic analysis (parsimony, distance methods)https://www.megasoftware.net/
MesquiteComparative analyses and statisticshttp://www.mesquiteproject.org/
MrBayesBayesian phylogenetic inference, ancestral states reconstruction, phylogenetic calibration…http://nbisweden.github.io/MrBayes/
OhnologsDatabase of vertebrate ohnologues, resulting from whole genome duplicationshttp://ohnologs.curie.fr/
TimetreeTree calibrationhttp://www.timetree.org/

List of programs and databases that can be used for phylogenetic calibration using diverse methods

5.4.

Study of coevolution

Evolutionary ecology and parasitology often study the evolution of hosts and parasites association through an approach of coevolution, or reciprocal genetic change between different species. Coevolution can also be used to study associations of genes or proteins. Studying coevolution consists in identifying evolutionary events such as co-speciation, host change, and duplication or loss of interaction from the phylogenetic trees of the hosts and the parasites.

In our second example, we used Maximum likelihood to reconstruct the evolution of cyclins and CDKs, two families of proteins involved in cell cycle control and closely interacting. We used Jane and TreeMap to reconstruct coevolutionary scenarios between the two families.

In both figures, the clustering of cyclins and CDKs indicate an interaction (in this case, that the cyclin can bind the CDK). Red spots indicate significant events of coevolution between the two families of proteins. Co-speciation (hollow red circle), duplications (solid red circle), duplications with host switch (yellow circle), loss of interaction (dashed lines), failures to diverge (jagged lines) are indicated on the cophylogeny.

ABC
SoftwareFeaturesLink
CopycatSoftware for studying coevolutionhttp://www.cophylogenetics.com/
Core-PASoftware for studying coevolutionhttp://pacosy.informatik.uni-leipzig.de/49-1-CoRe-PA.html
JaneSoftware for studying coevolutionhttps://www.cs.hmc.edu/~hadas/jane/
TreeMapSoftware for studying co-evolutionhttps://sites.google.com/site/cophylogeny/treemap

List of programs to study coevolution

Two co-evolutionary scenarios of the associations between, and co-evolution of, human cyclins and CDKs
Two co-evolutionary scenarios of the associations between, and co-evolution of, human cyclins and CDKs
5.5.

Phylogenetic comparative analysis

Evolutionary biology often employs the so-called phylogenetic comparative methods to study the adaptive significance of biological traits. These methods aim at identifying biological characters, in terms of morphology, physiology, or ecology, that result from a shared ancestry. Comparative analyses can be done for quantitative or qualitative variables. A very appropriate tool for comparative analysis and to compute statistics on phylogenetic trees is Mesquite.´BayesTraits can also be used.

5.6.

Genome evolution

Evolutionary events, such as mutations, insertions, deletions, gene or whole genome duplications, genome reorganization, and genetic exchanges can be identified using phylogenetic trees in complement with genomics tools and databases. Here is a list of databases tools to study diverse aspects of genome evolution, including genome browsers of diverse lineages and tools for comparative genomics and evolutionary genomics:

ABC
SoftwareFeaturesLink
BARDatabase of plant genes and proteinshttp://bar.utoronto.ca/
CAFEGene family evolutionhttps://github.com/hahnlab/CAFE5
CoGEComparative genomics analyseshttps://genomevolution.org/coge/
EnsemblGenome browser of vertebrates, includes tools for identification of homologyhttps://www.ensembl.org/index.html
EntrezGene sequences and structureshttps://www.ncbi.nlm.nih.gov/Web/Search/entrezfs.html
FlyBaseGenome and proteins of the model insect D. melanogasterhttps://flybase.org/
GenBankAnnotated DNA sequenceshttps://www.ncbi.nlm.nih.gov/genbank/
HGT-FinderHorizontal gene transfer findinghttp://cys.bios.niu.edu/HGTFinder/ HGTFinder.tar.gz
OhnologsDatabase of vertebrate ohnologues, resulting from whole genome duplicationshttp://ohnologs.curie.fr/
PomBaseGenes and proteins of the model yeast S. pombehttps://www.pombase.org/
TAIRGenome and proteins of the model plant A. thalianahttps://www.arabidopsis.org/
WormBaseGenome and proteins of the model nematode C. eleganshttps://wormbase.org//#012-34-5
XenbaseGenome and proteins of the model amphibian X. laevishttp://www.xenbase.org/entry/

List of programs and databases to study genome evolution

5.7.

Population genetics

Genetic diversity can also be explored at the population level by analyzing polymorphism between members of the same species. Gene polymorphisms studies allele diversity within a population, including single nucleotide polymorphisms (SNP), indels, microsatellites or transposable elements. Mathematical models have been developed to describe polymorphism. Several programs are suitable for population genetics studies.

ABC
SoftwareFeaturesLink
ArlequinPopulation genetics analyseshttp://cmpg.unibe.ch/software/arlequin35/
DNAspAnalysis of DNA polymorphismhttp://www.ub.edu/dnasp/
GenepopPopulation genetics analyseshttps://genepop.curtin.edu.au/
SNiplaySNP detection and other population genetics analyseshttps://sniplay.southgreen.fr/cgi-bin/home.cgi

List of programs and databases for population genetics

5.8.

Study of protein structure and function evolution

Studying the functional evolution of proteins can require structure alignments, that can be realized by appropriate programs such as PyMol, and the mean distance in Å between homologous residues can be calculated .

Other programs allow to identify and study structures conservation between proteins and infer a structures from a protein sequence. These analyses and a phylogenetic tree of the protein families, together with the reconstruction of ancestral states, can facilitate the study of the evolution of protein functions within the family.

Here is a list of tools that can be used for analyses on protein structures in an evolutionary framework:

ABC
SoftwareFeaturesLink
PyMol3D visualization of moleculeshttp://www.mesquiteproject.org/
I-TasserProtein structure predictionhttps://zhanglab.dcmb.med.umich.edu/I-TASSER/
ForsaProtein structure predictionhttp://www.bo-protscience.fr/forsa/
HHPredProtein structure predictionhttps://toolkit.tuebingen.mpg.de/tools/hhpred

List of programs for protein structure analyses

推荐阅读

Nature Protocols
Protocols IO
Current Protocols
扫码咨询