Roadmap to the bioinformatic study of gene and protein phylogeny and evolution - a practical guide
florian.jacques, Paulina Bolivar
Evolution
bioinformatics
phylogenetic analysis
evolutionary studies
molecular evolution
Phylogenetic inference
Abstract
We present a compilation of nucleic acid and protein databases and bioinformatic tools for phylogenetic reconstructions and a wide range of studies on molecular evolution and population genetics. We provide a protocol for information extraction from biological databases and simple phylogenetic reconstruction.
Steps
Sequence collection and comparison
Collecting sequence data and information on genes and proteins
Evolutionary analyses on molecular data (genes, genomes or proteins), require retrieving information from public databases. Several dozens of databases store information about the state of the art on genes and proteins. Most of them provide the sequences in fasta format, which is necessary for evolutionary studies. Other information and annotations, including structure, activity, biological function, tissue expression, subcellular location and polymorphism can also prove relevant.
Retrieve the sequence from one of the following databases (e.g. NCBI or Uniprot ) and paste the sequence in a fasta file using fasta format, using .fasta as filename extension.
Fasta format includes a headline starting with ">", and the nucleic acid or amino acid sequence.
For example, in the case of the human P53 protein:
sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4 MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP DEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAK SVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHE RCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNS SCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELP PGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPG GSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD
Here is a non-exhaustive list of nucleic acid databases, and a list of protein databases, with their main features.
A | B | C |
---|---|---|
Database | Features | Link |
BAR | Database of plant genes and proteins | http://bar.utoronto.ca/ |
Bgee | Gene expression patterns | https://bgee.org/ |
Ensembl | Genome browser of vertebrates, includes tools for identification of homology | https://www.ensembl.org/index.html |
Entrez | Gene sequences and structures | https://www.ncbi.nlm.nih.gov/Web/Search/entrezfs.html |
FlyBase | Genome and proteins of the model insect D. melanogaster | https://flybase.org/ |
GeneCards | Human gene function, genomics, transcription factor binding sites and protein products | https://www.genecards.org/ |
GenBank | Annotated DNA sequences | https://www.ncbi.nlm.nih.gov/genbank/ |
NCBI | Collection of databases for molecular biology and medicine providing many bioinformatics tools and services | https://www.ncbi.nlm.nih.gov/ |
PomBase | Genes and proteins of the model yeast S. pombe | https://www.pombase.org/ |
TAIR | Genome and proteins of the model plant A. thaliana | https://www.arabidopsis.org/ |
WormBase | Genome and proteins of the model nematode C. elegans | https://wormbase.org//#012-34-5 |
Xenbase | Genome and proteins of the model amphibian X. laevis | http://www.xenbase.org/entry/ |
List of nucleic acid databases
A | B | C |
---|---|---|
Database | Features | Link |
Gene Ontology | Unified annotation of molecular function, biological processes, and cellular components of proteins | http://geneontology.org/ |
Human Protein Atlas | Information on human protein and their link with diseases | https://www.proteinatlas.org/ |
InterPro | Classification of proteins domains and functional sites | https://www.ebi.ac.uk/interpro/ |
KEGG | Protein function and biological pathways | https://www.genome.jp/kegg/ |
PDB | 3-dimensional structures of proteins | https://www.rcsb.org/ |
Pfam | Information about protein families and domains, includes tools for identification of homology | http://pfam.xfam.org/ |
PHAROS | Centralizes literature for human proteins | https://pharos.nih.gov/ |
PRINTS | Protein fingerprints classification database | http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/ |
PROSITE | Protein family, domains and functional sites | https://prosite.expasy.org/ |
SCOP | Structure-based classification of protein | https://scop.mrc-lmb.cam.ac.uk/ |
SUPERFAMILY | Protein structure and functions | https://supfam.org/ |
UniProt | General information on proteins, including sequences, structure, classification, function, subcellular localization and simple homology identification | https://www.uniprot.org/uniprot/ |
List of protein databases
Protein domains classification (optional)
Studying protein classification can also be useful for evolutionary studies. Proteins are classified into different categories based on structural and functional similarity, evolutionary relationship, or both. Retrieving the classification of a protein of interest and identifying the main protein domains often provides valuable insight on its diversity and evolutionary origin. Several classification systems are published and listed in the table above.
Use a classification system (e.g. Pfam or Interpro ) to identify the main domains of a protein. Pfam presents also their occurrence in living organisms as a sunburst plot.
This figure displays the diversity of the domain P53 and can be retrieved from Pfam. This domain is present in virtually all animals, and some of their close relatives, such as choanoflagellates, and suggests that it appeared before the divergence between animals and these protists.

Identification of homologues
Studying the evolution of a family of genes or proteins requires the identification of homologues, i.e. , genes or protein with shared ancestry. Homologues include orthologues, that are present in different species, and paralogues, that are present in the same genome. Bioinformatic tools can be used to identify gene or protein homology based on sequence similarity, in the genomes of any species (see the list below). Here is a list of tools that can be used to identify sequence homology.
A | B | C |
---|---|---|
BLAST | Gene or protein homology search from NCBI | https://blast.ncbi.nlm.nih.gov/Blast.cgi |
BLAT | Sequence homology search in animal genomes | https://genome.ucsc.edu/cgi-bin/hgBlat |
Ensembl | Genome browser of vertebrates, includes tools for identification of homology | https://m.ensembl.org/index.html |
FASTA | Sequence search against protein databases and sequence alignment | https://www.ebi.ac.uk/Tools/sss/fasta/ |
HMMER | Gene and protein homology search | http://hmmer.org/ |
Pfam | Protein families and domains, includes tools for identification of homology | http://pfam.xfam.org/ |
SSAHA | Gene sequence search and alignment | https://www.sanger.ac.uk/tool/ssaha/ |
UniProt | General information on proteins, including simple homology identification | https://www.uniprot.org/uniprot/ |
List of bioinformatic tools for identification of gene and protein homologues
Using BLAST , paste the sequence of your gene of interest (in our example, human TP53) under fasta format to identify homologues in the genomes of select other species covering the diversity of animals (e.g. all animals and choanoflagellates), for example.
Select the sequences the low E.value and significant homology (typically >30% identity) for a wide range of animal species and download them under fasta format in another fasta file.
In our example, we are studying the evolution of TP53 in animals. It is relevant to select the 3 paralogues (TP53, TP63 and TP73) of a selection of species covering the diversity of animals and their relatives. In our example, we chose the cnidarian Hydra vulgaris , the fruit fly Drosophila melanogaster, the urochordate Ciona intestinalis, and the teleost fish Danio rerio, the sarcopterygian Latimeria chalumnae, the amphibian Xenopus tropicalis, the reptile Anolis carolinensis, the bird Gallus gallus, and the mammals Bos taurus and H. sapiens. Only vertebrates possess 3 paralogues.
Multiple sequence alignment
Phylogenetic analysis requires identifying homologous bases or amino acid residues between the different sequences. Homology is inferred by a sequence alignment. The sequences are put in every row one after the other to arrange every homologous base or amino acid in the same column.
Use an appropriate tool (e.g. ClustalW) to generate an alignment of the sequences. Insertions and deletions are indicated by gaps "-" added to the sequences. Once the alignment is completed, it can be exported using fasta format.
Here is a list of tools for nucleic acid or amino acid sequence alignment.
A | B | C |
---|---|---|
Software | Features | Link |
BAli-Phy | Multiple sequence alignment of nucleotide and amino acid sequences and phylogenetic analysis using a bayesian approach | http://www.bali-phy.org |
CLUSTAL Omega* | Speed-oriented multiple sequence alignment for nucleotide or amino acid data, suitable for large datasets | https://www.ebi.ac.uk/Tools/msa/clustalo/ |
CLUSTALW* | Multiple sequence alignment for nucleotide or amino acid data | https://www.genome.jp/tools-bin/clustalw |
CONTRAlign (ProbCons) | Accuracy-oriented multiple sequence alignment for amino acid data | http://contra.stanford.edu/contralign/ |
Kalign* | Multiple sequence alignment for nucleotide or amino acid data, suitable for large datasets | https://www.ebi.ac.uk/Tools/msa/kalign/ |
MAFFT* | Accuracy-oriented multiple sequence alignment for nucleotide or amino acid data | https://mafft.cbrc.jp/alignment/server/ |
MUSCLE* | Multiple sequence alignment for nucleotide or amino acid data | https://www.ebi.ac.uk/Tools/msa/muscle/ |
PASTA | Speed-oriented multiple sequence alignment for nucleotide or amino acid data, designed for very large datasets | https://bioinformaticshome.com/tools/msa/descriptions/PASTA.html |
PRANK/ WebPRANK* | Speed-oriented multiple sequence alignment for nucleotide or amino acid data, should be preferred for close sequences and large datasets | http://wasabiapp.org/software/prank/ https://www.ebi.ac.uk/goldman-srv/webprank/ |
SATé | Software package for multiple sequence alignments and phylogenetic inference | https ://phylo.bio.ku.edu/software/sate/sate.html |
T-COFFEE* | Accuracy-oriented multiple sequence alignment of nucleotide and amino acid sequences | http://tcoffee.crg.cat/ |
UPP | Speed-oriented multiple sequence alignment of nucleotide and amino acid sequences, designed for very large data sets | https://github.com/smirarab/sepp. |
List of programs for sequence alignment (* indicates a web interface)
Alignment trimming
It is recommended to check the alignment and, when necessary, to improve it manually or using alignment trimming tools. Trimming is the selection of phylogenetically informative sites in the alignment. Poorly aligned positions and highly variable regions are not phylogenetically informative, because these positions might not be homologous or subject to saturation. These positions should be excluded prior to the phylogenetic analysis to maximize the phylogenetic signal of the alignment. Then, the alignment can be exported using fasta format.
Use one of these tools (e.g. Guidance 2) to compute the completeness of your alignment and exclude the poorly aligned regions (regions of the alignment with low scores). Save the new alignment in fasta.
Here is a selection of tools to quantify the completeness of alignments and selection of the phylogenetic informative regions of the alignment.
A | B | C |
---|---|---|
Software | Features | Link |
AliStat | Selection of informative regions on multiple sequence alignments | https://github.com/thomaskf/AliStat |
BMGE | Selection of informative regions on multiple sequence alignments | https://gitlab.pasteur.fr/GIPhy/BMGE |
GBlocks | Selection of informative regions on multiple sequence alignments | http://molevol.cmima.csic.es/castresana/Gblocks.html |
Guidance 2* | Selection of informative regions on multiple sequence alignments | http://guidance.tau.ac.il/ |
Noisy | Selection of informative regions on multiple sequence alignments | http://www.bioinf.uni-leipzig.de/Software/noisy/ |
trimAl | Selection of informative regions on multiple sequence alignments | http://trimal.cgenomics.org/ |
List of programs for sequence alignment trimming (* indicates a web interface)
Assessing phylogenetic assumptions (optional)
Phylogenetic models rely on simplifying assumptions stating for example that all sites in the alignment evolved under the same tree, that mutation rates have remained constant, and that substitutions are reversible. If the phylogenetic data violate these assumptions, the phylogeny and evolutionary analyses can be biased. Once the alignment is performed and the sites selected for phylogenetic inference, it is recommended to assess those phylogenetic assumptions when possible. Several statistical methods have been developped. Tests for all these assumptions have been included in IQ-TREE and R (package MOTMOT).
Phylogenetic analysis
Phylogenetic inference
The evolutionary history of genes, proteins or species is generally presented as a phylogenetic tree, a graphical illustration of the evolutionary relationships between the studied taxa. Several methods for phylogenetic inference exist: Maximum Parsimony, the distance-based methods and the probabilistic methods. The probabilis| A | | --- | | |
e most widely used for molecular data, but other methods can be used in complement.
Choose one or several phylogenetic methods to reconstruct the evolutionary history of your gene, protein or species of interest. See the sub-steps below for the specificities of each method. It can be interesting to use several approaches and compare the results (e.g. Maximum Likelihood, Neighbour Joining and Bayesian Inference).
He| A | | --- | | |
reconstruction using diverse methods, and visualization of phylogenetic trees that can be used in complement.
A | B | C |
---|---|---|
Software | Features | Link |
APE | R-written package for molecular phylogenetics | http://ape-package.ird.fr |
BAli-Phy | Sequence alignment and phylogenetic inference using a Bayesian approach | http://www.bali-phy.org |
BayesTraits | Phylogenetic inference and other evolutionary analyses using Bayesian inference | http://www.evolution.reading.ac.uk/BayesTraitsV4.0.0/BayesTraitsV4.0.0.html |
BEAST | Diverse evolutionary analyses using Bayesian inference, including phylogenetic analysis, calibration and molecular clock | http://www.beast.community |
ETE Toolkit | Visualization and analysis of phylogenetic trees | http://etetoolkit.org/ |
FastMe | Fast phylogenetic inference using distance methods. | http ://www.atgc-montpellier.fr/fastme/ |
FastTree | Phylogenetic inference using ML for nucleotide (GTR and JC models) and amino acid (JTT and WAG models), and Shimodaira-Hasegawa test. Suitable for very large datasets. | http://www.microbesonline.org/fasttree/ |
FigTree | Graphic software for phylogenetic trees | http://tree.bio.ed.ac.uk/software/figtree/ |
GARLI | Phylogenetic inference using ML for nucleotide (GTR model), amino acid (most models) or codon data, with Gamma law and proportion of invariant sites. | http ://evomics.org/resources/software/molecular-evolution-software/garli/ |
HYPHY* | Diverse evolutionary analyses including evolution model selection, phylogenetic inference using ML and distance methods and sequence evolution studies | https ://www.hyphy.org/ |
IQ-TREE | ML phylogenetic inference, including model selection and ultrafast bootstrapping method. Includes the GHOST evolution model and tests for phylogenetic assumptions. | http ://www.iqtree.org/ |
ITOL | Visualization and annotation of phylogenetic trees | https://itol.embl.de/ |
MEGA | Sequence alignment, model selection, phylogenetic analysis (parsimony, distance methods). Includes all common nucleotide and amino acid evolution models, Gamma law and proportion of invariant sites, and Bootstrapping method. | https://www.megasoftware.net/ |
MrBayes | Bayesian phylogenetic inference, ancestral states reconstruction, phylogenetic calibration and other evolutionary analyses | http://nbisweden.github.io/MrBayes/ |
PAML | Maximum likelihood phylogenetic inference, estimation of selection strength, ancestral states reconstruction and other analyses | http://abacus.gene.ucl.ac.uk/software/paml.html |
PAUP | Phylogenetic inference using maximum parsimony and ML on nucleotide sequences (all ModelTest models), with Gamma law and proportion of invariant sites and bootstrapping method. | http://paup.phylosolutions.com/ |
PHYLIP | Phylogenetic inference using parsimony, distance methods and ML | https ://evolution.genetics.washington.edu/phylip.html |
PhyloBayes | Phylogenetic inference using Bayesian inference on proteins using a specific probabilistic model | http ://www.atgc-montpellier.fr/phylobayes/ |
PhyML* | Phylogenetic inference using ML, ancestral states reconstruction and various evolutionary analyses. Includes all common DNA and protein evolution models and diverse branch support methods (Bootstrap, Shimodaira-Hasegawa, aLTR…). | https://github.com/stephaneguindon/phyml Web interface : http://atgc.lirmm.fr/phyml/ |
PyCogent | Phylogenetic inference and phylogeny drawing, various evolutionary analyses including partition models and ancestral states reconstruction | https ://github.com/pycogent/pycogent |
RAxML | Phylogenetic inference using ML with nucleotide (GTR) or amino acid data (all common models) with Gamma law or CAT and proportion of invariant sites. Suitable for large datasets. | https://cme.h-its.org/exelixis/web/software/raxml/ |
SeaView | Sequence alignment and phylogenetic inference using maximum parsimony, NJ and ML | http://doua.prabi.fr/software/seaview |
SplitsTree | Phylogenetic inference, in particular unrooted trees, or phylogenetic networks | https ://uni-tuebingen.de/fakultaeten/mathematisch-naturwissenschaftliche-fakultaet/fachbereiche/informatik/lehrstuehle/algorithms-in-bioinformatics/software/splitstree/ |
List of programs and databases for phylogenetic analysis using diverse methods (* indicates a web interface)
Option 1: Maximum parsimony
Maximum parsimony is a classic and simple method, that calculates the minimum number of evolutionary steps, including nucleotide insertions, deletions or substitutions, between species.
However, this method ignores hidden mutations and does not take into account branch lengths, potentially leading to long branch attraction, which is an incorrect clustering of unrelated taxa. Furthermore, it does not consider the possibility of hidden mutations, making it not relevant for distant taxa. While maximum parsimony is still used for morphological data, it is no longer considered relevant for molecular data.
PAUP , MEGA , SeaView and Phylip can be used for phylogenetic analysis using maximum parsimony.
Option 2: Distance-based methods
Distance-based methods create a matrix of molecular distance, defined by the number of differences between the sequences, to reconstruct the phylogenetic tree. Several distance-based methods exist. The Unweighted Pair Group Method with Arithmetic mean (UPGMA), Neighbor Joining (NJ), and Minimum Evolution (ME) are all based on the overall molecular distance, defined by the number of differences between the sequences. Distance-based methods also ignores hidden mutations and is also subject to long branch attraction.
FastME , PAUP , MEGA , FastTree or Phylip can be used for distance-based methods.
Option 3: Probabilistic methods (requires selection of the molecular evolution model, see below)
The strength of probabilistic methods is the use of specified models of molecular evolution. Probabilistic methods take into account different mutation rates between sites to avoid mutation saturation. Nowadays, almost all studies of phylogenetic reconstruction use probabilistic methods.
Probabilistic methods include Maximum Likelihood (ML) and Bayesian inference (BI). ML calculates the probability of observing the data (in this case, the sequence alignment) under different explicit models of molecular evolution. ML aims to identify the best fit model by exploring multiple combinations of model parameters. BI evaluates the probability of each model of molecular evolution given the data. We recommend using different methods of phylogenetic reconstruction, including ML and BI.
Probabilistic methods require selection of the molecular evolution model.
Selection of the molecular evolution model (for probabilistic methods only)
Prior to phylogenetic analysis, probabilistic methods require selection of the model of molecular evolution that best describes the data. Nucleotide or amino acid substitution models exist. The nucleotide substitution models differ in the number of parameters considered, like mutation rates and base frequencies. The main nucleotide substitution models are, from the simplest to the most complex: JC69, K80, F81, HKY85, TN93, GTR. The main amino acid substitution models include JTT, WAG, LG and Dayhoff. Each model can be associated with the proportion of invariant nucleotide or amino acid sites (+I) or with the Gamma distribution (+G), which corresponds to a mutation rate heterogeneity between sites. More recently, the GHOST model for alignments with variation in mutation rate was introduced and implemented in IQ-TREE .
The likelihood of the different models should be computed by appropriate software. For every substitution model, these tools calculate the Bayesian information criterion ( BIC ) and the Akaike information criterion ( AIC ) from the log-likelihood scores. A model with lower AIC is considered more accurate. The model optimizing BIC or AIC ( i.e., with the lowest score) should be selected. Molecular evolution model selectors are also included in MEGA and PhyML ( SMS ).
Using e.g. jModelTest , compute the BIC and AIC of the different models on the alignment. Select the model optimizing (minimizing) those criterions.
You will use this model to compute the phylogenetic tree of your protein.
Here is a selection of tools that can be used for molecular evolution model selection.
A | B | C |
---|---|---|
Software | Features | Link |
ModelFinder | Fast model selection with a model of rate heterogeneity between sites (nucleotide, amino acids or codons) | Implemented in IQ-TREE |
ModelTest / jModelTest | Nucleotide substitution model selection | http://evomics.org/resources/software/molecular-evolution-software/modeltest/ |
PartitionFinder 2 | Molecular evolution model selection (nucleotide or amino acids) | http://www.robertlanfear.com/partitionfinder/ |
ProtTest | Aminoacid substitution model selection | https://github.com/ddarriba/prottest3 |
SMS | Molecular evolution model selection included in PhyML (nucleotide or aminoacid) | http://www.atgc-montpellier.fr/sms/ |
List of programs for molecular evolution model selection
Maximum Likelihood
Maximum Likelihood methods calculate the probability of observing the data under different explicit models of molecular evolution. Many programs for maximum-likelyhood-based phylogenetic analysis exist, that can be accuracy-oriented or speed-oriented.
Choose a program according to the size of your dataset (see Table in Step 6). For beginners, we recommend SeaView or MEGA , which include several tools for sequence alignment, phylogenetic inference including probabilistic methods and others, and a tree editor. PhyML is accurate, easy of use and, like PAUP and MEGA , includes all common models of molecular evolution. PhyML also includes a web interface. RAxML and particularly FastTree are fast and well suited for large datasets (up to 1 million sequences with FastTree). They use only a restricted a specific model of rate heterogeneity, in addition to Gamma law and proportion of invariant sites. Like Garli , their choice of nucleotide evolution model is limited to GTR. IQ-TREE , that includes ModelFinder and a very fast bootstrapping method, is reported to be both fast and accurate. PAUP is slower than other programs, and uses nucleotide data only.
In our example, we used Maximum likelihood to build the phylogenetic tree of the P53 family (including the p73, p63 and p53 proteins of different animal species). Use an appropriate programme (e.g. MEGA ) to reconstruct the phylogeny of P53 family using Maximum Likelihood with the appropriate model of evolution.
Bayesian inference
The most recent method for phylogenetic reconstruction uses Bayesian inference, which calculates the probability of the molecular evolution model given the data. The main software used for BI-based phylogenetics are MrBayes and BEAST , that use the Markov Chain Monte Carlo (MCMC) algorithm. PhyloBayes is a Bayesian MCMC sampler for phylogenetic reconstruction with protein data using a specific probabilistic model, well adapted for large datasets and phylogenomics. Bali-Phy can also be used for phylogenetic analysis using Bayesian inference.
Tree rooting
The root of a phylogenetic tree is the hypothetical last common ancestor of all the taxa present in the tree. Depending on the addressed question, phylogenetic trees can be unrooted or rooted. The latter corresponds to the identification of ancestral and derived states, aiming at studying the direction of the evolution and interpreting the evolution of the studied taxa. Diverse methods have been developed to root phylogenetic trees. The most common require including outgroups in the analysis. Outgroups are taxa that do not belong to the studied ingroup but are closely related. Typically, two outgroups are selected, one being more closely related to the ingroup than to the other outgroup, allowing for a proper identification of the states of characters (ancestral or derived).
In our example, we include the P53 homologues from the choanoflagellate Monosiga brevicollis and the Cnidarian Hydra vulgaris .
When outgroups are not identified, like in the case of viruses, alternative methods can be used. For example, midpoint rooting places the root at the mid-point of the longest branches, and molecular clock rooting assumes that evolution speed is constant between the sequences.
Tree drawing
Once the phylogenetic tree has been computed, it can be exported using e.g. Newick file format and visualized using a graphical software such as FigTree , ETE Toolkit or ITOL . MEGA and SeaView also include visualization tools. Using different sets of options, several types of phylogenetic trees can be drawn (rooted or not, cladogram or phylogram), and branch support values (bootstrap values or posterior probabilities) can be displayed.
In our example, we used FigTree for the graphical representation of the phylogenetic tree of the P53 family.

Working with phylogenies
Reconstruct the evolution of the gene or protein
Sequence alignments and phylogenetic trees can be used to reconstruct diverse aspects of the evolutionary history of genes, proteins and species, as well as the study the genetic structures within populations. In this last section, we provide a brief and non-exhaustive overview of evolutionary studies that can be performed using bioinformatic tools.
Reconstitution of ancestral states
Retracing the functional evolution of genes, proteins, or biological traits often requires the reconstitution of ancestral states. Ancestral states can be inferred from a phylogenetic tree using maximum of parsimony, maximum likelihood, or Bayesian inference; and requires the aligned sequences and the model of molecular evolution that has been used for the phylogenetic analysis (when using probabilistic methods only).
A | B | C |
---|---|---|
Software | Features | Link |
BayesTraits | Evolutionary analyses using Bayesian inference | http://www.evolution.rdg.ac.uk/BayesTraitsV3.0.5/BayesTraitsV3.0.5.html |
BEAST | Diverse evolutionary analyses using Bayesian inference, including phylogenetic analysis, calibration and molecular clock | http://www.beast.community |
Mega | Sequence alignment, model selection, phylogenetic analysis (parsimony, distance methods) | https://www.megasoftware.net/ |
Mesquite | Comparative analyses and statistics | http://www.mesquiteproject.org/ |
MrBayes | Bayesian phylogenetic inference, ancestral states reconstruction, phylogenetic calibration… | http://nbisweden.github.io/MrBayes/ |
PAML | Maximum likelihood phylogenetic inference, estimation of selection strength, ancestral states reconstruction and other analyses | http://abacus.gene.ucl.ac.uk/software/paml.html |
RASP | Ancestral states reconstruction | http://mnh.scu.edu.cn/soft/blog/RASP/index.html |
List of programs and databases for ancestral states reconstruction
Measure of selection strength
The strength of selection on protein coding genes can be calculated by evaluating the ration of the number of non-synonymous mutations (mutations changing the protein sequence) per non-synonymous site (dN) and the number of synonymous mutations (mutations with no effect on the protein sequence due to the redundancy of the genetic code) per synonymous site (dS). The ratio dN/dS reveals, if >1, that non-synonymous mutations are higher than expected and the gene is under positive selection. If dN/dS<1, the gene is under purifying selection and if dN/dS=1, the selection is neutral. Selection strength can be calculated by sequence alignment programs such as MEGA .
Phylogenetic calibration
Phylogenetic calibration consists in estimating the age of speciation events (the nodes in the phylogenetic tree) with events with a known age, for example fossil data and other geological data (that can only give minimal ages) as calibration points. Alternatively, mutation rates can be used to calculate the divergence time between two sequences.
Databases such as Timetree , also implemented in MEGA , compute the estimated divergence time between species. Mesquite also provides tools to calibrate phylogenetic trees in geological times using fossil data. Ohnologs can also be used to estimate the divergence time between homologues resulting from whole genome duplications in vertebrates.
A | B | C |
---|---|---|
Software | Features | Link |
BayesTraits | Evolutionary analyses using Bayesian inference | http://www.evolution.rdg.ac.uk/BayesTraitsV3.0.5/BayesTraitsV3.0.5.html |
BEAST | Diverse evolutionary analyses using Bayesian inference, including phylogenetic analysis, calibration and molecular clock | http://www.beast.community |
Mega | Sequence alignment, model selection, phylogenetic analysis (parsimony, distance methods) | https://www.megasoftware.net/ |
Mesquite | Comparative analyses and statistics | http://www.mesquiteproject.org/ |
MrBayes | Bayesian phylogenetic inference, ancestral states reconstruction, phylogenetic calibration… | http://nbisweden.github.io/MrBayes/ |
Ohnologs | Database of vertebrate ohnologues, resulting from whole genome duplications | http://ohnologs.curie.fr/ |
Timetree | Tree calibration | http://www.timetree.org/ |
List of programs and databases that can be used for phylogenetic calibration using diverse methods
Study of coevolution
Evolutionary ecology and parasitology often study the evolution of hosts and parasites association through an approach of coevolution, or reciprocal genetic change between different species. Coevolution can also be used to study associations of genes or proteins. Studying coevolution consists in identifying evolutionary events such as co-speciation, host change, and duplication or loss of interaction from the phylogenetic trees of the hosts and the parasites.
In our second example, we used Maximum likelihood to reconstruct the evolution of cyclins and CDKs, two families of proteins involved in cell cycle control and closely interacting. We used Jane and TreeMap to reconstruct coevolutionary scenarios (cophylogenies) between the two families.
In both figures, the clustering of cyclins and CDKs indicate an interaction (in this case, that the cyclin can bind the CDK). Red spots indicate significant events of coevolution between the two families of proteins. Co-speciation (hollow red circle), duplications (solid red circle), duplications with host switch (yellow circle), loss of interaction (dashed lines), failures to diverge (jagged lines) are indicated on the figure.
A | B | C |
---|---|---|
Software | Features | Link |
Copycat | Software for studying coevolution | http://www.cophylogenetics.com/ |
Core-PA | Software for studying coevolution | http://pacosy.informatik.uni-leipzig.de/49-1-CoRe-PA.html |
Jane | Software for studying coevolution | https://www.cs.hmc.edu/~hadas/jane/ |
TreeMap | Software for studying co-evolution | https://sites.google.com/site/cophylogeny/treemap |
List of programs to study coevolution

Genome evolution
Evolutionary events, such as mutations, insertions, deletions, gene or whole genome duplications, genome reorganization, and genetic exchanges can be identified using phylogenetic trees in complement with genomics tools and databases. Here is a list of databases tools to study diverse aspects of genome evolution, including genome browsers of diverse lineages and tools for comparative genomics and evolutionary genomics:
A | B | C |
---|---|---|
Software | Features | Link |
BAR | Database of plant genes and proteins | http://bar.utoronto.ca/ |
CAFE | Gene family evolution | https://github.com/hahnlab/CAFE5 |
CoGE | Comparative genomics analyses | https://genomevolution.org/coge/ |
Ensembl | Genome browser of vertebrates, includes tools for identification of homology | https://www.ensembl.org/index.html |
Entrez | Gene sequences and structures | https://www.ncbi.nlm.nih.gov/Web/Search/entrezfs.html |
FlyBase | Genome and proteins of the model insect D. melanogaster | https://flybase.org/ |
GenBank | Annotated DNA sequences | https://www.ncbi.nlm.nih.gov/genbank/ |
HGT-Finder | Horizontal gene transfer finding | http://cys.bios.niu.edu/HGTFinder/ HGTFinder.tar.gz |
Ohnologs | Database of vertebrate ohnologues, resulting from whole genome duplications | http://ohnologs.curie.fr/ |
PomBase | Genes and proteins of the model yeast S. pombe | https://www.pombase.org/ |
TAIR | Genome and proteins of the model plant A. thaliana | https://www.arabidopsis.org/ |
WormBase | Genome and proteins of the model nematode C. elegans | https://wormbase.org//#012-34-5 |
Xenbase | Genome and proteins of the model amphibian X. laevis | http://www.xenbase.org/entry/ |
List of programs and databases to study genome evolution
Phylogenetic comparative analysis
Evolutionary biology often employs the so-called phylogenetic comparative methods to study the adaptive significance of biological traits. These methods aim at identifying biological characters, in terms of morphology, physiology, or ecology, that result from a shared ancestry. Comparative analyses can be done for quantitative or qualitative variables. Mesquite is a very appropriate tool for comparative analysis and to compute statistics on phylogenetic trees. BayesTraits can also be used.
Population genetics
Genetic diversity can also be explored at the population level by analyzing polymorphism between members of the same species. Gene polymorphisms studies allele diversity within a population, including single nucleotide polymorphisms (SNP), indels, microsatellites or transposable elements. Mathematical models have been developed to describe polymorphism. Several programs are suitable for population genetics studies.
A | B | C |
---|---|---|
Software | Features | Link |
DNAsp | Analysis of DNA polymorphism | http://www.ub.edu/dnasp/ |
Genepop | Population genetics analyses | https://genepop.curtin.edu.au/ |
SNiplay | SNP detection and other population genetics analyses | https://sniplay.southgreen.fr/cgi-bin/home.cgi |
Arlequin | Population genetics analyses | http://cmpg.unibe.ch/software/arlequin35/ |
List of programs and databases for population genetics
Study of protein structure and function evolution
Studying the functional evolution of proteins can require structure alignments, that can be realized by appropriate programs such as PyMol , and the mean distance in Å between homologous residues can be calculated .
Protein structures are described by databases such as the Protein Data Bank ( PDB ). The PDB provides the 3-dimensional structures of proteins and their interacting ligands established by X-ray crystallography, electron microscopy, or NMR spectroscopy, which can be retrieved as pdb files. PDB also displays a 3D visualization tool, programs for 3D analyses such as pairwise structure alignment and pairwise symmetry, and cross links to other protein databases. Annotation for protein families based on fingerprints, i.e ., conserved 3-dimensional motifs specific for a protein family, are gathered in the database PRINTS . PRINTS includes a 3D visualization software and search tools for protein sequence homology and pairwise sequence alignment or multiple sequence alignment.
Other programs allow to identify and study structures conservation between proteins and infer a structures from a protein sequence. These analyses and a phylogenetic tree of the protein families, together with the reconstruction of ancestral states, can facilitate the study of the evolution of protein functions within the family.
Here is a list of tools that can be used for analyses on protein structures in an evolutionary framework:
A | B | C |
---|---|---|
Software | Features | Link |
PyMol | 3D visualization of molecules | http://www.mesquiteproject.org/ |
PDB | 3-dimensional structures of proteins | https://www.rcsb.org/ |
Prints | Protein fingerprints classification | http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/ |
PyMol | 3D visualization of molecules and diverse analyses on protein structures | https://pymol.org/2/ |
I-Tasser | Protein structure prediction | https://zhanglab.dcmb.med.umich.edu/I-TASSER/ |
Forsa | Protein structure prediction | http://www.bo-protscience.fr/forsa/ |
HHPred | Protein structure prediction | https://toolkit.tuebingen.mpg.de/tools/hhpred |
List of programs for protein structure analyses