A Guide to Gene-Centric Analysis Using TreeSAPP

Connor Morgan-Lang, Connor Morgan-Lang, Steven J. Hallam, Steven J. Hallam

Published: 2023-02-21 DOI: 10.1002/cpz1.671

Abstract

Gene-centric analysis is commonly used to chart the structure, function, and activity of microbial communities in natural and engineered environments. A common approach is to create custom ad hoc reference marker gene sets, but these come with the typical disadvantages of inaccuracy and limited utility beyond assigning query sequences taxonomic labels. The Tree-based Sensitive and Accurate Phylogenetic Profiler (TreeSAPP) software package standardizes analysis of phylogenetic and functional marker genes and improves predictive performance using a classification algorithm that leverages information-rich reference packages consisting of a multiple sequence alignment, a profile hidden Markov model, taxonomic lineage information, and a phylogenetic tree. Here, we provide a set of protocols that link the various analysis modules in TreeSAPP into a coherent process that both informs and directs the user experience. This workflow, initiated from a collection of candidate reference sequences, progresses through construction and refinement of a reference package to marker identification and normalized relative abundance calculations for homologous sequences in metagenomic and metatranscriptomic datasets. The alpha subunit of methyl-coenzyme M reductase (McrA) involved in biological methane cycling is presented as a use case given its dual role as a phylogenetic and functional marker gene driving an ecologically relevant process. These protocols fill several gaps in prior TreeSAPP documentation and provide best practices for reference package construction and refinement, including manual curation steps from trusted sources in support of reproducible gene-centric analysis. © 2023 The Authors. Current Protocols published by Wiley Periodicals LLC.

Basic Protocol 1 : Creating reference packages

Support Protocol 1 : Installing TreeSAPP

Support Protocol 2 : Annotating traits within a phylogenetic context

Basic Protocol 2 : Updating reference packages

Basic Protocol 3 : Calculating relative abundance of genes in metagenomic and metatranscriptomic datasets

INTRODUCTION

We live in a world dominated by bacterial and archaeal (prokaryotic) microorganisms, with an estimated 1030 prokaryotic cells on Earth (Whitman, Coleman, & Wiebe, 1998). Collectively, this microcosmos drives matter and energy conversion processes that define the conditions for life, now and through the deep expanse of geological time (Falkowski, Fenchel, & Delong, 2008). Recent advances in high-throughput sequencing and mass spectrometry platforms have enabled us to develop more quantitative insights into the structure and function of the microcosmos at the individual, population, and community levels of biological organization (Hawley, Brewer, Norbeck, Paša-Tolić, & Hallam, 2014; Hawley et al., 2017; Nayfach et al., 2020). The resulting datasets have revealed incredible diversity, metabolic resilience, and adaptive potential, providing new opportunities for basic and applied research. However, our ability to attribute metabolic processes to specific microbes in the environment is currently limited.

Charting microbial community structure, function, and activity is hampered by query and reference dataset size and complexity (Nasko et al., 2018). Isolate and single-cell genomes retain the inherent link between microbial taxonomy and metabolic function (Pachiadaki et al., 2019; Rinke et al., 2013), but typically lack the necessary throughput to capture community-level diversity. Information acquired through genome-resolved metagenomics can capture more of this diversity (often tens to hundreds of genomes per sample; Meyer et al., 2022) at lower expense. However, interpreting these data can be challenging for environments that are underrepresented in public sequence databases. For example, classification algorithms often fail to account for potential sequence divergence that can lead to false attributions under these conditions. Moreover, common genome-resolved workflows fail to analyze the entire gene content contained within metagenomes based on a subset of the data that can be accurately binned.

The Tree-based Sensitive and Accurate Phylogenetic Profiler (TreeSAPP) software package enables taxonomic profiling of selected marker genes using phylogenetic placement (Morgan-Lang et al., 2020). It was developed to chart the structure, function, and activity of microbial communities in natural and engineered environments with increased precision. TreeSAPP relies on custom databases called reference packages (to identify homologous query sequences in organismal and multi-organismal datasets) and on taxonomic classification. A reference package contains a multiple sequence alignment, a profile hidden Markov model (HMM), taxonomic lineage information, and a phylogenetic tree to represent a set of reference marker genes. In the classification workflow, homologous sequences identified by reference package HMMs are aligned to reference sequences and placed on a phylogeny to infer evolutionary relationships to known protein sequences. Once created, reference packages can be updated and refined as new sequence information becomes available, including isolate reference genomes, single-cell amplified genomes (SAGs), and metagenome-assembled genomes (MAGs), thereby improving the precision of taxonomic classification. Moreover, unlike genome-centric metagenomic workflows in which a subset of the assembled sequence information is used, TreeSAPP classifies all homologous sequences found in reads or contigs, thereby providing a more complete community profile with accurate taxonomic assignment. TreeSAPP outputs can be rendered using the Interactive Tree of Life (iTOL) tool (Letunic & Bork, 2019) or with programmatic data visualization libraries to generate beautiful and informative figures. TreeSAPP has recently been used to identify glycoside hydrolase enzymes with novel blood antigen–cleaving activities (Rahfeld, Sim, et al., 2019; Rahfeld, Wardman, et al., 2019; Wardman et al., 2021) and to classify lignin-depolymerizing enzymes into subclasses (Levy-Booth et al., 2022).

Here, we provide guidance on working with reference packages and implementing TreeSAPP modules (Fig. 1). Basic Protocol 1 describes reference package construction from a set of candidate protein sequences using treesapp create. In Basic Protocol 2, the reference package is updated to include novel sequences from complete microbial genomes and metagenome-assembled genomes (MAGs) using treesapp assign and treesapp update. Finally, Basic Protocol 3 describes how to quantify assigned sequences by mapping reads from metagenome and metatranscriptome datasets using treesapp abundance. The Support Protocols 1 and 2 explain installation of TreeSAPP, best practices for acquiring protein sequences for reference package construction from trusted sources, and annotation of traits on reference package phylogenies to enable automated trait-based annotation of queries. These protocols are naturally dependent on the comparative framework afforded by reference packages but agnostic with respect to the specific orthologous group or protein family represented.

TreeSAPP software package modules for creation and quality control of reference packages and for assignment and visualization of protein sequence profiles using phylogenetic trees.
TreeSAPP software package modules for creation and quality control of reference packages and for assignment and visualization of protein sequence profiles using phylogenetic trees.

For instructional purposes we focus on the alpha subunit of methyl-coenzyme M reductase (McrA), the enzyme responsible for biogenic methane production and anaerobic methane oxidation in archaea, as a primary use case. The MCR enzyme facilitates the terminal reductive step in methanogenesis, where a methyl group is liberated from coenzyme M (Thauer, 1998). When the pathway operates in reverse to oxidize methane under anaerobic conditions, MCR binds methane, or even short-chain alkanes such as ethane and butane, to coenzyme M (Hallam et al., 2004; Orphan, House, Hinrichs, McKeegan, & DeLong, 2002). As the canonical marker for methane metabolism in archaea, McrA provides a plethora of mechanistic information including crystal structures, process rates, and co-factor requirements for many cultured archaeal lineages. Moreover, the phylogeny of McrA is concordant with small-subunit ribosomal RNA phylogeny, making it both a phylogenetic and functional anchor supporting gene-centric analysis (Dhillon et al., 2005; Hallam, Girguis, Preston, Richardson, & DeLong, 2003; Lueders, Chin, Conrad, & Friedrich, 2001; McKay, Hatzenpichler, Inskeep, & Fields, 2017). The TreeSAPP workflow presented here for McrA is extensible to any protein with sufficient conservation e.g., orthologous groups, and a source of trusted reference sequence information.

Basic Protocol 1: CREATING REFERENCE PACKAGES

This protocol describes a workflow to create and validate a TreeSAPP reference package. Candidate reference McrA protein sequences are provided. These protein sequences were retrieved from FunGene (Fish et al., 2013), RefSeq (Wheeler et al., 2007), and recent publications describing novel lineages with the potential to metabolize methane or other short-chain alkanes (Borrel et al., 2019; Boyd et al., 2019; Chen et al., 2019; Evans et al., 2015; McKay et al., 2019; Nobu, Narihiro, Kuroda, Mei, & Liu, 2016; Sorokin et al., 2017; Vanwonterghem et al., 2016; Wang, Wegener, Hou, Wang, & Xiao, 2019). Subsequent curation steps within TreeSAPP are used to select for a nonredundant set of representative sequences that maximize phylogenetic breadth of the reference package.

Taxonomic lineage information of candidate reference sequences can be provided by either the input FASTA or a table. FASTA sequence headers from RefSeq contain an Entrez-compatible accession identifier that can be queried and linked to a sequence's NCBI taxonomic lineage. This is the easiest way to provide lineage information, but also restricts the sources that can be used to create and update a reference package. To overcome this limitation, a tab-delimited table can also be provided, where lineages in SILVA or GTDB format are mapped to sequence accessions (i.e., all text before the first whitespace character) (Parks et al., 2018; Quast et al., 2013). Both modes are used in this workflow.

Reference package purity—achieved when only orthologous sequences are classified within a given reference package, thereby indicating all reference sequences are homologous—is confirmed by classifying sequences based on accurate functional annotations. This is accomplished by using a trusted source such as MetaCyc, TIGRFAM, or Swiss-Prot (Caspi et al., 2006; Haft et al., 2013; UniProt Consortium, 2021). The reported coverage metric is a function of the number of phylogenetic tree leaves that are descendants of query sequence placements. A basal placement would accordingly yield 100% coverage and suggest remote homology between the reference set and the query's orthologous group. Conversely, low coverage where the queries map to the leaf tips indicates a close relationship to the reference(s). For this protocol the TIGRFAM database seed (61,111 sequences from 4488 TIGR families) is used by treesapp purity along with an optional table describing TIGR families to enrich the output. Although the TIGRFAM database seed is well suited for curating reference packages, it is no longer being actively developed. Alternative sources, such as eggNOG and Swiss-Prot (Bairoch & Apweiler, 1997; Huerta-Cepas et al., 2019), will be needed in the future as more reference packages are constructed for user-defined protein coding genes of interest.

Necessary Resources

Hardware

  • To efficiently implement TreeSAPP, 32 Gb RAM and at least a quad-core processor (up to eight threads) are required. In addition, 40 Gb free hard disk space is required to store the inputs and outputs (including temporary files) of the protocols.

Software

  • A computer running either a 64-bit Linux-based distribution (e.g., Ubuntu) or MacOS is required.
  • To install TreeSAPP and its dependencies, either Conda, Docker, or Singularity must be installed. To install Conda, execute the following commands, where the “$DIST ” variable must be replaced by “Linux” or “MacOSX”, depending on the computer's operating system.
  • _curl -O _
  • https://repo.anaconda.com/miniconda/Miniconda3-latest-$DIST-x86\\_64.sh
  • bash Miniconda3-latest-$DIST-x86_64.sh
  • To install Docker, follow the instructions at https://docs.docker.com/get-docker/. To install Singularity, follow the instructions at https://sylabs.io/guides/3.0/user-guide/installation.html (Linux only).
  • To install TreeSAPP, see Support Protocol 1.Support Protocol 2 requires users to have an iTOL account and subscription. This is only required for visualizing phylogenetic trees with query sequences placed; few alternatives that import JPlace files exist (Matsen, Hoffman, Gallagher, & Stamatakis, 2012).
  • This protocol also requires a web browser such as Chrome, Safari, or Firefox.

Files

  • Change your directory to a location on your computer's file system where you are able to write data for these protocols. It is suggested to create a new directory and then move into it to invoke TreeSAPP commands. The data archive is 1.5 GB and will require ∼3 minutes to download, assuming a 10-MB/s internet download speed.
  • mkdir treesapp_protocols
  • cd treesapp_protocols
  • wget https://zenodo.org/record/7499961/files/treesapp_protocols_data.tar.gz
  • mkdir protocols_reqs; tar -xzf treesapp_protocols_data.tar.gz -C protocols_reqs/
  • The last two commands will download a manifest and the files required for all Basic and Support Protocols into the current working directory.

1.Install TreeSAPP as described (see Support Protocol 1).

2.Create a reference package with TreeSAPP.

The TreeSAPP subcommand used to build new reference packages is treesapp create. This command will query the Entrez server and gather taxonomic lineage information for all sequences, deduplicate sequences by clustering, align sequences together, and infer a phylogeny.

  • treesapp create --refpkg_name McrA \
  • --fastx_input protocols_reqs/McrA_seed_proteome.faa \
  • --seqs2lineage protocols_reqs/lineage_table.tsv \
  • --output bp1_create/ \
  • --min_seq_length 400 --similarity 0.97 --min_taxonomic_rank p \
  • --headless --trim_align --cluster --fast --num_procs 8

An explanation of the above command line arguments follows. –fastx_input and –output control the paths to the FASTA for creating the reference package and the directory to write the outputs, respectively. –seqs2lineage points to a table mapping a subset of the sequence accessions to taxonomic lineages. –min_seq_length was set to 400, thereby excluding all sequences shorter than 400 amino acids from the reference package. This corresponds to ∼75% of the full-length McrA protein sequence and may well need adjustment depending on the typical length of the reference sequences. Several optional treesapp create flags are also used, including –trim_align , which invokes multiple sequence alignment trimming to retain phylogenetically informative sites and remove those dominated by gaps; –min_taxonomic_rank , which ensures only reference sequences resolved to at least this rank prefix (‘d’ for domain, ‘p’ for phylum, etc.) are included in the reference package; –fast , which selects FastTree2 over RAxML-NG for phylogenetic inference; –headless , which skips manual selection of centroid sequences for each cluster; –cluster , which activates candidate reference sequence deduplication with MMSeqs2 pairwise alignment clustering; and –similarity , which sets the identity threshold (in this case, 0.97). This final parameter has widespread impact on reference package utility. Values approaching 1.0 will retain more reference sequences, allowing for relatively more resolved taxonomic classifications at the cost of greater resource expenditure to compute multiple alignments and phylogenetic placement. This parameter can be tuned to produce a reference package of reasonable size, allowing for classifications within an acceptable amount of time, with 1000 sequences being a reasonable upper bound.

3.Check reference package purity.

The TIGRFAM database seed is used to identify whether any unexpected orthologous groups were included in the initial reference packages. Internally, treesapp purity calls treesapp assign with the target reference package before summarizing the prevalence of hits in the phylogeny.

  • treesapp purity \
  • --refpkg_path bp1_create/final_outputs/McrA_build.pkl \
  • -i protocols_reqs/TIGRFAM_seed_named.faa \
  • -x protocols_reqs/TIGRFAM_info.tsv \
  • -o bp1_purity/ \
  • --num_procs 4

The output indicates that four query sequences (i.e., TIGRFAM hits) were classified (Table 1). Sequences were placed at four tips of the tree and covered 2.2% of the phylogeny. Since TIGR03256 represents the McrA protein family, the reference package is assumed to be pure.

Table 1. Treesapp Purity Classification Results for the McrA Reference Packagea
TIGR family Hits Leaves Coverage (%) Description
TIGR03256 4 4 2.2 Methyl-coenzyme M reductase, alpha subunit
  • a

    Hits refer to the number of query (TIGRFAM) sequences classified by the corresponding reference package. Coverage is the percentage of the reference package's phylogenetic tree leaves that were descendants of those hits. Greater coverage suggests the query sequences were placed more deeply in the phylogeny.

Support Protocol 1: INSTALLING TreeSAPP

TreeSAPP was written in Python 3 but relies on multiple software dependencies that must also be installed for proper functionality. TreeSAPP and its dependencies can be installed either by Conda or an image of a software container. Regardless of the installation method, the TreeSAPP software and protocols herein will only work on computers running Linux or Mac operating systems.

Option 1: Installation using the Conda package manager

The recommended and most efficient method for installing TreeSAPP is by way of the platform-independent package manager, Conda. Package managers such as Conda are reliable tools that install a desired software and its respective dependencies while also ensuring all dependency versions are compatible with one another. The following two commands create a new Conda environment called treesapp cenv (this can be changed, just remember to be consistent), install TreeSAPP and its dependencies in this environment, and then activate the environment so that TreeSAPP is ready to be used.

  • conda init bash
  • conda create -n treesapp_cenv -c bioconda -c conda-forge treesapp
  • conda activate treesapp_cenv

Option 2: Running TreeSAPP through a container-based image

TreeSAPP can also be invoked through two container-based platforms, ensuring consistent versions and therefore maximal reproducibility between analyses. TreeSAPP containers are offered for both Docker and Singularity. Docker is suitable for compute environments where the user has root privileges (e.g., cloud systems and personal computers), whereas Singularity is amenable to systems where the user is operating on systems without root privileges (e.g., departmental servers and scientific computing grids). These images are updated with every version of TreeSAPP released.

Docker

The latest Docker container can be pulled from a Quay.io repository via the following command.

  • docker pull quay.io/hallamlab/treesapp:latest

In most if not all cases, some directory of the local system will need to be bound to make it accessible to the Docker container. Otherwise, TreeSAPP will not be able to read the inputs you provide or write the outputs to your local storage. More information on what is required can be obtained from the official Docker documentation (docker service create; see Internet Resources). In the following example command, the user's Desktop directory is bound to the Docker container's /data directory so all files in Desktop can be accessed via /data.

  • docker run -t -i -d \
  • –mount type=bind,src=/home/$USER/Desktop,dst=/data \
  • quay.io/hallamlab/treesapp:latest

Singularity

The latest Singularity container can be downloaded from the Singularity repository by:

  • singularity pull library://cmorganl/default/treesapp:latest

The image will be downloaded in Singularity Image Format (.sif) as treesapp_latest.sif.

Invocation

Running TreeSAPP through the containers requires slight modification to all commands. With the Docker container, the first step is to find out the name of the running image, then execute commands within the image. This can be accomplished by the following commands:

  • docker ps # to find the name, thirsty_shockley as an example
  • docker exec thirsty_shockley treesapp assign -h

Using a Singularity container is very similar to using a Docker image, but instead of referencing a running image, Singularity is pointed to the downloaded .sif file.

  • singularity exec treesapp_latest.sif treesapp assign -h

Similar to Docker, if the input or output data are located on a drive other than the boot drive, the mounted drive(s) will need to be bound with the –bind argument. For example, to mount the /Desktop directory as before:

  • singularity exec –bind /home/$USER/Desktop:/data \
  • treesapp_latest.sif treesapp assign \
  • -i /data/input.fasta -o /data/assign_out/

Support Protocol 2: ANNOTATING TRAITS WITHIN A PHYLOGENETIC CONTEXT

In this protocol, taxa in the reference package will be associated with their respective traits, i.e., different methane and short-chain alkane transformations. Trait annotations can be used both to automate trait classification of queries with treesapp layer and to illustrate trait linkages on a phylogenetic tree using the iTOL web interface with treesapp colour.

1.Install TreeSAPP as described (see Support Protocol 1).

2.Loading trait annotations.

Trait annotations are entered into the reference package with the treesapp package subcommand. treesapp package requires the path to a reference package (.pkl), a trait label, and a tabular file to link taxa to traits. Methane metabolism pathways were compiled for each taxon in the McrA reference package based on a literature review. All taxa were associated with a single orthologous group. The first column in the resulting table is a taxon label (e.g., g__Methanothrix) and the second column is the trait (e.g., Aceticlastic).

  • mkdir sp2/
  • treesapp package edit feature_annotations Pathways \
  • --refpkg_path protocols_reqs/refpkg/update/McrA_build.pkl \
  • --taxa_map protocols_reqs/Mcr_taxonomy-phenotype_map.tsv \
  • --output sp2/

Optionally, this reference package can be overwritten to include the trait annotations by replacing the –output parameter with the –overwrite flag:

  • treesapp package edit feature_annotations Pathways \
  • –refpkg_path protocols_reqs/refpkg/update/McrA_build.pkl \
  • –taxa_map protocols_reqs/Mcr_taxonomy-phenotype_map.tsv \
  • –overwrite

These traits were written to the feature annotations attribute of the McrA reference package and indexed under the “Pathways” label. By indexing each trait separately, multiple sets of traits can be affiliated with each reference sequence. For example, further traits could be entered into the reference package designating the substrate ranges, ecosystem preferences, or enzyme kinetics exhibited by taxa. Once all trait-based annotations are saved, reference packages are ready for trait classification.

3.Perform trait classification with treesapp layer.

Assignment at trait-level resolution occurs in two steps. First, treesapp assign classifies query sequences to the reference package, then treesapp layer maps query sequences to features according to their location in the tree (i.e., placement edge).

  • treesapp layer \
  • --refpkg_dir sp2/ \
  • --treesapp_output bp3/metag/sak_2013_06_06_120m/

The output of this procedure is a new classification table called layered_classifications.tsv , with a column for each trait in the reference package found in bp3/metag/sak_2013_06_06_120m/final outputs/.

4.Create iTOL color and style files.

The color subcommand generates files to visualize taxonomy or other features inherent to the phylogeny (such as orthologous groups) in iTOL (Letunic & Bork, 2019). A color-styles file colors the phylogeny's clades or labels and adds a legend, while the color-strip file creates a colorized strip encompassing the phylogeny. Multiple layers can be created to convey sophisticated evolutionary information in a digestible format. The treesapp colour commands used to paint the phylogeny of the McrA reference package are as follows:

  • treesapp colour --palette RdYlBu \
  • --attribute Pathways \
  • --unknown_colour gray \
  • -r sp2/McrA_build.pkl
  • treesapp colour --palette BrBG \
  • --rank_level class \
  • -r sp2/McrA_build.pkl

Four text files are written to the current working directory:

  • McrA_class_colours_style.txt
  • McrA_class_colour_strip.txt
  • McrA_Pathways_colours_style.txt
  • McrA_Pathways_colour_strip.txt.

5.View annotations in iTOL.

The iTOL tool can be used to visualize phylogenetic trees with placements by reading the .jplace file generated by treesapp assign. All files (excluding color and style files created in the previous step) are in the iTOL output directory in a treesapp assign output directory. After navigating a web browser to https://itol.embl.de/ and logging in, change the page to “My Trees”. Once there, upload the McrA complete_profile.jplace file from bp3/metag/sak 2013_06_0 (or the output for any other sample). Click on the highlighted file name to go to the iTOL viewing window. At this point, the annotation files can be uploaded by clicking and dragging McrA_labels.txt and the colors and styles files from step three. Without much more editing, the result should look identical to Figure 2.

A fully annotated McrA phylogeny visualized in the Interactive Tree of Life (iTOL).
A fully annotated McrA phylogeny visualized in the Interactive Tree of Life (iTOL).

Basic Protocol 2: UPDATING REFERENCE PACKAGES

This protocol describes how to update a reference package for McrA, first with homologous sequences identified in complete reference genomes and then with MAGs sourced from the Genomes from Earth's Microbiomes (GEM) catalog (Nayfach et al., 2020), to ensure the reference package is up to date with respect to known lineages. Figure 3 demonstrates this process. Reference packages are not intended to be static depictions of microbial diversity within a single database, or even our collective understanding in the current moment. Rather, they must grow alongside an expanding tree of life. To this end, reference packages can be updated by traversing a genomic information hierarchy encompassing different levels of biological organization and genome complexity (Basher, McLaughlin, & Hallam, 2020). While isolate genomes are the most reliable source of reference sequences, integrating SAGs and MAGs can increase the representation of new clades within a reference package. This is especially relevant when profiling environments where endemic diversity may not be represented in reference packages derived solely on the basis of public repositories or isolate sequences. However, specific caveats apply when using metagenomic sequence information to update reference packages. Bowers et al. (2017) have established community standards for evaluating the quality of SAGs and MAGs based on completeness and contamination (and occasionally strain-heterogeneity) metrics, calculated using software applications such as CheckM (Parks, Imelfort, Skennerton, Hugenholtz, & Tyson, 2015). Although SAGs are inherently incomplete, they are organismal in nature and should be used when available to update reference packages. Medium- and high-quality MAGs can also be used to update reference packages with caution. Although MAGs can be rendered as circular genomes (Singleton et al., 2021), they represent a closely related population of microbes and frequently contain misclassified material (Meziti et al., 2021; Shaiber, Eren, Shaiber, & Eren, 2019). Following these guidelines, the McrA reference package will be iteratively updated using complete reference genomes and MAGs. Archaeal SAGs were not leveraged here, as previously conducted searches found that the limited number available in public databases did not encode novel MCR subunits.

Conceptual diagram of the iterative update workflow with subsequent placement of identical McrA query sequences on the reference package phylogenies. All tree branches are consistent with the scale provided.
Conceptual diagram of the iterative update workflow with subsequent placement of identical McrA query sequences on the reference package phylogenies. All tree branches are consistent with the scale provided.

TreeSAPP integrates homologous sequences from proteomes and genomes into reference packages. It is unable, however, to update a reference package using protein sequences directly from a FASTA file. Rather, protein sequences must first be classified and organized by the module treesapp assign before they can be integrated into a reference package. Thus, each treesapp update command is preceded by a call to treesapp assign in this protocol.

1.Install TreeSAPP as described (see Support Protocol 1).

2.Assign query sequences from reference proteomes.

The first of two updates will leverage 1.8 million protein sequences encoded by 774 complete reference genomes. By default, treesapp assign searches and classifies query sequences using reference packages that are installed with the software package. In the following command, the McrA reference package created using Basic Protocol 1 is used instead by providing its parent directory with the –refpkg_dir argument.

  • treesapp assign \
  • -i protocols_reqs/RefSeq_Archaea_CompleteGenomes.fasta \
  • -o bp2_refseq_assign/ \
  • --refpkg_dir protocols_reqs/refpkg/seed/ \
  • --num_procs 8 --trim_align

All McrA homologs have been classified by treesapp assign and formatted for treesapp update to use directly from the output. The number of sequences classified can be counted by using either:

  • grep -c "ˆ>" bp2_refseq_assign/final_outputs/*_classified.faa

which counts the number of lines matching a header pattern in the FASTA containing classified protein sequences, or

  • tail -n +2 bp2_refseq_assign/final_outputs/classifications.tsv | wc -l

which counts the number of lines in the classification table file excluding the header. In either case, the commands should return the number of classified sequences as 259.

3.Update reference package with McrA homologs.

Taking the 259 classified McrA sequences, it is now possible to integrate novel sequences into the existing reference set and create a new reference package. Novel sequences are the centroids of clusters, inferred by pairwise-alignment clustering with MMSeqs2, that do not contain current reference sequences. As before, the cluster radius is controlled by the –similarity threshold. The treesapp assign output (bp2_refseq_assign/) is provided by the –treesapp_output parameter. treesapp update requires a path to a reference package to be updated through the –refpkg path parameter (note the difference between this parameter and –refpkg_dir in treesapp assign). Similar to treesapp create , treesapp update can pull taxonomic lineage labels from Entrez and replace the labels from treesapp assign. This mode is activated by the –skip_assign flag. The remaining parameters are identical to those used by treesapp create because that command is invoked once the novel sequences have been sorted.

  • treesapp update \
  • –treesapp_output bp2_refseq_assign/ \
  • –refpkg_path protocols_reqs/refpkg/seed/McrA_build.pkl \
  • -o bp2_refseq_update/ –num_procs 8 –similarity 0.97 \
  • –headless –fast –skip_assign –cluster

The runtime log printed to the terminal window records the number of candidate reference sequences, the number of assigned sequences to be used in the update (i.e., novel sequences), and the number of references (or leaf nodes) in the updated reference package. The updated McrA reference package should contain 183 sequences.

4.Assign query sequences from archaeal MAGs

The McrA homologs encoded by archaeal MAGs must be classified, analogous to the process used for McrA encoded in reference genomes. One difference between these datasets is that the MAGs are provided as genomic sequences, and therefore open reading frames (ORFs) must first be predicted and conceptually translated. While the command-line arguments are unaffected, it will take more time to identify McrA homologs due to this additional step.

  • treesapp assign \
  • -i protocols_reqs/IMG_Archaeal_MAG_MCR_contigs.fasta \
  • -o bp2_mag_assign/ \
  • –refpkg_dir bp2_refseq_update/final_outputs/ \
  • –num_procs 8 –trim_align

The following command shows 885 McrA homologs were identified.

  • grep -c "^>" bp2_mag_assign/final_outputs/*_classified.faa

5.Update reference package with McrA homologs.

The taxonomic identities of these MAGs were predicted using GTDB-Tk (Chaumeil, Mussig, Hugenholtz, & Parks, 2019). Given that these MAGs were not accessioned in the NCBI database, their taxonomic lineages need to be passed to treesapp update in a tabular file with the –seqs2lineage parameter. This file contains two fields: a sequence ID and a SILVA- or GTDB-formatted lineage. It can be viewed with the following command:

  • less -S protocols_reqs/GEM_seqs2lineage.tsv

The second reference package update iteration can be accomplished with these MAG-derived McrA sequences to produce a more comprehensive and representative McrA reference package.

  • treesapp update \
  • --treesapp_output bp2_mag_assign/ \
  • --refpkg_path bp2_refseq_update/final_outputs/McrA_build.pkl \
  • -o bp2_mag_update/ --num_procs 8 --similarity 0.97 \
  • --seqs2lineage protocols_reqs/GEM_seqs2lineage.tsv \
  • --headless --fast --skip_assign --cluster

There should be 414 sequences/leaves in the reference package after this second update.

Basic Protocol 3: CALCULATING RELATIVE ABUNDANCE OF GENES IN METAGENOMIC AND METATRANSCRIPTOMIC DATASETS

This protocol calculates normalized relative abundance values for classified query sequences. To demonstrate this, TreeSAPP will be used to calculate abundance of McrA genes and transcripts using metagenomic and metatranscriptomic datasets sourced from three depths in meromictic Sakinaw Lake, British Columbia, Canada. The Sakinaw Lake water column is stratified, traversing fresh and oxic surface waters (to ∼30 m depth), a sharp chemocline (∼10 m), and saline and reduced (i.e., sulfidic and anoxic) basin water (to 140 m) (Perry & Pedersen, 1993). The redox transition zone (i.e., sulfate-methane transition zone) lies at 45 m (Vagle, Hume, McLaughlin, MacIsaac, & Shortreed, 2010). Previous studies indicate that Sakinaw Lake harbors numerous uncultivated microbial dark matter lineages (Gies, Konwar, Beatty, & Hallam, 2014; Rinke et al., 2013).

The relative abundance calculation requires four steps. After installation, the treesapp assign subcommand identifies sequences with putative homology to a reference package's sequences and places those in the reference phylogeny. treesapp abundance aligns a sample's reads to all ORF nucleotide sequences with BWA-MEM (Li, 2013), then uses the samsum package (https://github.com/hallamlab/samsum) to calculate the transcripts per million (TPM) value of each classified ORF (Wagner, Kin, & Lynch, 2012). treesapp assign is capable of both classification and relative abundance calculation by calling treesapp abundance internally, as shown in step 2 of this protocol. Alternatively, treesapp abundance can be executed independently, with the main benefit of being able to recruit reads from multiple datasets in a single command. Regardless of the approach used, the ORFs must have been predicted by TreeSAPP for the read alignments to be properly mapped. The protocol concludes with a visual comparison of McrA TPM between the metagenomic and metatranscriptomic datasets.

1.Install TreeSAPP as described (see Support Protocol 1).

2.Create an updated McrA reference package as described (see Basic Protocol 2).

3.Classify metagenome homologs and calculate TPM in one step.

In this step treesapp assign is used to predict, classify, and assign taxonomy to conceptually translated McrA ORFs. It also calls treesapp abundance internally to calculate TPM. A shell for loop is used to run treesapp assign for each metagenome fasta-fastq pair.

  • for spath in protocols_reqs/SL/meta*/sak_2013_*m/
  • do
  • output=spath | awk -F/ ’{ OFS="/"; print 4 }’)
  • mkdir -p bp3/$output
  • treesapp assign \
  • -i output/ \
  • --reads $spath/*_2M.fastq.gz \
  • --rel_abund --pairing pe \
  • --refpkg_dir bp2_mag_update/final_outputs/ \
  • --num_procs 8 --trim_align --delete
  • done

In the end, there should be 6 classification table files with 110 assigned McrA sequences.

  • cat bp3/meta*/sak_2013_*m/final_outputs/classifications.tsv | \
  • grep -v Query | wc -l

4.Calculate abundance of McrA transcripts.

Here, metatranscriptomic reads are recruited to ORFs predicted from the metagenomes for each depth. Instead of creating a new output directory, the TPM values are appended to the classification tables (controlled by –report append), such that relative abundances for both the metagenomic and metatranscriptomic datasets are in the same file.

  • for metat in protocols_reqs/SL/metat/sak_2013_*m/
  • do
  • sid=metat | awk -F/ ’{ print $4 }’)
  • treesapp abundance \
  • --treesapp_output bp3/metag/$sid/ \
  • --reads $metat/*_2M.fastq.gz --pairing pe \
  • --refpkg_dir bp2_mag_update/final_outputs/ \
  • --report append --num_procs 8 --delete
  • done

5.Visualize differential abundances.

The classification tables in each output directory's final_outputs/classifications.tsv are ready to be analyzed. They can be concatenated using the commands:

  • cat bp3/meta*/sak*m/final_outputs/classifications.tsv | \
  • head -n 1 >bp3/combined_classifications.tsv
  • for f in bp3/meta*/sak*m/final_outputs/classifications.tsv; do
  • tail -n +2 $f >>bp3/combined_classifications.tsv
  • done

The R-script protocols_reqs/treesapp_visualiser.R generates plots from the classifications with the following command:

  • Rscript protocols_reqs/treesapp_visualiser.R \
  • -t bp3/combined_classifications.tsv -o bp3/ -s protocols_reqs/sample_name_map.tsv

McrA-encoding archaea are predominantly affiliated with the orders Methanosarcinales and Methanomicrobiales (Fig. 4). The methanotrophic Methanophagales (also known as ANME-1) were neither abundant nor transcriptionally active at the depths analyzed. McrA transcripts were generally more abundant than gene copies in Sakinaw Lake at the depths and times analyzed. Furthermore, metatranscriptomic TPM values derived from alignments using metagenome ORFs as the reference were greater than those using metatranscriptome ORFs. Several factors may have led to the lower recruitment rates for metatranscriptome ORFs, including poor assembly of McrA transcripts in these samples.

Note
A two-million read (i.e., one million mate-pairs) subset of the sequenced reads was processed for each sample to reduce computation time. For this reason, the taxonomic abundance barplot created will be different than that shown in the publication. The version in the publication was retained to convey scientific accuracy, and should resemble the version created by this tutorial closely enough to indicate whether the protocol worked as intended.

Plot generated from relative abundance values associated with Mcr genes and transcripts identified in Sakinaw Lake, as calculated by TreeSAPP. Key: MetaG, TPM calculated by aligning metagenome reads to metagenome ORFs; MetaT, calculated from metatranscriptome reads aligned to metatranscriptome ORFs; MetaT.to.MetaG, calculated from metatranscriptome read alignments to metagenome ORFs.
Plot generated from relative abundance values associated with Mcr genes and transcripts identified in Sakinaw Lake, as calculated by TreeSAPP. Key: MetaG, TPM calculated by aligning metagenome reads to metagenome ORFs; MetaT, calculated from metatranscriptome reads aligned to metatranscriptome ORFs; MetaT.to.MetaG, calculated from metatranscriptome read alignments to metagenome ORFs.

COMMENTARY

Background Information

TreeSAPP is one of the latest gene-centric analysis tools for microbial ecology. It was inspired by the now-deprecated MLTreeMap, a software package that visualized the taxonomic profile of query sequences on a reference package's phylogeny (Stark, Berger, Stamatakis, & von Mering, 2010). Similar to MLTreeMap, TreeSAPP uses phylogenetic placement to assign queries a taxonomic label due to its increased precision over pairwise alignment alone (Berger, Krompass, & Stamatakis, 2011; Morgan-Lang et al., 2020). This improvement comes with a computational cost when compared to pairwise alignment methods such as DIAMOND or BLAST (Altschul, Gish, Miller, Myers, & Lipman, 1990; Buchfink, Xie, & Huson, 2014). Moreover, a reference or backbone phylogeny is also required to leverage phylogenetic placement, increasing the overhead before classification can begin. Two tools with similar functionality and implementation are GraftM and PhyloMagnet (Boyd et al., 2018; Schön et al., 2020). The four primary benefits of using TreeSAPP over these tools are: automated collection of taxonomic lineages for accessioned sequences to reduce manual effort in building reference packages; increased classification performance; automated annotation of traits; and export of .jplace and annotation files for visualizing placement of queries on backbone phylogenies. The lone objective advantage of GraftM over TreeSAPP is faster processing, since the better-supported and more-popular tools Prodigal and EPA-NG were used over OrfM and pplacer (Matsen et al., 2010; Woodcroft et al., 2016) for ORF prediction and phylogenetic placement, respectively.

Analyzing microbial communities from a gene-centric perspective is most valuable when complemented by more-comprehensive and higher-throughput sequence annotation such as with MetaPathways (Konwar et al., 2015) or eggNOG-mapper (Huerta-Cepas et al., 2017; Konwar et al., 2015). Community-wide metabolic potentials inferred by these tools are readily queried and focused on metabolisms of interest. At this point, TreeSAPP can be used to better taxonomically profile specific reactions through the use of marker gene reference packages. Alternatively, TreeSAPP is able to query many datasets at scale to identify samples with relevant features for more in-depth analysis using comprehensive annotation workflows. Finally, TreeSAPP can be applied in isolation to identify marker genes of interest and to quantify them with respect to genome, transcript, or peptide abundance. Numerous other applications potentially exist, including primer/probe development, and we welcome community feedback to further develop and improve TreeSAPP.

Critical Parameters

Default parameters are recommended for most use cases and should not be tuned without a strong understanding of their impact on TreeSAPP performance. These defaults have been set according to benchmarks covering a diverse range of reference packages. Nonetheless, some parameters may need to be modified or invoked in some cases, and those that have not already been described in the protocols are listed below.

treesapp create

guarantee can be used to ensure that sequences make it into the final set of reference sequences, regardless of whether they would normally be removed by filters and clustering. Its argument is a FASTA file. These do not need to be in the FASTA provided to –fasta_input.

profile can be used to extract homologous regions from the input sequences with a profile HMM that models the target domain or protein family. For example, if proteins in the target orthologous group manifest multiple different architectures, an HMM can be provided to trim the inputs to a core conserved domain or locus. This has an additional benefit of improving the multiple alignment and phylogeny. This is useful if the sequences being used to build a reference package were downloaded from one or more databases and have not been manually validated.

deduplicate performs a high-identity (99.9%) sequence clustering to reduce the number of time-consuming queries made to the Entrez database.

treesapp update

min_seq_length restricts protein sequences that are shorter than this value from being added to the reference package. Sequences falling below this threshold are removed from the set of classified sequences immediately so as to not interfere with downstream clustering that determines whether candidate reference sequences fill in missing phylogenetic space (i.e., not redundant with the existing reference set). By default the value is set to two-thirds of the reference package's profile HMM length.

Troubleshooting

A list of possible errors and solutions is provided in Table 2.

Table 2. Sources of and Solutions to Potential Errors
Problem Possible cause Solution
An error mentioning a file already exists TreeSAPP exited from a previous run incorrectly and is attempting to reuse outputs Execute the TreeSAPP subcommand with the –overwrite flag or change the output directory path
No hits to reference package with treesapp purity The reference database (e.g., TIGRFAM seed) does not include sequences homologous to the reference package Use a larger, more comprehensive database such as eggNOG
“Clade exclusion analysis could not be performed for training the reference package models” from treesapp create This error can arise when the reference package being created is very small or if the reference sequences lack species-level taxonomic labels The reference package .pkl file can be used, but treesapp assign results will be prone to over-classification. Try updating with more sequences and retraining the reference package, if possible.

Understanding Results

Given that the use case results presented in these protocols represent ideal conditions with a highly curated dataset, a more general discussion is needed to prepare users for gene-centric analysis using TreeSAPP with their own datasets. Emphasis is placed on using treesapp create and assign based on the recognition that quality reference packages are integral to producing biologically meaningful results.

Users are encouraged to manually review each reference package prior to use, even published reference packages or those available on the TreeSAPP Github repository, to ensure that relevant sequences are present for the lineage or environment under study. In some cases, candidate reference sequences may not represent true orthologs, leaving room for functionally dissimilar sequences to be included in a reference package. Of the multiple reference package components, the phylogenetic tree is most indicative of quality. By scrutinizing the phylogenetic tree, users can quickly spot long branches and polyphyletic clades that could indicate gene duplication, horizontal gene transfer, or non-orthologous relationships. For example, a sequence or clade that is separated from the majority of the nodes by a long branch can be confirmed by treesapp purity (see Basic Protocol 1) and resolved by removing unwanted sequences from the candidate set and rebuilding the reference package with treesapp create. Manual inspection of these sequences is also recommended to ensure that they were annotated correctly.

TreeSAPP classifications are generally accurate when assigning protein sequences to orthologous groups and taxonomic lineages. However, under several circumstances the precision of TreeSAPP classification can be diminished. These include the presence of promiscuous protein domains in otherwise nonhomologous sequences and the evaluation of datasets from underrepresented communities containing novel sequences or lineages.

Hazards of gene-centric analysis. Promiscuous protein domains may lead to false positive classifications, and these may be recognized by referring to each query's evolutionary distance in the classification table. Generally, any distance value of ∼1.0 or greater could be a false positive and—by combining a priori knowledge such as relationships to other proteins—can be used to implement custom filters in downstream data processing and visualization workflows. For example, if a reference package is built for a cytochrome-type protein where domains are shared with many functionally diverse enzymes, implementing a stricter evolutionary distance value may be beneficial. Alternatively, reference packages can be updated and annotated with “bait” sequences that contain the shared domain but are not the target protein. Post-classification, these can be automatically distinguished from the target homologs using treesapp layer and removed from further analysis.

Novel communities. In some instances, TreeSAPP may not resolve taxonomic lineages for a subset of sequences within a microbial community, particularly an underrepresented community containing novel sequences or lineages. This could indicate a biological signal for one or more deeply branching clades or simply poor representation of related sequences within the reference package. This issue can be resolved by iteratively updating the reference package with isolate, SAG, or MAG sequences sourced from the same or similar environment under study (as in Basic Protocol 2). The query sequence FASTA files will need to be reprocessed entirely with the updated reference package.

In addition to classification challenges, users should be aware of how best to interpret TreeSAPP outputs related to count data. Specifically, taxonomic abundance profiles should not be inferred from TreeSAPP classification results without also invoking treesapp abundance with sequence read data (as in Basic Protocol 3). Without calculating normalized abundance values from read mapping data, the raw counts represent the number of homologs identified in the assembled unique sequence fragments (i.e., contigs or scaffolds). These sequences represent the richness of one or more orthologous groups and could be used to estimate the functional redundancy or taxonomic diversity of a community depending on the types of reference packages queried. For more quantitative comparisons between samples, normalized abundance based on read mapping is needed.

Time Considerations

The Basic Protocols are expected to take a total of 70 min to complete: ∼15 min for Basic Protocol 1; ∼20 min for Basic Protocol 2; and ∼35 min for Basic Protocol 3. Additionally, installing TreeSAPP and downloading the requisite data package is expected to require up to 1 hr. The majority of this time is not hands-on and depends on a user's internet download speed and location (relative to Zenodo). Installing TreeSAPP can be completed while downloading the data package in parallel.

Acknowledgments

We would like to thank Tanya Woyke, Natalia Ivanova, Frederik Schulz, and Daniel Udwary at the Joint Genome Institute (JGI, Berkeley, California) for providing metagenomic data from IMG/M and assistance with the National Energy Research Scientific Computing (NERSC) Center systems. Many thanks to Simonetta Gribaldo and Guillaume Borrel at the Institut Pasteur (Paris) for validating metabolic annotations and providing software recommendations. We also thank Kishori Konwar for formative conversations and preliminary code used to guide early TreeSAPP development cycles, as well as members of the Hallam lab and MICB425 Microbial Ecological Genomics for providing iterative feedback and beta testing.

This work was performed under the auspices of the Natural Sciences and Engineering Research Council (NSERC) of Canada, Genome Canada, Genome British Columbia, the Digital Research Alliance of Canada, the G. Unger Vetlesen and Ambrose Monell Foundations, the U.S. Department of Energy (DOE) Joint Genome Institute (JGI), and the Facilities Integrating Collaborations for User Science (FICUS) JGI-NERSC (National Energy Research Scientific Computing) project 503401, supported by the Office of Science of US DOE Contract DE-AC02- 05CH11231N, through grants awarded to S.J.H. C.M.-L. was funded by the NSERC CREATE Bioinformatics Training Program at the University of British Columbia and Genome Canada.

Author Contributions

Connor Morgan-Lang : Conceptualization, data curation, formal analysis, investigation, methodology, resources, software, validation, visualization, writing original draft; Steven J. Hallam : Conceptualization, project administration, supervision, writing original draft, writing review and editing.

Conflict of Interest

S.J.H. is a co-founder of Koonkie, Inc., a bioinformatics consulting company that designs and provides scalable algorithmic and data analytics solutions in the cloud.

Open Research

Data Availability Statement

The data that support these protocols are openly available in Zenodo at doi: 10.5281/zenodo.7499961.

Literature Cited

  • Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology , 215(3), 403–410. doi: 10.1016/S0022-2836(05)80360-2
  • Bairoch, A., & Apweiler, R. (1997). The SWISS-PROT protein sequence data bank and its supplement TrEMBL. Nucleic Acids Research , 25(1), 31–36. doi: 10.1093/nar/25.1.31
  • Basher, A. R. M. A., McLaughlin, R. J., & Hallam, S. J. (2020). Metabolic pathway inference using multi-label classification with rich pathway features. PLoS Computational Biology , 16(10), e1008174. doi: 10.1371/journal.pcbi.1008174
  • Berger, S. A., Krompass, D., & Stamatakis, A. (2011). Performance, accuracy, and web server for evolutionary placement of short sequence reads under maximum likelihood. Systematic Biology , 60(3), 291–302. doi: 10.1093/sysbio/syr010
  • Borrel, G., Adam, P. S., McKay, L. J., Chen, L.-X., Sierra-García, I. N., Sieber, C. M. K., … Gribaldo, S. (2019). Wide diversity of methane and short-chain alkane metabolisms in uncultured archaea. Nature Microbiology , 4(4), 603–613. doi: 10.1038/s41564-019-0363-3
  • Bowers, R. M., Kyrpides, N. C., Stepanauskas, R., Harmon-Smith, M., Doud, D., Reddy, T. B. K., … Woyke, T. (2017). Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nature Biotechnology , 35(8), 725–731. doi: 10.1038/nbt.3893
  • Boyd, J. A., Jungbluth, S. P., Leu, A. O., Evans, P. N., Woodcroft, B. J., Chadwick, G. L., … Tyson, G. W. (2019). Divergent methyl-coenzyme M reductase genes in a deep-subseafloor Archaeoglobi. The ISME Journal , 13, 1269–1279. doi: 10.1038/s41396-018-0343-2
  • Boyd, J. A., Woodcroft, B. J., & Tyson, G. W. (2018). GraftM: A tool for scalable, phylogenetically informed classification of genes within metagenomes. Nucleic Acids Resources , 46(10), e59. doi: 10.1093/nar/gky174
  • Buchfink, B., Xie, C., & Huson, D. H. (2014). Fast and sensitive protein alignment using DIAMOND. Nature Methods , 12(1), 59–60. doi: 10.1038/nmeth.3176
  • Caspi, R., Foerster, H., Fulcher, C. A., Hopkinson, R., Ingraham, J., Kaipa, P., … Karp, P. D. (2006). MetaCyc: A multiorganism database of metabolic pathways and enzymes. Nucleic Acids Research , 34, D511–D516. doi: 10.1093/nar/gkj128
  • Chaumeil, P.-A., Mussig, A. J., Hugenholtz, P., & Parks, D. H. (2019). GTDB-Tk: A toolkit to classify genomes with the genome taxonomy database. Bioinformatics , 36(November), 1–3. doi: 10.1093/bioinformatics/btz848
  • Chen, S.-C., Musat, N., Lechtenfeld, O. J., Paschke, H., Schmidt, M., Said, N., … Musat, F. (2019). Anaerobic oxidation of ethane by archaea from a marine hydrocarbon seep. Nature , 568, 108–111. doi: 10.1038/s41586-019-1063-0
  • Dhillon, A., Lever, M., Lloyd, K. G., Albert, D. B., Sogin, M. L., & Teske, A. (2005). Methanogen diversity evidenced by molecular characterization of methyl coenzyme M reductase A (mcrA) genes in hydrothermal sediments of the Guaymas Basin. Applied and Environmental Microbiology , 71(8), 4592–4601. doi: 10.1128/AEM.71.8.4592-4601.2005
  • Evans, P. N., Parks, D. H., Chadwick, G. L., Robbins, S. J., Orphan, V. J., Golding, S. D., & Tyson, G. W. (2015). Methane metabolism in the archaeal phylum Bathyarchaeota revealed by genome-centric metagenomics. Science , 350(6259), 434–438. doi: 10.1126/science.aac7745
  • Falkowski, P. G., Fenchel, T., & Delong, E. F. (2008). The microbial engines that drive Earth's biogeochemical cycles. Science , 320(5879), 1034–1039. doi: 10.1126/science.1153213
  • Fish, J. A., Chai, B., Wang, Q., Sun, Y., Brown, C. T., Tiedje, J. M., & Cole, J. R. (2013). FunGene: The functional gene pipeline and repository. Frontiers in Microbiology , 4, 1–14. doi: 10.3389/fmicb.2013.00291
  • Gies, E. A., Konwar, K. M., Beatty, J. T., & Hallam, S. J. (2014). Illuminating microbial dark matter in meromictic Sakinaw Lake. Applied and Environmental Microbiology , 80(21), 6807–6818. doi: 10.1128/AEM.01774-14
  • Haft, D. H., Selengut, J. D., Richter, R. A., Harkins, D., Basu, M. K., & Beck, E. (2013). TIGRFAMs and genome properties in 2013. Nucleic Acids Research , 41(D1), 387–395. doi: 10.1093/nar/gks1234
  • Hallam, S. J., Girguis, P. R., Preston, C. M., Richardson, P. M., & DeLong, E. F. (2003). Identification of methyl coenzyme M reductase A (mcrA) genes associated with methane-oxidizing archaea. Applied and Environmental Microbiology , 69(9), 5483–5491. doi: 10.1128/AEM.69.9.5483-5491.2003
  • Hallam, S. J., Putnam, N., Preston, C. M., Detter, J. C., Rokhsar, D., Richardson, P. M., & DeLong, E. F. (2004). Reverse methanogenesis: Testing the hypothesis with environmental genomics. Science , 305(5689), 1457–1462. doi: 10.1126/science.1100025
  • Hawley, A. K., Brewer, H. M., Norbeck, A. D., Paša-Tolić, L., & Hallam, S. J. (2014). Metaproteomics reveals differential modes of metabolic coupling among ubiquitous oxygen minimum zone microbes. Proceedings of the National Academy of Sciences of the United States of America , 111(31), 11395–11400. doi: 10.1073/pnas.1322132111
  • Hawley, A. K., Nobu, M. K., Wright, J. J., Durno, W. E., Morgan-Lang, C., Sage, B., … Hallam, S. J. (2017). Diverse Marinimicrobia bacteria may mediate coupled biogeochemical cycles along eco-thermodynamic gradients. Nature Communications , 8(1), 1507. doi: 10.1038/s41467-017-01376-9
  • Huerta-Cepas, J., Forslund, K., Coelho, L. P., Szklarczyk, D., Jensen, L. J., von Mering, C., & Bork, P. (2017). Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper. Molecular Biology and Evolution , 34(8), 2115–2122. doi: 10.1093/molbev/msx148
  • Huerta-Cepas, J., Szklarczyk, D. H., Heller, D., Hernandez-Plaza, A., Forslund, S. K., Cook, H., … Bork, P. (2019). eggNOG 5.0: A hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Research , 47(D1), D309–D314. doi: 10.1093/nar/gky1085
  • Konwar, K. M., Hanson, N. W., Bhatia, M. P., Kim, D., Wu, S. J., Hahn, A. S., … Hallam, S. J. (2015). MetaPathways v2.5: Quantitative functional, taxonomic and usability improvements. Bioinformatics , 31(20), 3345–3347. doi: 10.1093/bioinformatics/btv361
  • Letunic, I., & Bork, P. (2019). Interactive Tree of Life (iTOL) v4: Recent updates and new developments. Nucleic Acids Research , 47(W1), W256–W259. doi: 10.1093/nar/gkz239
  • Levy-Booth, D. J., Navas, L. E., Fetherolf, M. M., Liu, L.-Y., Dalhuisen, T., Renneckar, S., … Mohn, W. W. (2022). Discovery of lignin-transforming bacteria and enzymes in thermophilic environments using stable isotope probing. The ISME Journal , 16, 1944–1956. doi: 10.1038/s41396-022-01241-8
  • Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. In arXiv.1303.3997 [q-bio.GN]. doi: 10.48550/arXiv.1303.3997
  • Lueders, T., Chin, K. J., Conrad, R., & Friedrich, M. (2001). Molecular analyses of methyl-coenzyme M reductase α-subunit (mcrA) genes in rice field soil and enrichment cultures reveal the methanogenic phenotype of a novel archaeal lineage. Environmental Microbiology , 3(3), 194–204. doi: 10.1046/j.1462-2920.2001.00179.x
  • Matsen, F. A., Hoffman, N. G., Gallagher, A., & Stamatakis, A. (2012). A format for phylogenetic placements. PLoS One , 7(2), 1–4. doi: 10.1371/journal.pone.0031009
  • Matsen, F. A., Kodner, R. B., & Armbrust, E. V. (2010). pplacer: Linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics , 11, 538. doi: 10.1186/1471-2105-11-538
  • McKay, L. J., Dlakić, M., Fields, M. W., Delmont, T. O., Eren, A. M., Jay, Z. J., … Inskeep, W. P. (2019). Co-occurring genomic capacity for anaerobic methane and dissimilatory sulfur metabolisms discovered in the Korarchaeota. Nature Microbiology , 4, 614–622. doi: 10.1038/s41564-019-0362-4
  • McKay, L. J., Hatzenpichler, R., Inskeep, W. P., & Fields, M. W. (2017). Occurrence and expression of novel methyl-coenzyme M reductase gene (mcrA) variants in hot spring sediments. Scientific Reports , 7, 7252. doi: 10.1038/s41598-017-07354-x
  • Meyer, F., Fritz, A., Deng, Z.-L., Koslicki, D., Lesker, T. R., Gurevich, A., … McHardy, A. C. (2022). Critical assessment of metagenome interpretation: The second round of challenges. Nature Methods , 19(4), 429–440. doi:10.1038/s41592-022-01431-4
  • Meziti, A., Rodriguez-R, L. M., Hatt, J. K., Peña-Gonzalez, A., Levy, K., & Konstantinidis, K. T. (2021). The reliability of metagenome-assembled genomes (MAGs) in representing natural populations: Insights from comparing mags against isolate genomes derived from the same fecal sample. Applied and Environmental Microbiology , 87(6), e02593–20. doi: 10.1128/AEM.02593-20
  • Morgan-Lang, C., McLaughlin, R., Armstrong, Z., Zhang, G., Chan, K., & Hallam, S. J. (2020). TreeSAPP: The Tree-based Sensitive and Accurate Phylogenetic Profiler. Bioinformatics , 36(18), 4706–4713. doi: 10.1093/bioinformatics/btaa588
  • Nasko, D. J., Koren, S., Phillippy, A. M., & Treangen, T. J. (2018). RefSeq database growth influences the accuracy of k -mer-based lowest common ancestor species identification. Genome Biology , 19(1), 165. doi: 10.1186/s13059-018-1554-6
  • Nayfach, S., Roux, S., Seshadri, R., Udwary, D., Varghese, N., Schulz, F., … Eloe-Fadrosh, E. A. (2020). A genomic catalog of Earth's microbiomes. Nature Biotechnology , 39, 499–509. doi: 10.1038/s41587-020-0718-6
  • Nobu, M. K., Narihiro, T., Kuroda, K., Mei, R., & Liu, W.-T. (2016). Chasing the elusive Euryarchaeota class WSA2: Genomes reveal a uniquely fastidious methyl-reducing methanogen. The ISME Journal , 10(10), 2478–2487. doi: 10.1038/ismej.2016.33
  • Orphan, V. J., House, C. H., Hinrichs, K.-U., McKeegan, K. D., & DeLong, E. F. (2002). Multiple archaeal groups mediate methane oxidation in anoxic cold seep sediments. Proceedings of the National Academy of Sciences of the United States of America , 99(11), 7663–7668. doi: 10.1073/pnas.072210299
  • Pachiadaki, M. G., Brown, J. M., Brown, J., Bezuidt, O., Berube, P. M., Biller, S. J., … Stepanauskas, R. (2019). Charting the complexity of the marine microbiome through single-cell genomics. Cell , 179(7), 1623–1635.e11. doi: 10.1016/j.cell.2019.11.017
  • Parks, D. H., Chuvochina, M., Waite, D. W., Rinke, C., Skarshewski, A., Chaumeil, P.-A., & Hugenholtz, P. (2018). A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nature Biotechnology , 36(10), 996–1004. doi: 10.1038/nbt.4229
  • Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P., & Tyson, G. W. (2015). CheckM: Assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research , 25(7), 1043–1055. doi: 10.1101/gr.186072.114
  • Perry, K. A., & Pedersen, T. F. (1993). Sulphur speciation and pyrite formation in meromictic ex-fjords. Geochimica et Cosmochimica Acta , 57(18), 4405–4418. doi: 10.1016/0016-7037(93)90491-E
  • Quast, C., Pruesse, E., Yilmaz, P., Gerken, J., Schweer, T., Yarza, P., … Glöckner, F. O. (2013). The SILVA ribosomal RNA gene database project: Improved data processing and web-based tools. Nucleic Acids Research , 41, D590–D596. doi: 10.1093/nar/gks1219
  • Rahfeld, P., Sim, L., Moon, H., Constantinescu, I., Morgan-Lang, C., Hallam, S. J., … Withers, S. G. (2019). An enzymatic pathway in the human gut microbiome that converts A to universal O type blood. Nature Microbiology , 4, 1475–1485. doi: 10.1038/s41564-019-0469-7
  • Rahfeld, P., Wardman, J. F., Mehr, K., Huff, D., Morgan-Lang, C., Chen, H.-M., … Withers, S. G. (2019). Prospecting for microbial α-N-acetylgalactosaminidases yields a new class of GH31 O-glycanase. The Journal of Biological Chemistry , 294(44), 16400–16415. doi: 10.1074/jbc.RA119.010628
  • Rinke, C., Schwientek, P., Sczyrba, A., Ivanova, N. N., Anderson, I. J., Cheng, J.-F., … Woyke, T. (2013). Insights into the phylogeny and coding potential of microbial dark matter. Nature , 499(7459), 431–437. doi: 10.1038/nature12352
  • Schön, M. E., Eme, L., & Ettema, T. J. G. (2020). PhyloMagnet: Fast and accurate screening of short-read meta-omics data using gene-centric phylogenetics. Bioinformatics , 36(6), 1718–1724. doi: 10.1093/bioinformatics/btz799
  • Shaiber, A., Eren, A. M., Shaiber, A., & Eren, A. M. (2019). Composite metagenome-assembled genomes reduce the quality of public genome repositories. mBio , 10, e00725–19. doi: 10.1128/mBio.00725-19
  • Singleton, C. M., Petriglieri, F., Kristensen, J. M., Kirkegaard, R. H., Michaelsen, T. Y., Andersen, M. H., … Albertsen, M. (2021). Connecting structure to function with the recovery of over 1000 high-quality metagenome-assembled genomes from activated sludge using long-read sequencing. Nature Communications , 12(1), 1–13. doi: 10.1038/s41467-021-22203-2
  • Sorokin, D. Y., Makarova, K. S., Abbas, B., Ferrer, M., Golyshin, P. N., Galinski, E. A., … Koonin, E. V. (2017). Discovery of extremely halophilic, methyl-reducing euryarchaea provides insights into the evolutionary origin of methanogenesis. Nature Microbiology , 2(8), 17081. doi: 10.1038/nmicrobiol.2017.81
  • Stark, M., Berger, S. A., Stamatakis, A., & von Mering, C. (2010). MLTreeMap - accurate maximum likelihood placement of environmental DNA sequences into taxonomic and functional reference phylogenies. BMC Genomics , 11(1), 461. doi: 10.1186/1471-2164-11-461
  • Thauer, R. K. (1998). Biochemistry of methanogenesis: A tribute to Marjory Stephenson. Microbiology , 144(1998), 2377–2406. doi: 10.1099/00221287-144-9-2377
  • UniProt Consortium. (2021). UniProt: The universal protein knowledgebase in 2021. Nucleic Acids Research , 49(D1), D480–D489. doi: 10.1093/nar/gkaa1100
  • Vagle, S., Hume, J., McLaughlin, F., MacIsaac, E., & Shortreed, K. (2010). A methane bubble curtain in meromictic Sakinaw Lake, British Columbia. Limnology and Oceanography , 55(3), 1313–1326. doi: 10.4319/lo.2010.55.3.1313
  • Vanwonterghem, I., Evans, P. N., Parks, D. H., Jensen, P. D., Woodcroft, B. J., Hugenholtz, P., & Tyson, G. W. (2016). Methylotrophic methanogenesis discovered in the archaeal phylum Verstraetearchaeota. Nature Microbiology , 1, 16170. doi: 10.1038/nmicrobiol.2016.170
  • Wagner, G. P., Kin, K., & Lynch, V. J. (2012). Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory in Biosciences , 131(4), 281–285. doi: 10.1007/s12064-012-0162-3
  • Wang, Y., Wegener, G., Hou, J., Wang, F., & Xiao, X. (2019). Expanding anaerobic alkane metabolism in the domain of Archaea. Nature Microbiology , 4, 595–602. doi: 10.1038/s41564-019-0364-2
  • Wardman, J. F., Rahfeld, P., Liu, F., Morgan-Lang, C., Sim, L., Hallam, S. J., & Withers, S. G. (2021). Discovery and development of promiscuous O-glycan hydrolases for removal of intact sialyl T-antigen. ACS Chemical Biology , 16(10), 2004–2015. doi: 10.1021/acschembio.1c00316
  • Wheeler, D. L., Barrett, T., Benson, D. A., Bryant, S. H., Canese, K., Chetvernin, V., … Yaschenko, E. (2007). Database resources of the National Center for Biotechnology Information. Nucleic Acids Research , 35, D5–D12. doi: 10.1093/nar/gkl1031
  • Whitman, W. B., Coleman, D. C., & Wiebe, W. J. (1998). Prokaryotes: The unseen majority. Proceedings of the National Academy of Sciences of the United States of America , 95(12), 6578–6583. doi: 10.1073/pnas.95.12.6578
  • Woodcroft, B. J., Boyd, J. A., & Tyson, G. W. (2016). OrfM: A fast open reading frame predictor for metagenomic data. Bioinformatics , 32(17), 2702–2703. doi: 10.1093/bioinformatics/btw241

Internet Resources

Docker documentation docker service create (2022).

推荐阅读

Nature Protocols
Protocols IO
Current Protocols
扫码咨询