Guidance for populating GenomeTrakr metadata templates (BioSample and SRA)

Ruth Timme, Errol Strain, Maria Balkey, Tina Lusk Pfefer

Published: 2023-02-17 DOI: 10.17504/protocols.io.eq2ly3x1pgx9/v10

Disclaimer

Please note that this protocol is public domain, which supersedes the CC-BY license default used by protocols.io.

Abstract

PURPOSE: Guidance on how to populate NCBI's metadata packages, maximizing interoperability for foodborne pathogen surveillance.

SCOPE: This protocol provides detailed instructions for populating the following two templates:

  1. BioSample metadata: guidelines for obtaining and populating metadata templates describing the sample.

  2. SRA metadata: guidelines for populating the sequence-level metadata template.

Versions:

v6: Added the One Health Enteric package presented at IAFP 2021 meeting.

v7: Updated the picklists in the GenomeTrakr-extended pathogen package, "GT-pathogen package-OHE v0.2.2.xlsx", and added an incremental update file for the DRAFT One Health Enteric Package that includes extensive edits compared to v6.

v8: Updated the picklists in the GenomeTrakr-extended pathogen package, "GT-pathogen package-OHE v0.2.2.xlsx". Also provided a direct link to the newly published One Health Enteric package.

v9: Bug fix

v10: Updates to the GenomeTrakr-extended pathogen BioSample template (GT-pathogen package-OHE v0.3.xlsx) and release of newly available One Health Enteric package custom templates.

Before start

Before collecting sequence data for your isolates, ensure that you can provide the minimum metadata recommended by your coordinating surveillance body. The INSDC, in collaboration with the Global Microbial Identifier (GMI) (https://www.globalmicrobialidentifier.org), recommends using the Pathogen metadata template for pathogen surveillance submissions (NCBI: https://www.ncbi.nlm.nih.gov/pathogens/submit-data/ and EMBL-EBI: https://www.ebi.ac.uk/ena/submit/pathogen-data).

Steps

Overview

1.

This protocol gives guidance for organizing and populating the metadata templates required for direct submission to NCBI. The guidance applies to most enteric and other microbial pathogens.

Note
PulseNet labs: for submissions through BioNumerics, please follow this protocol.

Two metadata templates are required for each NCBI submission:

  1. BioSample metadata (metadata describing the sample source and submitter)

  2. SRA metadata (metadata describing the sequence data collection)

BioSample metadata template

2.

Templates for BioSample submission:

Laboratories can choose one of the following two templates, offered in Step 2.1 and Step 2.2.

Validation: Download and populate the appropriate template from Step 2.1 or Step 2.2, then validate it at https://gmvs.fda.gov/ prior to NCBI submission.

2.1.

NCBI Pathogen package, customized for US labs doing enteric surveillance, including GenomeTrakr labs. This template has been widely used since 2013.

GT-pathogen package v0.3.0.xlsx

2.2.

One Health Enteric Package: new metadata package available now for US labs doing enteric surveillance, including GenomeTrakr labs:

Custom, version-controlled template(s) available for download here: OHE GitHub page.

  • Our custom templates include extensive guidance and controlled vocabularies for most attributes.
  • Sub-packages are available for download covering the major One Health sample types (human/animal hosts, food, food facilities, and farm/environment). Users can choose to populate the full package, or one or more of the sub-packages.

A generic version of this template was published by NCBI in 2022.

SRA sequence metadata template

3.

Template for SRA metadata submission:

Download the generic "Metadata spreadsheet with sample names" file from the NCBI Submission Templates page:

https://submit.ncbi.nlm.nih.gov/templates/

Then follow the guidance in the table in Step 3.1.

PRO TIPS:

  1. If you have sequences to submit that belong to more than one BioProject, create a separate submission + metadata table for each of your BioProjects.
  2. Entering fastq filenames in the spreadsheet: On a Mac, you can copy the file names directly from the folder into a spreadsheet. On a PC this is not possible with copy and paste, but it can be done with a command-line operation.
  3. Finally, develop a QA/QC step to make sure the files are associated with the correct sample name. For example, use the LEFT function in Excel to strip off the text appended to the file name, then use an exact-match comparison to confirm that the result matches the sample name.
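
Tips 2 and 3 can also be handled with a short script instead of copy and paste and spreadsheet formulas. The following is a minimal Python sketch, not part of the GenomeTrakr tooling. It assumes fastq files follow a common Illumina-style naming pattern (for example `<sample>_S1_L001_R1_001.fastq.gz` or `<sample>_R1.fastq`); the function names and the pattern are illustrative and should be adapted to your lab's file naming.

```python
import re
from pathlib import Path

# Assumed naming pattern (hypothetical, adjust to your own files):
# optional _S<n>, optional _L<nnn>, then _R1/_R2, optional _<nnn>, then .fastq/.fq(.gz)
READ_SUFFIX = re.compile(
    r"(_S\d+)?(_L\d{3})?_R([12])(_\d{3})?\.(fastq|fq)(\.gz)?$", re.IGNORECASE
)


def list_fastq_names(folder):
    """Return the sorted fastq file names (names only, no paths) in a folder."""
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and READ_SUFFIX.search(p.name)
    )


def sample_from_filename(filename):
    """Strip the appended read/lane text to recover the sample name, or None."""
    match = READ_SUFFIX.search(filename)
    return filename[: match.start()] if match else None


def check_rows(rows):
    """Return (sample_name, filename) pairs whose file name does not match.

    Each row is a tuple: (sample_name, filename, filename2, ...).
    """
    problems = []
    for sample_name, *filenames in rows:
        for name in filenames:
            if sample_from_filename(name) != sample_name:
                problems.append((sample_name, name))
    return problems


rows = [
    ("UT-12345", "UT-12345_S1_L001_R1_001.fastq.gz", "UT-12345_S1_L001_R2_001.fastq.gz"),
    ("UT-12346", "UT-12345_R1.fastq", "UT-12346_R2.fastq"),  # first file is mismatched
]
print(check_rows(rows))  # [('UT-12346', 'UT-12345_R1.fastq')]
```

An empty list from `check_rows` means every file name maps back to the sample name on its row. This only works when the sample name is the leading part of the file name, which is the same assumption the Excel LEFT approach makes.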
3.1.
| Field | Description | Example |
|---|---|---|
| sample_name | Include the same ID here as you entered for "sample_name" in the BioSample submission template. Populate this field using the values in the PHA4GE specification for "specimen collector sample ID". | UT-12345 |
| library_ID | The library name should be a unique ID relevant to your workflow. It can be an autogenerated ID from your LIMS or a modification of your sample_name. Populate this field using the values in the PHA4GE specification for "library_id". | UT-12345.6 |
| Title | Short, free-text description that identifies the data on public pages. For example: {methodology} of {organism}: {sample_name} | WGS of Salmonella enterica: UT-12345 |
| library_strategy | Overall sequencing strategy or approach. Choose from the NCBI pick list. | See NCBI SRA pick list (e.g. WGS) |
| library_source | Molecule type used to make the library. | See NCBI SRA pick list (e.g. Genomic) |
| library_selection | Library capture method. | See NCBI SRA pick list (e.g. random, PCR) |
| Library_layout | Choose from the NCBI pick list. | See NCBI SRA pick list; choose "paired" |
| platform | Sequencing platform. | See NCBI SRA pick list (e.g. Illumina) |
| instrument_model | Name of the sequencing instrument. | See NCBI SRA pick list (e.g. Illumina MiSeq, iSeq 100) |
| Design_description | Free-text description of methods. | |
| Filetype | File format name for the raw sequence data. Choose from the NCBI pick list. | See NCBI SRA pick list (e.g. Fastq) |
| Filename | Include ALL of the files resulting from this library. Add additional fields if there are more than two files (e.g. Filename3). Populate this field using the values in the PHA4GE specification for "r1 fastq filename". | genome_r1.fastq (*must be exact) |
| Filename2 | Populate this field using the values in the PHA4GE specification for "r2 fastq filename". | genome_r2.fastq (*must be exact) |
| Filename3-8 | List other fastq file names (e.g. for NextSeq data). | |
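
As an illustration of how the fields in this table fit together, here is a minimal Python sketch that assembles one SRA metadata row for a paired-end library. The function `build_sra_row` and its parameters are hypothetical. The column names are copied from the table above, and the default values are the table's own examples; check the header row and pick lists in the template downloaded from NCBI for the exact spelling and capitalization before submitting.

```python
def build_sra_row(
    sample_name,
    library_id,
    organism,
    r1_filename,
    r2_filename,
    methodology="WGS",
    instrument_model="Illumina MiSeq",
    design_description="",
):
    """Assemble one SRA metadata row as a dict keyed by the table's field names."""
    return {
        "sample_name": sample_name,  # same ID as in the BioSample template
        "library_ID": library_id,  # unique per library
        # Title pattern from the table: {methodology} of {organism}: {sample_name}
        "Title": f"{methodology} of {organism}: {sample_name}",
        "library_strategy": methodology,
        "library_source": "Genomic",  # example value from the table
        "library_selection": "random",  # example value from the table
        "Library_layout": "paired",
        "platform": "Illumina",
        "instrument_model": instrument_model,
        "Design_description": design_description,
        "Filetype": "Fastq",  # example value from the table
        "Filename": r1_filename,  # must match the uploaded file name exactly
        "Filename2": r2_filename,
    }


row = build_sra_row(
    "UT-12345", "UT-12345.6", "Salmonella enterica",
    "genome_r1.fastq", "genome_r2.fastq",
)
print(row["Title"])  # WGS of Salmonella enterica: UT-12345
```

Building rows from a single function keeps the Title and file name columns consistent with the sample name, which reduces the manual editing the QA/QC step has to catch.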

Save the second sheet (SRA_data) as a TSV (tab-delimited file) for upload in the “SRA metadata” tab within the submission portal.

*NCBI should also accept the original Excel-formatted file.
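
If the metadata rows are prepared outside Excel, the tab-delimited file can be written directly. The sketch below uses Python's standard `csv` module; the function name `write_sra_tsv` and the output file name are illustrative assumptions, and the header is taken from the keys of the first row, so those keys need to match the column names in the NCBI template.

```python
import csv


def write_sra_tsv(rows, path):
    """Write a list of dict rows to a tab-delimited file with a header line."""
    if not rows:
        raise ValueError("no rows to write")
    fieldnames = list(rows[0].keys())  # column order follows the first row
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)  # raises ValueError if a row has unexpected keys


example_rows = [
    {"sample_name": "UT-12345", "library_ID": "UT-12345.6", "Filename": "genome_r1.fastq"},
]
write_sra_tsv(example_rows, "SRA_data.tsv")  # hypothetical output file name
```

Only three columns are shown to keep the example short; a real submission would carry every column listed in Step 3.1.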
