Quality control assessment for microbial genomes: GalaxyTrakr MicroRunQC workflow

Ruth Timme, Errol Strain, Maria Balkey, Tina Lusk Pfefer, Yesha Shrestha, Paul Morin

Published: 2023-03-16 DOI: 10.17504/protocols.io.5jyl8mj16g2w/v4

microbial pathogen survielliance

Disclaimer

Please note that this protocol is public domain, which supersedes the CC-BY license default used by protocols.io.

Abstract

PURPOSE: Step-by-step instructions for checking WGS sequence quality for bacterial pathogens. The MicroRunQC workflow, implemented in a custom Galaxy instance, will produce quality assessments for raw reads (Illumina paired-end fastq files) and draft de novo assemblies, along with reporting the sequence type for each isolate. This workflow will work on most microbial pathogens, so we advise laboratories to upload their entire MiSeq/NextSeq run through this workflow.

SCOPE: This protocol covers the following tasks:

set up an account in GalaxyTrakr
Create a new history/workspace
Upload data
Execute the MicroRunQC workflow
Interpret the results

Version updates:

V3: updated with Cronobacter thresholds

V4: MicroRunQC updated to V1.1 Includes updates to skeza and mlst methods, as well as adjusted assembly QC thresholds for E.coli. Added Enterobacter QC thresholds to threshold table.

Before start

Steps

Account set up

Create a GalaxyTrakr account here: https://account.galaxytrakr.org/Account/Register

1.1.

Log into your GalaxyTrakr account: https://galaxytrakr.org

Create a new history

Create a new history.

We recommend creating a new history for each new MiSeq Run and including the flow-cell ID and date in the history name.

Save your MicroRunQC output here and any other relevant analyses, like serotyping, or AMR detection.

After all the analysis output from this run is saved to your internal data network or computer, older history's should be purged/deleted so as not to occupy the limited storage space in your account. In some cases it may be useful to save, for a limited time, multiple histories or to run analyses concurrently in multiple histories. In these cases you need to pay attention to your % usage bar (shows % used of allocated storage space) in the upper right corner of the GalaxyTrakr page. If you need additional space you can contact galaxytrakrsupport@fda.hhs.gov and request additional storage.

2.1.

Click on the + icon in the upper right History panel

2.2.

Name your new History by clicking on the “Unnamed history”, type in desired name and hit enter. We recommend including the run cell ID and the date the run was started.

Upload data

This section will describe the process for uploading raw fastq files into your active History panel. After the files have been uploaded they will stay in your account until they are deleted.

3.1.

Click on the Upload Data icon on the top of the left web page to start an upload process.

3.2.

Select "Type (set all):auto-detect." Choose local file button and navigate to the desired fastq files, then click "start" to upload files. These files should be paired (two per sample/isolate).

As the file uploads complete, each row will turn green. Samples in yellow are still in process.

3.3.

You have just upload a set of forward and reverse reads. For further analysis these files need to be paired properly so the platform knows which R1 and R2 files go with each sample/isolate. GalaxyTrakr does this by creating a List of Dataset Pairs.

Within your newly created History panel, click the "check box," then select all the files you just uploaded by clicking "All" or by individually selecting the ones you want to pair.

Screenshot of History panal showing recently uploaded files. Note the way the files are named, using R1 and R2 to identify the paired reads. This will be important in the next step. Some naming conventions can be slightly different.

3.4.

Click "For all selected" and choose "Build List of Dataset Pairs"

3.5.

A new window will open to help you pair the fastq files properly. Note how your paired reads are named.

Select Clear filters , then click Auto-pair.

If auto-pairing does not work, you can click "choose filters" and select the appropriate filter for the pairing:

e.g. choose "_R1 "and "_R2 "

3.6.

Paired reads will pair in the middle column and turn green.

Name your dataset: Example, "pairedSet--"

Click Create list.

3.7.

This paired dataset will now be available for analysis in your history panel. You can run multiple analyses on the same dataset in a history rather than upload the same sequence data to a new history to perform additional analyses. This will help you use your allocated storage space efficiently.

You can re-name this PairedList by clicking on the name.

Run the MicroRunQC workflow

Add the MicroRunQC workflow to your own "workflows" panel. You only have to do this step once for each new workflow you need.

4.1.

Navigate to the “ Shared Data " drop down menu, choose workflows and locate MicroRunQC_v1.1

From Dropdown, select " Import "

4.2.

To see the new imported workflow, click “Workflow” tab on the top panel.

Click "Bookmarked" box to make it available in the left panel under "Workflows"

4.3.

From the Workflow menu on the left panel, select MicroRunQC_v1.1

4.4.

Select paired list dataset you created earlier.

Click Run Workflow . This can take some time depending on the number of samples you are analyzing. If you choose to you can log out of GalaxyTrakr and log back in at a later time to see if the job is completed.

4.5.

Upon completion of the pipeline all tiles in the history bar will be green.

In the “ Filter on Data ## ”, click on the “Eye” icon to view the output table in the GalaxyTrakr window.

Interpret the results

Download and interpret the results:

5.1.

Click “ Filter on data ## ” and then the floppy disc icon. The tabular file can be opened in a text reader or converted to a format (.txt) that can be opened in excel.

5.2.

The MicroRunQC output file includes the following columns:

Parameter | A | B | C | | --- | --- | --- | | Parameter | Input | Description | | Contigs | Assembly | Number of contigs in the de-novo SKESA assembly. Contigs smaller than 200 base-pairs (bp) are not counted. | | Length | Assembly | Total length of all contigs > 200bp. This should approximate the size of the genome for the target organism. | | EstCov | Assembly | Mean coverage for contigs in the SKESA assembly. | | N50 | Assembly | Sequence length of the shortest contig at 50% of the total genome length | | MedianInsert | Read | Distance between forward and reverse reads. Calculated by mapping reads to SKESA assembly using bwa. | | MeanLength_R1 | Read | Mean length of forward read | | MeanLength_R2 | Read | Mean length of reverse read | | MeanQ_R1 | Read | Mean Q-score of forward read | | MeanQ_R2 | Read | Mean Q-score of reverse read | | Scheme | Assembly | PubMLST scheme name (output from mlst application that scans contig files against traditional PubMLST typing schemes. | | ST | Assembly | Sequence Type | | Loci | Assembly | gene (allele number) – for example aroC(118) |

MicroRunQC output table headers. This table lists the summary metrics for sequence quality, number of contigs, and estimated genome size, along with other common metrics for reads (Median Insert Size and Mean Length) and assemblies (N50). Additionally, if the Multi-Locus Sequence Type (MLST) for the isolate is available from pubmlst, the workflow also reports Sequence Type (ST) and the associated alleles. Input Description Contigs Assembly Number of contigs in the de-novo SKESA assembly. Contigs smaller than 200 base-pairs (bp) are not counted. Length Assembly Total length of all contigs > 200bp. This should approximate the size of the genome for the target organism. EstCov Assembly Mean coverage for contigs in the SKESA assembly. N50 Assembly Sequence length of the shortest contig at 50% of the total genome length MedianInsert Read Distance between forward and reverse reads. Calculated by mapping reads to SKESA assembly using bwa. MeanLength_R1 Read Mean length of forward read MeanLength_R2 Read Mean length of reverse read MeanQ_R1 Read Mean Q-score of forward read MeanQ_R2 Read Mean Q-score of reverse read Scheme Assembly PubMLST (pubmlst.org) database scheme (e.g. senterica for Salmonella enterica) ST Assembly Sequence Type Loci Assembly gene (allele number) – for example aroC(118)

**This output should be saved either to your LIMS or to a spreadsheet linked to the sequencing run and samples.

5.3.

Example output for 1 Salmonella and 5 Listeria isolates.

A	B
Srain ID	Lab Confirmation
FDA1216271-C001-001	Listeria mono
FDA817806-S073-001	Listeria mono
FDA746634	Listeria mono
FDA1213377-C001-002	Listeria grayi
FDA933376-S060-005	Listeria innocua
FDA1213835-C001-001	Salmonella

Lab confirmed IDs for 6 isolates

A	B	C	D	E	F	G	H	I	J	K	L	M	N	O	P	Q	R	S
File	Contigs	Length	EstCov	N50	Median Insert	Mean Length_R1	Mean Length_R2	Mean Q_R1	Mean Q_R2	Scheme
FDA1216271-C001-001	16	2911949	36.7	476210	321	148.4	148.4	36.4	34.6	listeria_2	5	abcZ(2)	bglA(1)	cat(11)	dapE(3)	dat(3)	ldh(1)	lhkA(7)
FDA817806-S073-001	20	3068354	179.6	525438	329	234.7	235.2	36.7	31.9	listeria_2	321	abcZ(5)	bglA(6)	cat(8)	dapE(62)	dat(6)	ldh(7)	lhkA(34)
FDA746634	30	3052888	41.4	293947	320	148.4	148.4	36.5	36	listeria_2	-	abcZ(2)	bglA(1)	cat(11)	dapE(3)	dat(3)	ldh(1)	lhkA(~7)
FDA1213377-C001-002	20	2672180	155.1	473181	270	147.3	147.3	37.2	36.1	-	-
FDA933376-S060-005	9	2881869	213	1498790	303	232.1	232.2	37	36.2	listeria_2	1489	abcZ(250)	bglA(21)	cat(83)	dapE(298)	dat(20)	ldh(458)	lhkA(216)
FDA1213835-C001-001	37	4832365	34.4	294936	354	149	149	36.6	35.7	senterica_achtman_2	214	aroC(14)	dnaN(72)	hemD(21)	hisD(12)	purE(6)	sucA(19)	thrA(15)

MicroRunQC example report showing mlst ST results for different Listeria species.

The listeria database includes multiple species, including Listeria monocytogenes and L. innocua . If users want to investigate which Listeria DB corresponds to the resulting ST types, they can query the Institut Pasteur mlst database:

Example : query Listeria ST type here: https://bigsdb.pasteur.fr/cgi-bin/bigsdb/bigsdb.pl?db=pubmlst_listeria_seqdef&page=query

5.4.

Quality control threshold guidelines for the GenomeTrakr surveillance network. These are also relevant for NARMS and VetLIRN contributors.

*MicroRunQC users should follow QC threshold guidelines established by their respective surveillance coordinating body(s).

A	B	C	D	E	F	G	H	I	J
Quality metric	Salmonella	Listeria	E. coli	Shigella	Campylobacter	Vibrio para.	Cronobacter	Enterococcus faecium	Enterococcus faecalis
Average read quality Q score for R1 and R2	>=30	>=30	>=30	>=30	>=30	>=30	>=30	>=30	>=30
Average coverage	>=30X	>=20X	>=40X	>=40X	>=20X	>=40X	>=20X	>=50X	>=40X
De novo assembly: Seq. length (Mbp)	~4.3-5.2	~2.7-3.2	~4.5-5.9	~4.0-5.0	~1.5-1.9	~4.8-5.5	~4-5	~2.5-3.5	~2.5-3.25
De novo assembly: no. contigs	<=300	<=300	<=400	<=550	<=300	<=300	<=500	<=350	<=200