Biomarker detection for diseases associated with metabolic disorder syndrome
Cosme E. Santiesteban Toca, Denisse Chacón, Alejandro Rojo Moreno, Saide Lizeth Medrano González, Leyla Escalante Gonzalez
Machine learning
Biomarkers
Diabetes mellitus
Diagnosis and prognosis of diabetes
Genome assembly
Gene expression
Identification of genes
Functional annotation
Taxonomic annotation
Metabolic syndrome
Gut microbiota.
Abstract
Metabolic syndrome (MetS) is known to substantially reduce quality of life. MetS is associated with a high incidence of non-communicable diseases such as type 2 diabetes mellitus, cardiovascular disease, and cancer, among others. Many studies focus on the early diagnosis of MetS and its possible progression in the patient on the basis of gene expression and clinical parameters.
We, however, are interested in supporting the clinical diagnosis and prognosis of MetS-associated diseases based on the gut microbiota. That is, we take into account the set of microorganisms (bacteria, fungi, archaea, viruses, and parasites) that reside in the intestine, given their relationship with diseases such as obesity and type 2 diabetes, as well as their influence on glycemic control.
Beyond traditional diagnostic methods, Machine Learning (ML) algorithms can iteratively learn non-linear interactions from large amounts of data. Such algorithms are already being applied in various fields, including the evaluation and prediction of disease risk.
Analyzing the genes of the intestinal microbiota would allow the identification of excretory proteins with biomarker potential for the diagnosis and prognosis of diabetes and metabolic syndrome using supervised Machine Learning algorithms. For this reason, this project seeks to create a “pipeline” of classification algorithms (a set of chained software tools) for data mining and analysis that can predict the onset of type 2 diabetes and the progression of its complications based on the gut microbiota.
Before starting
Steps
Download public databases
SRA (Sequence Read Archive) is the standard format in which all NGS data are uploaded to NCBI. To download SRA files and convert them into FASTQ, download the SRA Toolkit.
Software
Value | Label |
---|---|
SRA-Toolkit | NAME |
Github | REPOSITORY |
NCBI | DEVELOPER |
https://github.com/ncbi/sra-tools | LINK |
3.0.5 | VERSION |
Prepare the SRA-Toolkit workspace. Run this step from inside the destination folder.
#Download space
vdb-config --prefetch-to-cwd
vdb-config --interactive
Access and download public databases. In this case, a human gut metagenome database hosted on Amazon S3 was used. "The gut microbiome related effect of Berberine and probiotics in treating Type 2 Diabetes" (NCBI accession number PRJNA643353) is a database with 1192 datasets (4-12 GB each), numbered consecutively from SAMN15421765 to SAMN1522956. The data were obtained from a randomized, double-blind, placebo-controlled trial of newly diagnosed type 2 diabetes patients from 20 centers in China, in which 409 patients were randomly assigned to receive berberine (BBR), probiotics with BBR, probiotics, or placebo for 3 months.
Use the prefetch command followed by the run number of the desired experiment to download the archive in .sra format into a new folder through the SRA Toolkit. In this case, BioSample SAMN15421765 is used, whose run number is SRR12234739.
#prefetch
prefetch SRR12234739
Extract FASTQ files from the SRA accession with fasterq-dump.
#fasterq-dump
fasterq-dump SRR12234739 --split-files --skip-technical
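The project holds many runs, so the prefetch and fasterq-dump commands above can be wrapped in a loop. The following is a minimal sketch, assuming a text file (hypothetical name runs.txt) with one run accession per line. The function name fetch_runs is illustrative and is not part of the SRA Toolkit.

```shell
# Sketch: download and convert every run listed in a text file.
# Assumption: the list file (hypothetical name: runs.txt) holds one
# run accession (e.g. SRR12234739) per line.
fetch_runs() {
    list_file="$1"
    # Read the list on file descriptor 3 so the tools cannot consume it via stdin.
    while IFS= read -r run <&3 || [ -n "$run" ]; do
        # Skip blank lines and comment lines.
        case "$run" in
            ''|'#'*) continue ;;
        esac
        prefetch "$run" ||
            { echo "prefetch failed for $run" >&2; return 1; }
        fasterq-dump "$run" --split-files --skip-technical ||
            { echo "fasterq-dump failed for $run" >&2; return 1; }
    done 3< "$list_file"
}

# Usage (from the destination folder):
#   fetch_runs runs.txt
```

The loop stops at the first failed download so that an incomplete run is noticed before the later steps are started.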
Quality control
NGS data can be affected by problems during library preparation or sequencing, which can reduce the quality of the raw data. To perform quality control of the downloaded raw data, download the FASTX-Toolkit.
Software
Value | Label |
---|---|
FASTX-Toolkit | NAME |
Hannon Lab | REPOSITORY |
Hannon Lab | DEVELOPER |
http://hannonlab.cshl.edu/fastx_toolkit/download.html | LINK |
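Clean the sequences based on quality and size. Since there is no established consensus on the values these parameters should have, a threshold of 30 is assumed for both: bases with a quality score below 30 are trimmed from the end of each read (-t 30), and reads shorter than 30 bases after trimming are discarded (-l 30).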
#fastq_quality_trimmer
fastq_quality_trimmer -t 30 -l 30 -v -i "SRR12234739_1.fastq" -o "SRR12234739_1_trimmed.fastq"
#fastq_quality_trimmer
fastq_quality_trimmer -t 30 -l 30 -v -i "SRR12234739_2.fastq" -o "SRR12234739_2_trimmed.fastq"
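Because each mate file is trimmed independently, the two outputs can end up with different read counts. The following is a minimal sketch for checking how many reads survived trimming, assuming the standard four-line FASTQ record layout. The helper names count_reads and report_trimming are illustrative and are not part of the FASTX-Toolkit.

```shell
# Count the reads in a FASTQ file.
# Assumption: every record occupies exactly four lines.
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Report how many reads of the raw file ($1) remain in the trimmed file ($2).
report_trimming() {
    raw=$(count_reads "$1")
    kept=$(count_reads "$2")
    echo "$2: kept $kept of $raw reads"
}

# Usage:
#   report_trimming SRR12234739_1.fastq SRR12234739_1_trimmed.fastq
#   report_trimming SRR12234739_2.fastq SRR12234739_2_trimmed.fastq
```

If the two trimmed files report different counts, the mates are no longer in step, which matters for any downstream tool that expects paired input.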
Genome Assembly
Sequence reads from NGS are short genetic sequences, much shorter than genomes and even genes. Thus, these short sequences must be assembled into longer sequences (contigs). To assemble the reads, download SPAdes.
Software
Value | Label |
---|---|
SPAdes | NAME |
CAB | REPOSITORY |
Center for Algorithmic Biotechnology | DEVELOPER |
https://cab.spbu.ru/software/spades/ | LINK |
3.15.4 | VERSION |
Pass the files with the forward and reverse reads using -1 and -2, respectively.
#Assembly
spades.py -t 40 -m 160 -1 "SRR12234739_1.fastq" -2 "SRR12234739_2.fastq" --only-assembler -o ensemble
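SPAdes writes the assembled contigs to contigs.fasta inside the output folder. The following is a minimal sketch for summarizing an assembly (number of contigs, total length, and N50), using only awk and sort. The function name assembly_stats is illustrative and is not part of SPAdes.

```shell
# Print contig count, total length and N50 for a FASTA file.
assembly_stats() {
    # Step 1: emit one length per contig (sequences may span several lines).
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    # Step 2: order the lengths from longest to shortest.
    sort -rn |
    # Step 3: N50 is the length of the contig at which the running sum
    # first reaches half of the total assembly length.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             half = total / 2
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run >= half) { n50 = lens[i]; break }
             }
             printf "contigs=%d total_bp=%d N50=%d\n", NR, total, n50
         }'
}

# Usage:
#   assembly_stats ensemble/contigs.fasta
```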
Genome alignment
After assembly, the contigs are aligned against reference sequences to further characterize the sequenced data. Install the BLAST setup for Unix to perform the genome alignment.
Software
Value | Label |
---|---|
BLAST | NAME |
NCBI | REPOSITORY |
NCBI | DEVELOPER |
https://www.ncbi.nlm.nih.gov/books/NBK52640/ | LINK |
Download the reference databases (if necessary)
#Database download
perl ../bin/update_blastdb.pl --passive --decompress 16S_ribosomal_RNA
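The blastn call in the next step reads the database name from the shell variable DB, which the protocol does not define. The following is a minimal sketch of one way to set it, assuming the 16S_ribosomal_RNA archive was decompressed into the current folder. The helper db_ready is illustrative and is not part of BLAST.

```shell
# Assumption: the database archive was decompressed into the current folder,
# so the database name doubles as its path prefix.
DB="16S_ribosomal_RNA"

# Succeed if at least one nucleotide database file (e.g. .nhr, .nin, .nsq)
# exists for the given prefix.
db_ready() {
    ls "$1".n* >/dev/null 2>&1
}

# Usage:
#   db_ready "$DB" || echo "database $DB not found in $(pwd)" >&2
```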
Run BLAST for nucleotide alignment.
#blastn
blastn -query "SRR12234739_contigs.fasta" -db "${DB}" -out "SRR12234739_result.out" -outfmt 6 -num_threads 40 &
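With -outfmt 6, blastn writes one tab-separated line per hit, and a contig usually has several hits. The following is a minimal sketch that keeps only the highest-scoring hit per contig, assuming the default column order of the tabular format. The function name best_hits is illustrative and is not part of BLAST.

```shell
# Keep the highest-bitscore hit for each query in a tabular BLAST result.
# Default -outfmt 6 columns:
#   qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
# Output columns: qseqid, sseqid, pident, bitscore (queries in first-seen order).
best_hits() {
    awk -F '\t' '
        # First hit for this query, or a hit with a higher bitscore than the stored one.
        !($1 in best) || ($12 + 0) > (best[$1] + 0) {
            if (!($1 in best)) order[++n] = $1
            best[$1] = $12
            line[$1] = $1 "\t" $2 "\t" $3 "\t" $12
        }
        END { for (i = 1; i <= n; i++) print line[order[i]] }
    ' "$1"
}

# Usage (after the background blastn job has finished, e.g. after `wait`):
#   best_hits SRR12234739_result.out > SRR12234739_best_hits.tsv
```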
Functional annotation
Now it is necessary to determine the biological function of the sequenced data. Install Prokka to perform functional annotation of the data.
Software
Value | Label |
---|---|
Prokka | NAME |
Github | REPOSITORY |
Torsten Seemann | DEVELOPER |
https://github.com/tseemann/prokka | LINK |
Perform functional annotation
#prokka
prokka contigs.fasta --addgenes --mincontiglen 200 --centre Prokka --kingdom Bacteria --gcode 10 --evalue 1e-06 --cpus 0
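Prokka writes its annotations to a GFF3 file, followed by the contig sequences after a ##FASTA line. The following is a minimal sketch for counting the annotated features by type (for example CDS or tRNA). The function name count_features is illustrative and is not part of Prokka, and the path in the usage line is a placeholder for the actual output file.

```shell
# Count annotated features per type in a GFF3 file.
count_features() {
    awk -F '\t' '
        /^##FASTA/ { exit }        # annotations end where the sequences begin
        /^#/ { next }              # skip header and comment lines
        NF >= 9 { n[$3]++ }        # column 3 holds the feature type
        END { for (t in n) print t "\t" n[t] }
    ' "$1" | sort
}

# Usage (placeholder path, replace with the GFF file written by Prokka):
#   count_features path/to/prokka_output.gff
```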