Biomarker's detection for diseases associated with metabolic disorder syndrome

Cosme E. Santiesteban Toca, Denisse Chacón, Alejandro Rojo Moreno, Saide Lizeth Medrano González, Leyla Escalante Gonzalez

Published: 2024-04-10 DOI: 10.17504/protocols.io.rm7vzxnb5gx1/v1

Diagnosis and prognosis of diabetes

Abstract

The metabolic syndrome (MetS) is known to substantially reduce quality of life. MetS is associated with a high incidence of non-communicable diseases such as type 2 diabetes mellitus, cardiovascular disease, and cancer, among others. Many studies focus on the early diagnosis of MetS and its possible evolution in the patient on the basis of gene expression and clinical parameters.

However, we are interested in supporting the clinical diagnosis and prognosis of MetS-associated diseases based on the gut microbiota. That is, we take into account the set of microorganisms (bacteria, fungi, archaea, viruses, and parasites) that reside in the intestine, given their relationship with diseases such as obesity and type 2 diabetes, as well as their influence on glycemic control.

Beyond traditional diagnostic methods, Machine Learning (ML) algorithms can iteratively learn non-linear interactions from large amounts of data. These algorithms are already being applied in various fields, including the evaluation and prediction of disease risk.

Analysis of the genes belonging to the intestinal microbiota would allow the identification of excretory proteins with biomarker potential for the diagnosis and prognosis of diabetes and metabolic syndrome using supervised Machine Learning algorithms. For this reason, this project seeks to create a “pipeline” of classification algorithms (a set of chained software tools) for data mining and analysis that predicts the onset of type 2 diabetes and the progression of its complications based on the gut microbiota.

Before start

To facilitate bioinformatics processing at each step of the process, a set of specific tools is necessary: SRA Toolkit, FASTX-Toolkit, SPAdes, BLAST, and Prokka.

Steps

Download public databases

SRA (Sequence Read Archive) is the standard format in which all NGS data are uploaded to NCBI. To download SRA files and convert them to FASTQ, install the SRA Toolkit.

Software

Value	Label
SRA-Toolkit	NAME
Github	REPOSITORY
NCBI	DEVELOPER
https://github.com/ncbi/sra-tools	LINK
3.0.5	VERSION

Prepare the SRA Toolkit workspace. Run these commands from the destination folder, because --prefetch-to-cwd makes prefetch download into the current working directory.

#Download space
vdb-config --prefetch-to-cwd

vdb-config --interactive

Access and download public databases. In this case, a human gut metagenome database hosted on Amazon S3 was used. "The gut microbiome related effect of Berberine and probiotics in treating Type 2 Diabetes" (NCBI accession number PRJNA643353) contains 1192 datasets (4-12 GB each), with BioSample accessions in consecutive order from SAMN15421765 to SAMN1522956. The data were obtained from a randomized, double-blind, placebo-controlled trial of newly diagnosed type 2 diabetes patients from 20 centers in China, in which 409 patients were randomly assigned to receive berberine (BBR), probiotics with BBR, probiotics, or placebo for 3 months.

Dataset

Human gut metagenome database https://www.ncbi.nlm.nih.gov/bioproject/PRJNA643353

Use the prefetch command followed by the run accession of the desired experiment to download the archive in .sra format into a new folder through the SRA Toolkit. In this case, BioSample SAMN15421765 is used, whose run accession is SRR12234739.

#prefetch 
prefetch SRR12234739

Extract FASTQ files from the SRA accession with fasterq-dump.

#fasterq-dump 
fasterq-dump SRR12234739 --split-files --skip-technical
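Since the project contains many runs, the two commands above can be wrapped in a small loop. The following is a minimal sketch, not part of the original protocol: the helper names run_cmd, fetch_run, and fetch_all and the DRY_RUN variable are hypothetical, and only the prefetch and fasterq-dump invocations come from the steps above.

```shell
# Sketch only: run_cmd, fetch_run, fetch_all, and DRY_RUN are hypothetical names.
# With DRY_RUN=1 (the default here) commands are printed instead of executed,
# so the plan can be reviewed before several gigabytes are downloaded.
run_cmd() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Download one run and convert it to FASTQ,
# using the same two commands shown in the steps above.
fetch_run() {
    run_cmd prefetch "$1" &&
    run_cmd fasterq-dump "$1" --split-files --skip-technical
}

# Process every run accession passed as an argument, stopping at the first failure.
fetch_all() {
    for run in "$@"; do
        fetch_run "$run" || return 1
    done
}

fetch_all SRR12234739
```

Setting DRY_RUN=0 before calling fetch_all executes the commands instead of printing them; further run accessions can be appended as extra arguments.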

Quality control

NGS data can be affected by multiple problems during library preparation or the sequencing process, which can reduce the quality of the raw data. To perform quality control of the downloaded raw data, install the FASTX-Toolkit.

Software

Value	Label
FASTX-Toolkit	NAME
Hannon Lab	REPOSITORY
Hannon Lab	DEVELOPER
http://hannonlab.cshl.edu/fastx_toolkit/download.html	LINK

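Clean the sequences based on quality and length. Since there is no established consensus on the values these parameters should have, a threshold of 30 is assumed for both: bases with a quality score below 30 are trimmed from the end of each read (-t 30), and reads shorter than 30 nt after trimming are discarded (-l 30).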

#fastq_quality_trimmer
fastq_quality_trimmer -t 30 -l 30 -v -i "SRR12234739_1.fastq" -o "SRR12234739_1_trimmed.fastq"

#fastq_quality_trimmer
fastq_quality_trimmer -t 30 -l 30 -v -i "SRR12234739_2.fastq" -o "SRR12234739_2_trimmed.fastq"
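To make the threshold concrete and to check how many reads survive trimming, a short sketch along these lines may help. It is not part of the original protocol: phred_to_error and count_reads are hypothetical helper names, and count_reads assumes uncompressed FASTQ with exactly four lines per record.

```shell
# Sketch only: phred_to_error and count_reads are hypothetical helper names.

# Convert a Phred quality score Q into the base-call error probability 10^(-Q/10).
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Count reads in an uncompressed FASTQ file, assuming four lines per record.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Error probability that corresponds to the -t 30 threshold.
phred_to_error 30

# Compare read counts before and after trimming, if the files are present.
for f in SRR12234739_1.fastq SRR12234739_1_trimmed.fastq; do
    if [ -f "$f" ]; then
        echo "$f: $(count_reads "$f") reads"
    fi
done
```

Because each mate file is trimmed independently, the read counts of the two trimmed files can differ; paired-end tools generally expect both files to contain the same reads in the same order.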

Genome Assembly

Sequence reads from NGS consist of small genetic sequences much shorter than genomes and even genes. Thus, the assembly of these short sequences into larger sequences (contigs) is necessary. To perform the genome assembly of the reads, install SPAdes.

Software

Value	Label
SPAdes	NAME
CAB	REPOSITORY
Center for Algorithmic Biotechnology	DEVELOPER
https://cab.spbu.ru/software/spades/	LINK
3.15.4	VERSION

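Pass the files with forward and reverse reads using -1 and -2, respectively. Here -t sets the number of threads, -m the memory limit in GB, and --only-assembler skips read error correction.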

#Assembly 
spades.py -t 40 -m 160 -1 "SRR12234739_1.fastq" -2 "SRR12234739_2.fastq" --only-assembler -o ensemble
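A quick sanity check on the assembly is to count the contigs and sum their lengths. The sketch below is an illustration rather than part of the original protocol: contig_stats is a hypothetical helper name, and the path assumes that SPAdes wrote contigs.fasta into the ensemble directory given with -o.

```shell
# Sketch only: contig_stats is a hypothetical helper name.
# Print the number of contigs and their total length for a FASTA file.
contig_stats() {
    awk '/^>/ { n++; next }
         { bp += length($0) }
         END { printf "contigs=%d total_bp=%d\n", n, bp }' "$1"
}

# SPAdes writes contigs.fasta inside the directory given with -o.
if [ -f ensemble/contigs.fasta ]; then
    contig_stats ensemble/contigs.fasta
fi
```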

Genome alignment

After the assembly, reference sequences are used to further characterize the sequenced data by aligning the contigs against a reference database. Install the BLAST command-line applications for Unix to perform the alignment.

Software

Value	Label
BLAST	NAME
NCBI	REPOSITORY
NCBI	DEVELOPER
https://www.ncbi.nlm.nih.gov/books/NBK52640/	LINK

Download the reference databases (if necessary)

#Database download 
perl ../bin/update_blastdb.pl --passive --decompress 16S_ribosomal_RNA

Run BLAST for nucleotide alignment

#blastn
# DB: name of the BLAST database to search (for example, 16S_ribosomal_RNA from the previous step)
blastn -query "SRR12234739_contigs.fasta" -db "${DB}" -out "SRR12234739_result.out" -outfmt 6 -num_threads 40 &
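The tabular output produced by -outfmt 6 has twelve tab-separated columns and usually lists several hits per contig. As an illustration only (best_hits is a hypothetical helper, not part of the original protocol), the following sketch keeps the highest-scoring hit for each query:

```shell
# Sketch only: best_hits is a hypothetical helper name.
# Default -outfmt 6 columns: qseqid sseqid pident length mismatch gapopen
# qstart qend sstart send evalue bitscore.
# Keep, for each query (column 1), the row with the highest bit score (column 12).
best_hits() {
    awk -F'\t' '
        NF >= 12 && (!($1 in best) || $12 + 0 > score[$1]) {
            best[$1] = $0
            score[$1] = $12 + 0
        }
        END { for (q in best) print best[q] }' "$1" | sort
}

# Show query, subject, percent identity, and bit score of each best hit.
if [ -f SRR12234739_result.out ]; then
    best_hits SRR12234739_result.out | cut -f1,2,3,12
fi
```

Ranking by bit score is one common choice; ranking by e-value (column 11) would be an equally reasonable alternative.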

Functional annotation

Now it is necessary to determine the biological function of the sequenced data. Install Prokka to perform functional annotation of the data.

Software

Value	Label
Prokka	NAME
Github	REPOSITORY
Torsten Seemann	DEVELOPER
https://github.com/tseemann/prokka	LINK

Perform functional annotation

#prokka 
prokka contigs.fasta --addgenes --mincontiglen 200 --centre Prokka --kingdom Bacteria --gcode 10 --evalue 1e-06 --cpus 0
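Once annotation finishes, the feature table can be summarized to see what was found. The sketch below is an illustration, not part of the original protocol: count_ftypes and count_hypothetical are hypothetical helper names, and the code assumes that the Prokka .tsv file has a header row, the feature type (CDS, rRNA, tRNA, ...) in column 2, and the product description in the last column, which should be checked against the installed Prokka version.

```shell
# Sketch only: count_ftypes and count_hypothetical are hypothetical helper names.
# Assumes a tab-separated file with a header row, the feature type in column 2,
# and the product description in the last column.

# Count annotated features per feature type.
count_ftypes() {
    awk -F'\t' 'NR > 1 { c[$2]++ }
                END { for (t in c) printf "%s\t%d\n", t, c[t] }' "$1" | sort
}

# Count features whose product is "hypothetical protein".
count_hypothetical() {
    awk -F'\t' 'NR > 1 && $NF == "hypothetical protein" { n++ }
                END { print n + 0 }' "$1"
}

# Summarize every .tsv file found in Prokka output folders, if any exist.
for tsv in PROKKA_*/*.tsv; do
    if [ -f "$tsv" ]; then
        echo "== $tsv"
        count_ftypes "$tsv"
        echo "hypothetical proteins: $(count_hypothetical "$tsv")"
    fi
done
```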
