Biomarker's detection for diseases associated with metabolic disorder syndrome

Cosme E. Santiesteban Toca, Denisse Chacón, Alejandro Rojo Moreno, Saide Lizeth Medrano González, Leyla Escalante Gonzalez

Published: 2024-04-10 DOI: 10.17504/protocols.io.rm7vzxnb5gx1/v1

Abstract

Metabolic syndrome (MetS) is known to substantially reduce quality of life. MetS is associated with a high incidence of non-communicable diseases such as type 2 diabetes mellitus, cardiovascular diseases, and cancer, among others. Many studies focus on the early diagnosis of MetS and its possible progression in the patient on the basis of gene expression and clinical parameters.

We, however, are interested in supporting the clinical diagnosis and prognosis of MetS-associated diseases based on the gut microbiota. That is, we take into account the set of microorganisms (bacteria, fungi, archaea, viruses, and parasites) that reside in the intestine, given their relationship with diseases such as obesity and type 2 diabetes, as well as their influence on glycemic control.

Beyond traditional diagnostic methods, Machine Learning (ML) can iteratively learn non-linear interactions from large amounts of data. This is made possible by computer algorithms that are already being applied in various fields, including the evaluation and prediction of disease risk.

Analysis of the genes belonging to the intestinal microbiota would allow the identification of excretory proteins with biomarker potential for the diagnosis and prognosis of diabetes and metabolic syndrome using supervised Machine Learning algorithms. For this reason, this project seeks to create a “pipeline” of classification algorithms (a set of concatenated software tools) for data mining and analysis that allows predicting the onset of type 2 diabetes and the progression of its complications based on the gut microbiota.

Before start

To facilitate bioinformatics processing at each step, a set of specific tools is required:

Pipeline of bioinformatics tools

Steps

Download public databases

1.

SRA (Sequence Read Archive) is the standard format in which all NGS data is uploaded to NCBI. To download SRA files and convert them into FASTQ, install the SRA Toolkit.

Software

Name: SRA-Toolkit
Repository: GitHub
Developer: NCBI
Link: https://github.com/ncbi/sra-tools
Version: 3.0.5
2.

Prepare the SRA-Toolkit workspace. Run this step from the destination folder.

#Download space
vdb-config --prefetch-to-cwd
vdb-config --interactive
3.

Access and download the public data. In this case, a human gut metagenome dataset hosted on Amazon S3 was used. "The gut microbiome related effect of Berberine and probiotics in treating Type 2 Diabetes" (NCBI accession number PRJNA643353) comprises 1192 datasets (4-12 GB each), numbered consecutively from SAMN15421765 to SAMN1522956. The data were obtained from a randomized, double-blind, placebo-controlled trial on newly diagnosed type 2 diabetes patients from 20 centers in China, in which 409 patients were randomly assigned to receive berberine (BBR), probiotics with BBR, probiotics, or placebo for 3 months.

3.1.

Use the prefetch command followed by the run number of the desired experiment to download the data through SRA Toolkit and create a folder containing the archive in .sra format. In this case, the BioSample SAMN15421765 is used, whose run number is SRR12234739.

#prefetch 
prefetch SRR12234739
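
Because the project contains many runs, fetching them one command at a time can become tedious. The following is a minimal sketch, not part of the original protocol, that loops over a list of run accessions. The file name runs.txt and the PREFETCH_CMD variable are hypothetical; the variable exists only so that the loop can be dry-run with echo.

```shell
# Run prefetch for every accession listed (one per line) in the given file.
# Blank lines and lines starting with '#' are skipped.
# PREFETCH_CMD is a hypothetical override, e.g. PREFETCH_CMD=echo for a dry run.
prefetch_all() {
    while IFS= read -r run || [ -n "$run" ]; do
        case "$run" in
            ''|'#'*) continue ;;
        esac
        ${PREFETCH_CMD:-prefetch} "$run" || echo "failed: $run" >&2
    done < "$1"
}
```

For example, setting PREFETCH_CMD=echo and calling prefetch_all runs.txt prints the accessions that would be fetched without downloading anything.
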
4.

Extract FASTQ files from the SRA accession with fasterq-dump.

#fasterq-dump 
fasterq-dump SRR12234739 --split-files --skip-technical
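
With --split-files, paired-end runs produce one file per mate (_1.fastq and _2.fastq). A quick sanity check, sketched here as an optional addition, is to confirm that both files hold the same number of reads. The function names are hypothetical, and the count assumes the usual four-lines-per-record FASTQ layout with no wrapped sequence lines.

```shell
# Count reads in a FASTQ file, assuming 4 lines per record.
count_fastq_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Print both counts and succeed only when the two mate files agree.
check_pairs() {
    n1=$(count_fastq_reads "$1")
    n2=$(count_fastq_reads "$2")
    echo "forward=$n1 reverse=$n2"
    [ "$n1" -eq "$n2" ]
}
```

It could be called as check_pairs SRR12234739_1.fastq SRR12234739_2.fastq; a non-zero exit status signals a mismatch.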

Quality control

5.

NGS data can be affected by multiple factors during library preparation or the sequencing process, which can negatively impact the quality of the raw data. To perform quality control of the downloaded raw data, install FASTX-Toolkit.

Software

Name: FASTX-Toolkit
Repository: Hannon Lab
Developer: Hannon Lab
Link: http://hannonlab.cshl.edu/fastx_toolkit/download.html
6.

Clean the sequences based on quality and length. Since there is no established consensus on the values these parameters should take, a value of at least 30 is assumed for both to identify good sequences.

#fastq_quality_trimmer (forward reads)
fastq_quality_trimmer -t 30 -l 30 -v -i "SRR12234739_1.fastq" -o "SRR12234739_1_trimmed.fastq"
#fastq_quality_trimmer (reverse reads)
fastq_quality_trimmer -t 30 -l 30 -v -i "SRR12234739_2.fastq" -o "SRR12234739_2_trimmed.fastq"
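
To make the quality threshold concrete: quality scores are on the Phred scale, where Q = -10 log10(p) and p is the probability that a base call is wrong. The small helper below is an illustrative sketch only (the function name is hypothetical) that converts a Phred score back into an error probability.

```shell
# Convert a Phred quality score Q into the base-call error probability
# p = 10^(-Q/10).
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}
```

Under this definition, a threshold of 30 corresponds to an expected error rate of 1 in 1000 base calls, while a threshold of 20 would allow 1 in 100.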

Genome Assembly

7.

Sequence reads from NGS are short genetic sequences, much shorter than genomes and even genes. Thus, these short sequences must be assembled into longer sequences (contigs). To perform the genome assembly of the reads, install SPAdes.

Software

Name: SPAdes
Repository: CAB
Developer: Center for Algorithmic Biotechnology
Link: https://cab.spbu.ru/software/spades/
Version: 3.15.4
8.

Pass the files containing the forward and reverse reads using -1 and -2, respectively.

#Assembly 
spades.py -t 40 -m 160 -1 "SRR12234739_1.fastq" -2 "SRR12234739_2.fastq" --only-assembler -o ensemble
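
SPAdes writes the assembled contigs to contigs.fasta inside the output folder. As an optional check that is not part of the original protocol, the sketch below summarizes contig lengths and computes the N50 (the length L such that contigs of length L or longer cover at least half of the assembly). The function names are hypothetical.

```shell
# Print the length of every sequence in a (possibly line-wrapped) FASTA file.
fasta_lengths() {
    awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }
         { len += length($0) }
         END { if (seen) print len }' "$1"
}

# Print the N50 of a FASTA file: walk the contigs from longest to shortest
# and report the length at which the running total reaches half the assembly.
n50() {
    fasta_lengths "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                running += lens[i]
                if (running >= half) { print lens[i]; exit }
            }
        }'
}
```

It could be called as n50 ensemble/contigs.fasta once the assembly has finished.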

Genome alignment

9.

After the assembly, a reference genome is used to further piece together the sequenced data. Install BLAST for Unix to perform the genome alignment.

Software

Name: BLAST
Repository: NCBI
Developer: NCBI
Link: https://www.ncbi.nlm.nih.gov/books/NBK52640/
10.

Download the reference databases (if necessary)

#Database download 
perl ../bin/update_blastdb.pl --passive --decompress 16S_ribosomal_RNA
11.

Run BLAST for nucleotide alignment.

#blastn 
blastn -query "SRR12234739_contigs.fasta" -db "${DB}" -out "SRR12234739_result.out" -outfmt 6 -num_threads 40 &
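
With -outfmt 6, blastn writes a tab-separated table whose default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, and bitscore. A contig often has several hits, so a common follow-up, sketched here as an assumption rather than a step of the original protocol, is to keep only the highest-scoring hit per contig. The function name is hypothetical.

```shell
# Keep, for each query (column 1), the row with the highest bit score
# (column 12) from a BLAST tabular (-outfmt 6) file. Queries are printed
# in the order in which they first appear.
best_hits() {
    awk -F '\t' '
        !($1 in score)      { order[++n] = $1; score[$1] = $12 + 0; best[$1] = $0; next }
        $12 + 0 > score[$1] { score[$1] = $12 + 0; best[$1] = $0 }
        END { for (i = 1; i <= n; i++) print best[order[i]] }' "$1"
}
```

It could be called as best_hits SRR12234739_result.out > SRR12234739_best.out, where the output file name is only an example.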

Functional annotation

12.

Now it is necessary to determine the biological function of the sequenced data. Install Prokka to perform functional annotation of the data.

Software

Name: Prokka
Repository: GitHub
Developer: Torsten Seemann
Link: https://github.com/tseemann/prokka
13.

Perform functional annotation

#prokka 
prokka contigs.fasta --addgenes --mincontiglen 200 --centre Prokka --kingdom Bacteria --gcode 10 --evalue 1e-06 --cpus 0
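
Among its output files, Prokka writes a tab-separated feature table (.tsv) with a header row. As a hedged sketch, assuming the feature type (for example CDS, tRNA, rRNA) is in the second column of that table, the annotated features can be tallied by type as follows. The function name is hypothetical, and the column position should be checked against the header of the actual file.

```shell
# Tally annotated features by type from a Prokka-style TSV table.
# Assumes a header row and the feature type in column 2.
count_feature_types() {
    awk -F '\t' 'NR > 1 { count[$2]++ }
                 END { for (t in count) print t "\t" count[t] }' "$1" | sort
}
```

The result is one line per feature type followed by its count, which gives a quick overview of how many coding sequences are available for downstream biomarker analysis.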
