Genome assembly (Nanopore and Illumina reads)
Thiago Mafra Batista, Rafael Rodrigues Ferrari
Abstract
This protocol offers detailed, step-by-step instructions for students and researchers to assemble nuclear genomes using long reads generated by Nanopore technology. Before assembling the genome, we will align the reads against a bacterial genome database to eliminate potential contamination. The assembled contigs will then be polished using Illumina short reads.
Steps
SEQUENCING QUALITY CHECK
LongQC (https://github.com/yfukasawa/LongQC)
Prepare a .pbs file to run the analysis remotely on Sagarana
python /home/fafinha/bin/LongQC/longQC.py sampleqc -x ont-ligation -c /tmp/LongQC_run/reads_trim.fq \
-p 64 -o /tmp/LongQC_run /home/fafinha/colletes_collaris/reads/genomic_reads/longreads_rawdata_collaris.fq
mv /tmp/LongQC_run/ /home/fafinha/colletes_collaris/
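The protocol asks for a .pbs file without showing one. The following is a minimal sketch of how such a file could be written for the LongQC command above. The job name, resource request, and walltime are assumptions (Sagarana's actual queue settings are not given in this protocol), and the `-l nodes=...:ppn=...` form is the Torque syntax, which PBS Pro replaces with `-l select=...:ncpus=...`.

```shell
# Write a PBS job script for the LongQC step (sketch; adjust to your cluster).
# The #PBS values below are illustrative assumptions, not Sagarana's settings.
cat > longqc.pbs <<'EOF'
#!/bin/bash
#PBS -N longqc_collaris
#PBS -l nodes=1:ppn=64
#PBS -l walltime=24:00:00
#PBS -j oe

# Start in the directory from which the job was submitted
cd "$PBS_O_WORKDIR"

python /home/fafinha/bin/LongQC/longQC.py sampleqc -x ont-ligation \
    -c /tmp/LongQC_run/reads_trim.fq -p 64 -o /tmp/LongQC_run \
    /home/fafinha/colletes_collaris/reads/genomic_reads/longreads_rawdata_collaris.fq
mv /tmp/LongQC_run/ /home/fafinha/colletes_collaris/
EOF

# Submit with: qsub longqc.pbs
```

The same skeleton can probably be reused for the other steps that are run on Sagarana by swapping the command and the number of cores requested.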
CROSS-SPECIES CONTAMINATION FILTERING
Magic-BLAST (https://ncbi.github.io/magicblast/)
Index the database
$~/bin/ncbi-magicblast-1.7.0/bin/makeblastdb -in refseq_release_215_bacteria.fna -dbtype nucl
***ONT whole-genome sequencing***
**Prepare a .pbs file to run the analysis remotely on Sagarana**
magicblast -db /databases/ref_prok_rep_genomes_out20/ref_prok_rep_genomes \
-query /home/fafinha/collaris/reads/genomic_reads/ONT_longreads_rawdata_collaris.fq \
-out_unaligned ONT_longreads_unaligned_in_refseq_prok_collaris.fa -num_threads 80 -infmt fastq -unaligned_fmt fasta > output.sam
**Prepare a .pbs file to run the analysis remotely on Sagarana**
magicblast -db /databases/ref_prok_rep_genomes_out20/ref_prok_rep_genomes -query /home/fafinha/collaris/reads/genomic_reads/Illumina_shortreads.R1.fastq \
-query_mate /home/fafinha/collaris/reads/genomic_reads/Illumina_shortreads.R2.fastq \
-paired -no_discordant -infmt fastq -unaligned_fmt sam -num_threads 128 \
-out_unaligned /home/fafinha/collaris/mafra/descontamination/illumina_reads/illumina_unaligned_in_refseq_prok.sam \
-out /home/fafinha/collaris/mafra/descontamination/illumina_reads/illumina_aligned_in_refseq_prok.sam
$/programs/samtools-1.12/bin/samtools view -Sb -@12 illumina_unaligned_in_refseq_prok.sam > illumina_unaligned_in_refseq_prok.bam
$/programs/samtools-1.12/bin/samtools sort illumina_unaligned_in_refseq_prok.bam -o illumina_unaligned_in_refseq_prok_sorted.bam -@12
$/programs/samtools-1.12/bin/samtools fastq -1 paired1.fq -2 paired2.fq -n illumina_unaligned_in_refseq_prok_sorted.bam -@12
GENOME SIZE ESTIMATION
Jellyfish (https://github.com/gmarcais/Jellyfish)
Counting k-mers
Prepare a .pbs file to run the analysis remotely on Sagarana
/programs/jellyfish/jellyfish-2.3.0 count -C -m 21 -s 10G -t 36 /home/fafinha/collaris/reads/genomic_reads/D2015099C_L4_304X04.R1.fastq \
/home/fafinha/collaris/reads/genomic_reads/D2015099C_L4_304X04.R2.fastq -o /home/fafinha/collaris/Jellyfish/reads.jf
/programs/jellyfish/jellyfish-2.3.0 histo -t 36 /home/fafinha/collaris/Jellyfish/reads.jf > /home/fafinha/collaris/Jellyfish/reads.histo
**Size estimation**
/////STRATEGY #1: GenomeScope (on my PC)\\\\\
*Go to the directory where reads.histo is located*
$/home/rafael/genomescope2.0/genomescope.R -i reads.histo -o output -k 21
/////STRATEGY #2: findGSE (in R)\\\\\
*Go to the directory where reads.histo is located*
$R
> library("findGSE")
> findGSE(histo="reads.histo", sizek=21, outdir="21mer")
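As a rough cross-check on the two estimators above, the genome size can be approximated directly from reads.histo as the number of non-error k-mers divided by the depth of the main coverage peak. The sketch below is not part of the original protocol: the function name `estimate_genome_size` is hypothetical, and it assumes a two-column Jellyfish histogram with one dominant peak, so it will be misleading for strongly heterozygous or highly repetitive genomes.

```shell
# estimate_genome_size HISTO: crude genome size from a "depth frequency" table.
# Assumption: one dominant coverage peak to the right of the error trough.
estimate_genome_size() {
    awk '
        { depth[NR] = $1 + 0; freq[NR] = $2 + 0 }
        END {
            # 1) error trough: first row whose frequency is lower than the next row
            trough = 1
            for (i = 1; i < NR; i++) {
                if (freq[i] < freq[i + 1]) { trough = i; break }
            }
            # 2) main peak: most frequent depth at or after the trough
            peak = trough
            for (i = trough; i <= NR; i++) {
                if (freq[i] > freq[peak]) peak = i
            }
            # 3) k-mers at or after the trough, divided by the peak depth
            total = 0
            for (i = trough; i <= NR; i++) total += depth[i] * freq[i]
            printf "%.0f\n", total / depth[peak]
        }
    ' "$1"
}

# Usage: estimate_genome_size reads.histo
```

If this figure differs widely from the GenomeScope and findGSE estimates, the histogram is worth inspecting by eye before choosing a `genome_size` value for the assembler.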
GENOME ASSEMBLY
NextDenovo (https://github.com/Nextomics/NextDenovo)
Prepare an 'input.fofn' file
$ls /home/fafinha/collaris/reads/genomic_reads/ONT_longreads_rawdata_collaris.fq > input.fofn
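Because a wrong path in a .fofn file may only surface after the job has started, it can help to validate the list first. The helper below is a small sketch, not part of NextDenovo; the name `check_fofn` is hypothetical, and it only tests that each listed file exists and is not empty.

```shell
# check_fofn FILE: report any listed path that is missing or empty.
# Returns 0 only if every non-blank line names an existing, non-empty file.
check_fofn() {
    fofn_status=0
    while IFS= read -r path || [ -n "$path" ]; do
        [ -z "$path" ] && continue          # skip blank lines
        if [ ! -s "$path" ]; then
            echo "missing or empty: $path" >&2
            fofn_status=1
        fi
    done < "$1"
    return "$fofn_status"
}

# Usage: check_fofn input.fofn && echo "input.fofn looks fine"
```

The same check applies unchanged to the sgs.fofn and lgs.fofn files used for polishing.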
***Prepare a 'run.cfg' file***
[General]
job_type = local
job_prefix = nextDenovo
task = all
rewrite = yes
deltmp = yes
parallel_jobs = 24
input_type = raw
read_type = ont
input_fofn = /home/fafinha/collaris/NextDenovo_run/input.fofn
workdir = 01_rundir

[correct_option]
read_cutoff = 1k
genome_size = 300m
sort_options = -m 20g -t 8
minimap2_options_raw = -t 8
pa_correction = 3
correction_options = -p 15

[assemble_option]
minimap2_options_cns = -t 8
nextgraph_options = -a 1
/home/fafinha/bin/NextDenovo/nextDenovo /home/fafinha/collaris/NextDenovo_run/run.cfg
GENOME ASSEMBLY STATISTICS
scaffolds_stats
Compare two runs and save the statistics to a .txt file
$scaffold_stats.pl -f run1/assembly.fasta run2/assembly.fasta -N 1 -t 1000 10000 | tee stats.txt
BBMap (https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbmap-guide/)
#BBMap is part of BBTools (https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/)
***Run the script stats.sh on the main output produced by NextDenovo, nd.asm.fasta (on my PC)***
$bash /mnt/d/Genomics/bbmap/stats.sh in=nd.asm.fasta out=nd.asm.fasta_stats.txt
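Where neither scaffold_stats.pl nor BBMap is at hand, the headline numbers can be reproduced with standard tools. The following sketch, with the hypothetical function name `assembly_n50`, prints the contig count, total length, and N50 of a FASTA file. It is meant as a sanity check on the reports above, not a replacement for them, and it assumes Unix line endings.

```shell
# assembly_n50 FASTA: print "contigs total_bp N50" for a (possibly multi-line) FASTA.
assembly_n50() {
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }   # header: flush previous length
        { len += length($0) }                             # add up sequence line lengths
        END { if (len > 0) print len }
    ' "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            if (NR == 0) exit 1
            half = total / 2
            # N50: length of the contig at which the running sum reaches half the total
            for (i = 1; i <= NR; i++) {
                running += lens[i]
                if (running >= half) {
                    printf "%d %.0f %d\n", NR, total, lens[i]
                    exit
                }
            }
        }
    '
}

# Usage: assembly_n50 nd.asm.fasta
```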
**Run BUSCO (using Docker)**
$docker run --rm -e USERID=$UID -u $UID -v /home/rferrari/:/home/rferrari/ -w /home/rferrari/projetos/collaris/BUSCO_run/genome/post_polishment/NextDenovo/run4_RF_final/SRs ezlabgva/busco:v5.2.2_cv1 busco -i /home/rferrari/projetos/collaris/BUSCO_run/genome/post_polishment/NextDenovo/run4_RF_final/SRs/genome.nextpolish.fasta -l hymenoptera_odb10 --augustus_species Apis_mellifera -o run1 -m geno -c 12
#To see the list of available reference datasets
$docker run --rm -e USERID=$UID -u $UID -v /home/rferrari/:/home/rferrari/ -w /home/rferrari ezlabgva/busco:v5.2.2_cv1 busco --list-datasets
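When several assemblies or polishing rounds are compared, it is convenient to pull the completeness score out of each BUSCO summary. The sketch below assumes the one-line score notation that BUSCO writes in its short summary file (`C:..%[S:..%,D:..%],F:..%,M:..%,n:..`). The function name `busco_complete_pct` is hypothetical, and the location of the summary file inside the output directory (here `run1`) should be checked against your own run.

```shell
# busco_complete_pct SUMMARY: print the "C:" (complete) percentage from a
# BUSCO short summary file, e.g. 95.0 for a line starting with C:95.0%[S:...
busco_complete_pct() {
    sed -n 's/.*C:\([0-9.]*\)%\[.*/\1/p' "$1" | head -n 1
}

# Usage (path is an assumption): busco_complete_pct run1/short_summary*.txt
```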
GENOME POLISHING
NextPolish (https://github.com/Nextomics/NextPolish)
Using only short reads
Prepare a 'sgs.fofn' file
$ls /home/fafinha/collaris/reads/genomic_reads/Illumina_shortreads.R1.fastq /home/fafinha/collaris/reads/genomic_reads/Illumina_shortreads.R2.fastq > sgs.fofn
**Create a 'run.cfg' file**
[General]
job_type = local
job_prefix = nextPolish
task = best
rewrite = yes
rerun = 3
parallel_jobs = 6
multithread_jobs = 5
genome = /home/fafinha/collaris/mafra/flye_run/run1/assembly.fasta
genome_size = auto
workdir = /home/fafinha/collaris/Nextpolish_run/01_rundir
polish_options = -p {multithread_jobs}

[sgs_option]
sgs_fofn = /home/fafinha/collaris/Nextpolish_run/sgs.fofn
sgs_options = -max_depth 100 -bwa
/programs/NextPolish_n005/nextPolish /home/fafinha/collaris/Nextpolish_run/run.cfg
Using short and long reads
**Prepare the 'sgs.fofn' and 'lgs.fofn' files**
$ls /home/fafinha/collaris/reads/genomic_reads/Illumina_shortreads.R1.fastq /home/fafinha/collaris/reads/genomic_reads/Illumina_shortreads.R2.fastq > sgs.fofn
$ls /home/fafinha/collaris/reads/genomic_reads/ONT_longreads_rawdata_collaris.fq > lgs.fofn
**Create a 'run.cfg' file**
[General]
job_type = local
job_prefix = nextPolish
task = best
rewrite = yes
rerun = 3
parallel_jobs = 8
multithread_jobs = 8
genome = /home/fafinha/collaris/mafra/flye_run/run1/assembly.fasta
genome_size = 300m
workdir = /home/fafinha/collaris/NextPolish_run/run5/long_short_reads/01_rundir
polish_options = -p {multithread_jobs}

[sgs_option]
sgs_fofn = /home/fafinha/collaris/NextPolish_run/run5/short_reads/sgs.fofn
sgs_options = -max_depth 100 -bwa

[lgs_option]
lgs_fofn = /home/fafinha/collaris/NextPolish_run/run5/long_short_reads/lgs.fofn
lgs_options = -min_read_len 1k -max_depth 100
lgs_minimap2_options = -x map-ont
/programs/NextPolish_n005/nextPolish /home/fafinha/collaris/NextPolish_run/run5/long_short_reads/run.cfg