Demultiplexing Nanopore reads with LAST

David A Eccles

Published: 2021-12-23 DOI: 10.17504/protocols.io.b28fqhtn

Abstract

This protocol is for a semi-manual method for read demultiplexing, as used after my presentation Sequencing DNA with Linux Cores and Nanopores to work out the number of reads captured by different barcodes.

Input: reads as a FASTQ file, barcode sequences as a FASTA file

Output: reads split into single FASTQ files per target [barcode]

Note: barcode / adapter sequences are not trimmed by this protocol

Before start

Steps

Combine read sequences

Combine all input reads into a single file

pv ../called_all/*.fastq | gzip > reads_all.fastq.gz
```Note: I'm using the pipe viewer command  _pv_  to produce a progress indicator while the command is running. If this command is not available, it can be replaced with  _cat_  with no change in function (apart from not showing progess).

Generating Barcode Index

Prepare a FASTA file containing barcode sequences (see attached FASTA file). To reduce the chance of mismatched adapters, this should only contain the barcode sequences. That restriction means this approach will not work for short reads, where the barcode sequences are very likely to occur within sequences.

barcode_base.fa

Prepare the LAST index for the barcode file. Following Martin Frith's recommendation, the '-uNEAR' seeding scheme is used to slightly increase sensitivity. This will generate seven additional files of the form .XXX:

lastdb -uNEAR barcode_base.fa barcode_base.fa

Prepare a substitution matrix for barcode mapping. The default substitution matrix is swayed too much by INDELs in the barcode sequences, so here's one that I've developed using last-train using super-accuracy Nanopore reads called using Guppy v5:

#last -Q 1
#last -t4.34158
#last -a 31
#last -A 35
#last -b 4
#last -B 3
#last -S 1
# score matrix (query letters = columns, reference letters = rows):
       A      C      G      T
A      8    -18    -14    -22
C    -18      7    -20    -23
G    -14    -20      6    -24
T    -22    -23    -24      4
```[bc.mat](https://static.yanyin.tech/literature_test/protocol_io_true/protocols.io.b28fqhtn/f8ryjpt67.mat) 





I generated this matrix from the following command:

last-train --matsym -Q 1 -P 10 barcode_base.fa reads_all.fastq.gz > bc.mat


This matrix has a small penalty for opening gaps (i.e. insertions and deletions), and a smaller penalty for inserting them. Insertions and deletions are considered to be equally likely in the barcode region. It also has a moderate penalty for A/G transition variants, and a higher penalty for C/T transition variants (but still lower than other substitution penalties).

Mapping Reads to Barcodes

Use LAST, including FASTQ quality scores for substitution (-Q 1), using the pre-defined substitution matrix to map the reads. In this example, it is distributed over 10 processing threads (-P 10). Here maf-convert is used to convert to a single line per match, cut retains only the barcode and read IDs, and uniq is used to make sure that multiple same barcodes per read (e.g. for reverse / complement barcodes at each end) will not produce duplicates:

lastal -Q 1 -P10 -p bc.mat barcode_base.fa <(pv reads_all.fastq.gz) | \
  maf-convert -n tab | cut -f 2,7 | sort | uniq | \
  gzip > barcode_assignments.txt.gz
```For a more stringent search, lastal can do an overlap match (-T 1) rather than the default local match, which will make sure that only the ends of the barcode are matched (hopefully the entire barcode) when demultiplexing. The downside is that this is more likely to drop reads due to slight mismatches in the barcode portion of the read:


Stringency can also be altered by adjusting the query letters per random alignment setting (-D <value>, 1e6 by default). Lowering this number will produce more matches, at the expense of more false positive matches:


The output of this command will be a gzipped tab-separated 2-column file with barcode names in the first column, and read IDs in the second column.

Splitting Read File Per Barcode

For each discovered barcode, using the appropriate read category assignment file, find the corresponding read IDs, then extract those IDs out of the read FASTQ file. This uses one of my scripts, fastx-fetch.pl, to do this directly from a FASTQ file. The ' -lengths ' command-line parameter also outputs sequence lengths for each read (see next step):

fastx-fetch.pl

mkdir -p demultiplexed
fastx-fetch.pl -lengths -demultiplex barcode_assignments.txt.gz \
  -prefix 'demultiplexed/reads' <(pv reads_all.fastq.gz) > barcode_counts.txt

Note: this demultiplexing code will only by default put reads into a barcode bin if they have a single unique barcode sequence detected. Otherwise, they will be put into a 'BCchim' bin if multiple adapters are detected (i.e. a chimeric read), or a 'BCmiss' bin if no adapters are detected. If these reads should be duplicated and put in one bin per barcode, then the -chimeric option can be added to the command arguments:

mkdir -p demultiplexed
fastx-fetch.pl -lengths -demultiplex barcode_assignments.txt.gz -chimeric \
  -prefix 'demultiplexed/reads' <(pv reads_all.fastq.gz) > barcode_counts.txt

[optional] Displaying Read Length Statistics

The lengths output from the demultiplexing step can be fed into another one of my scripts, length-plot.r, in order to display length-based QC plots:

length_plot.r demultiplexed/lengths_*.txt.gz
```As output, this produces a multi-page PDF file,  _Sequence_curves.pdf_ . Here are some examples of the plots that are produced:





 **1. Read Count Frequency Curve** 





<img src="https://static.yanyin.tech/literature_test/protocol_io_true/protocols.io.b28fqhtn/Sequence_curves_00.png" alt="Read count frequency curve for five samples, showing a variety of different read length distributions" loading="lazy" title="Read count frequency curve for five samples, showing a variety of different read length distributions"/>



 **2. Called Bases Frequency Curve** 





  

<img src="https://static.yanyin.tech/literature_test/protocol_io_true/protocols.io.b28fqhtn/Sequence_curves_01.png" alt="Called bases frequency curve for five samples, showing a variety of different read length distributions" loading="lazy" title="Called bases frequency curve for five samples, showing a variety of different read length distributions"/>



 **3. Called Bases Density Curve** 





<img src="https://static.yanyin.tech/literature_test/protocol_io_true/protocols.io.b28fqhtn/Sequence_curves_02.png" alt="Sample-normalised called bases density curve for five samples, showing a variety of different read length distributions" loading="lazy" title="Sample-normalised called bases density curve for five samples, showing a variety of different read length distributions"/>

 **4. Cumulative Sequenced Bases Curve** 





<img src="https://static.yanyin.tech/literature_test/protocol_io_true/protocols.io.b28fqhtn/Sequence_curves_03.png" alt="Sample-normalised cumulative sequenced base curve for five samples, showing a variety of different read length distributions" loading="lazy" title="Sample-normalised cumulative sequenced base curve for five samples, showing a variety of different read length distributions"/>



 **5. Digital Electrophoresis Plot (relative frequency)** 





<img src="https://static.yanyin.tech/literature_test/protocol_io_true/protocols.io.b28fqhtn/Sequence_curves_04.png" alt="Relative digital electrophoresis plot for five samples, showing a variety of different read length distributions" loading="lazy" title="Relative digital electrophoresis plot for five samples, showing a variety of different read length distributions"/>



 **5. Digital Electrophoresis Plot (sample-normalised)** 





<img src="https://static.yanyin.tech/literature_test/protocol_io_true/protocols.io.b28fqhtn/Sequence_curves_05.png" alt="Sample-normalised digital electrophoresis plot for five samples, showing a variety of different read length distributions" loading="lazy" title="Sample-normalised digital electrophoresis plot for five samples, showing a variety of different read length distributions"/>

Downstream Workflows

Following on from here, cDNA reads can be oriented in preparation for stranded mapping:

Preparing Reads for Stranded Mapping Plasmid DNA sequences can be mapped or assembled:
Plasmid Sequence Analysis from Long Reads