Stranded Mapping from Oriented Long Reads

David A Eccles

Published: 2022-03-18 DOI: 10.17504/protocols.io.yxmvm45nv3pe/v9

Abstract

This protocol demonstrates how to map strand-oriented long reads to a genome, and visualise them in a genome browser.

The general idea is to use minimap2 to create stranded BAM files, which are split for forward/reverse orientation then converted into BigWig format for display in a genome browser.

Input(s) :

stranded fastq files (see protocol Preparing Reads for Stranded Mapping)
a FASTA file containing the genome / sequence of interest. Output(s):
Genome-mapped stranded BAM files
Genome-mapped stranded BigWig files

Before start

You will need access to the following free and open-source software program(s):

And the following additional data file(s):

a FASTA file containing the genome / sequence of interest.

Steps

Orient Reads

Orient reads as per protocol Preparing Reads for Stranded Mapping.

If this has been done, then the following command should produce output without errors:

for bc in $(awk '{print $2}' barcode_counts.txt);
  do ls oriented/${bc}_reads_dirAdjusted.fq.gz;
done
```Example output:

oriented/BC03_reads_dirAdjusted.fq.gz oriented/BC04_reads_dirAdjusted.fq.gz oriented/BC05_reads_dirAdjusted.fq.gz oriented/BC06_reads_dirAdjusted.fq.gz oriented/BC07_reads_dirAdjusted.fq.gz oriented/BC08_reads_dirAdjusted.fq.gz

Index Preparation

Prepare genome index for spliced alignment:

Software

Value	Label
minimap2	NAME
Linux	OS_NAME
https://github.com/lh3/minimap2	REPOSITORY
Heng Li	DEVELOPER
https://github.com/lh3/minimap2/releases/tag/v2.24	LINK
2.24	VERSION

minimap2 -d GRCm39.primary_assembly.genome-splice.idx -Q -t 10 -x splice GRCm39.primary_assembly.genome.fa

Prepare a chromosome size index file:

samtools faidx GRCm39.primary_assembly.genome.fa
cat GRCm39.primary_assembly.genome.fa.fai | awk '{print $1"\t"$2}' > GRCm39.primary_assembly.genome.chrInfo.txt

Read Mapping

Map the long reads to the genome using minimap2, using samtools to covert to a sorted BAM format. This is where the reverse complementing done during demultiplexing gives a big saving of effort. As this BAM file is one of the main outputs, the run name is added to the file name.

Software

Value	Label
SAMtools	NAME
Linux	OS_NAME
https://github.com/samtools/samtools	REPOSITORY
Wellcome Trust Sanger Institute	DEVELOPER
http://www.htslib.org/download/	LINK
1.8	VERSION

runName="CHANGE_THIS";
indexFile="/index/location/GRCm39.primary_assembly.genome-splice.idx";
mkdir -p mapped;
for bc in $(awk '{print $2}' barcode_counts.txt);
  do echo ${bc};
  minimap2 -t 10 -a -x splice ${indexFile} oriented/${bc}_reads_dirAdjusted.fq.gz | \
    samtools view -b | samtools sort > mapped/mm2_${runName}_called_${bc}_vs_genome.bam;
done

Creating BigWig Coverage Files (not needed for JBrowse 2)

mpileupDC.pl

A bedGraph of coverage is created using BEDTools. The default options for BEDTools treat sequence deletions (which happen frequently in nanopore reads) as a drop in coverage, which can make exon hunting and coverage calculation more difficult. I submitted a pull request to add an additional ignoreD parameter to the command line to allow cDNA reads with split coverage across introns to ignore deletions when considering coverage; this request has now been incorporated into the main BEDtools repository (as of v2.30.0):

Software

Value	Label
BEDtools	NAME
Linux	OS_NAME
https://github.com/arq5x/bedtools2	REPOSITORY
Aaron Quinlan	DEVELOPER
https://github.com/arq5x/bedtools2/releases/tag/v2.30.0	LINK
2.30.0	VERSION

runName="CHANGE_THIS";
for bc in $(awk '{print $2}' barcode_counts.txt);
  do echo ${bc};
  basename="mapped/mm2_${runName}_called_${bc}_vs_genome";
  samtools view -b ${basename}.bam | \
    bedtools genomecov -bga -strand '+' -split -ignoreD -ibam - > ${basename}.bg.plus
  samtools view -b ${basename}.bam | \
    bedtools genomecov -bga -strand '-' -split -ignoreD -ibam - > ${basename}.bg.minus
  perl -i -pe 's/([0-9]+)$/-$1/' ${basename}.bg.minus
done;

Stranded bedgraph files are converted to bigwig. This requires BEDTools and a genome information file containing chromosome lengths, as generated in the first step:

runName="CHANGE_THIS";
infoFile="/index/location/GRCm39.primary_assembly.genome-splice.chrInfo.txt";
for bc in $(awk '{print $2}' barcode_counts.txt);
  do echo ${bc};
  basename="mapped/mm2_${runName}_called_${bc}_vs_genome"
  bedGraphToBigWig ${basename}.bg.plus ${infoFile} ${basename}.bw.plus
  bedGraphToBigWig ${basename}.bg.minus ${infoFile} ${basename}.bw.minus 
done

JBrowse 2 Configuration

Software

Value	Label
JBrowse 2	NAME
Linux	OS_NAME
https://github.com/GMOD/jbrowse-components	REPOSITORY
GMOD	DEVELOPER
https://github.com/GMOD/jbrowse-components/releases/tag/v1.6.7	LINK
1.6.7	VERSION

Each BAM file should have its own JBrowse rack. JBrowse 2 can create both alignment views and coverage plots from BAM files:

runName="CHANGE_THIS";
jbrowseLoc="/path/to/jbrowse2/folder";
for bc in $(awk '{print $2}' barcode_counts.txt);
  do echo ${bc};
  basename="mapped/mm2_${runName}_called_${bc}_vs_genome";
  # create index
  samtools index ${basename}.bam;
  # copy and create track in Jbrowse2 config file
  jbrowse add-track ${basename}.bam --load copy --out ${jbrowseLoc} --subDir bam
done;

If you have a "smart" file system mount that doesn't let you directly copy files via the jbrowse add-track command, you can use the /tmp file system and symlinks to do the same thing:

runName="CHANGE_THIS";
jbrowseLoc="/path/to/jbrowse2/folder";
mkdir -p /tmp/jbrowse/bam
rm -v /tmp/jbrowse/bam/*
mkdir -p ${jbrowseLoc}/bam
for bc in $(awk '{print $2}' barcode_counts.txt);
  do echo ${bc};
  basename="mapped/mm2_${runName}_called_${bc}_vs_genome";
  # create index
  samtools index ${basename}.bam;
  # copy JBrowse2 config file to temporary folder
  cp -v ${jbrowseLoc}/config.json /tmp/jbrowse/
  # copy and create track in Jbrowse2 config file
  jbrowse add-track ${basename}.bam --load symlink --out /tmp/jbrowse --subDir bam
  # copy BAM files and updated config file to JBrowse 2 folder
  cp -v /tmp/jbrowse/config.json ${jbrowseLoc}
  cp -v ${basename}.bam ${basename}.bam.bai ${jbrowseLoc}/bam
done;

Sanity Check

If this has worked properly, then mapping human or mouse to the mitochondrial genome with a LinearSNPCoverageDisplay in log scale should show a stepped expression, a bit like the Expected Results shown here:

Citation