Analyzing Metagenome Composition using the LIN taxonomic framework

Tessa Pierce Ward

March 2023

requires sourmash v4.8+


This tutorial uses the sourmash taxonomy module, which was introduced via blog post and was recently shown to perfom well for taxonomic profiling of long (and short) reads in Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets, Portik et al., 2022.

In this tutorial, we’ll use sourmash gather to analyze metagenomes using the LIN taxonomic framework. Specifically, we will analyze plant metagenomes with a low-level pathogen spike-in. The goal is to see if we can correctly assign the pathogen sequence to its LINgroup, which includes all known pathogenic strains.

  • barcode1 - highest spike-in (75 picogram/microliter pathogen DNA)

  • barcode3 - lower spike-in (7.5 picogram/microliter pathogen DNA)

  • barcode5 - no spike-in

The pathogen is Ralstonia solanacearum in the Phylum IIB sequevar 1 group.

This data is courtesy of The Laboratory of Plant & Atmospheric Microbiology & (Meta)Genomics in collaboration with USDA APHIS.

Install sourmash

First, we need to install the software! We’ll use conda/mamba to do this.

The below command installs sourmash.

Install the software:

# create a new environment
mamba create -n smash -y -c conda-forge -c bioconda sourmash

then activate the conda environment:

conda activate smash

Victory conditions: your prompt should start with (smash) and you should now be able to run sourmash and have it output usage information!!

Create a working subdirectory

Make a directory named smash_lin, change into it:

mkdir -p ~/smash_lin
cd ~/smash_lin

Now make a couple useful folders:

mkdir -p inputs
mkdir -p databases

Download relevant data

First, download a database and taxonomic information

Here, we know the spike-in is a pathogenic seqevar of Ralstonia. We will download a database containing signatures of 27 Ralstonia genomes (pathogenic and not) and the corresponding taxonomic and lingroup information.

# database
curl -JLO https://osf.io/vxsta/download
mv ralstonia*.zip ./databases/ralstonia.zip

# taxonomy csv
curl -JLO https://raw.githubusercontent.com/bluegenes/2023-demo-sourmash-LIN/main/databases/ralstonia-lin.taxonomy.GCA-GCF.csv
mv ralstonia-lin.taxonomy.GCA-GCF.csv ./databases

# lingroup csv
curl -JLO https://raw.githubusercontent.com/bluegenes/2023-demo-sourmash-LIN/main/inputs/ralstonia.lingroups.csv
mv ralstonia.lingroups.csv ./databases

ls databases # look at the database files

Next, download pre-made sourmash signatures made from the input metagenomes

# download barcode 1 sig
curl -JLO https://osf.io/ujntr/download
mv barcode1_22142.sig.zip ./inputs/

# download barcode 3 signature
curl -JLO https://osf.io/2h9wx/download
mv barcode3_31543.sig.zip ./inputs

# download barcode 5 signature
curl -JLO https://osf.io/k8nw5/download
mv barcode5_36481.sig.zip ./inputs

# look at available input files
ls inputs

Look at the signatures

Let’s start with the barcode1 (highest spike-in) sample

First, let’s look at the metagenome signature.

By running sourmash sig fileinfo, we can see information on the signatures available within the zip file.

Here, you can see I’ve generated the metagenome signature with scaled=1000 and built two ksizes, k=31 and k=51

Run:

sourmash sig fileinfo ./inputs/barcode1_22142.sig.zip

In the output, you should see:

** loading from './inputs/barcode1_22142.sig.zip'
path filetype: ZipFileLinearIndex
location: /home/jovyan/smash_lin/inputs/barcode1_22142.sig.zip
is database? yes
has manifest? yes
num signatures: 2
total hashes: 914328
summary of sketches:
   1 sketches with DNA, k=31, scaled=1000, abund      426673 total hashes
   1 sketches with DNA, k=51, scaled=1000, abund      487655 total hashes

We can also look at the database

Here, you can see I’ve generated the database with scaled=1000 and built three ksizes, k=21, k=31 and k=51

Run:

sourmash sig fileinfo ./databases/ralstonia.zip

In the output, you should see:

** loading from './databases/ralstonia.zip'
path filetype: ZipFileLinearIndex
location: /home/jovyan/databases/ralstonia.zip
is database? yes
has manifest? yes
num signatures: 81
** examining manifest...
total hashes: 445041
summary of sketches:
   27 sketches with DNA, k=21, scaled=1000, abund     148324 total hashes
   27 sketches with DNA, k=31, scaled=1000, abund     148111 total hashes
   27 sketches with DNA, k=51, scaled=1000, abund     148606 total hashes

There’s a lot of things to digest in this output but the two main ones are:

  • there are 27 genomes represented in this database, each of which are sketched at k=21,k=31,k=51

  • this database represents ~445 million k-mers (multiply number of hashes by the scaled number)

Run sourmash gather using ksize 51

Now let’s run sourmash gather to find the closest reference genome(s) in the database. If you want to read more about what sourmash is doing, please see Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers, Irber et al., 2022.

Run:

query="inputs/barcode1_22142.sig.zip"
database="databases/ralstonia.zip"

gather_csv_output="barcode1_22141.k51.gather.csv"

sourmash gather $query $database -k 51 -o $gather_csv_output

You should see the following output:

selecting specified query k=51
loaded query: barcode1_22142... (k=51, DNA)
--ading from 'databases/ralstonia.zip'...
loaded 81 total signatures from 1 locations.
after selecting signatures compatible with search, 27 remain.
Starting prefetch sweep across databases.

Found 7 signatures via prefetch; now doing gather.

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
105.0 kbp      0.0%    2.0%       1.0    GCA_002251655.1 Ralstonia solanacear...
found less than 50.0 kbp in common. => exiting

found 1 matches total;
the recovered matches hit 0.0% of the abundance-weighted query.
the recovered matches hit 0.0% of the query k-mers (unweighted).

The first step of gather found all potential matches (7), and the greedy algorithm narrowed this to a single best match, GCA_002251655.1 which shared an estimated 105 kbp with the metagenome (a very small percentage of the total dataset.) This is expected, though, since the dataset is a plant metagenome with a small Ralstonia spike-in.

Add taxonomic information and summarize up lingroups

sourmash gather finds the smallest set of reference genomes that contains all the known information (k-mers) in the metagenome. In most cases, gather will find many metagenome matches. Here, we’re only looking for Ralstonia matches and we only have a single gather result. Regardless, let’s use sourmash tax metagenome to add taxonomic information and see if we’ve correctly assigned the pathogenic sequence.

First, let’s look at the relevant taxonomy files.

These commands will show the first few lines of each file. If you prefer, you can look at a more human-friendly view by opening the files in a spreadsheet program.

  • taxonomy_csv: databases/ralstonia-lin.taxonomy.GCA-GCF.csv

    • the essential columns are lin (14;1;0;...) and ident (GCF_00…)

  • lingroups information: databases/ralstonia.lingroups.csv

    • both columns are essential (name, lin)

Look at the taxonomy file:

head -n 5 databases/ralstonia-lin.taxonomy.GCA-GCF.csv

You should see:

lin,species,strain,filename,accession,ident
14;1;0;0;0;0;0;0;0;0;6;0;1;0;1;0;0;0;0;0,Ralstonia solanacearum,OE1_1,GCF_001879565.1_ASM187956v1_genomic.fna,GCF_001879565.1,GCF_001879565.1
14;1;0;0;0;0;0;0;0;0;6;0;1;0;0;0;0;0;0;0,Ralstonia solanacearum,PSS1308,GCF_001870805.1_ASM187080v1_genomic.fna,GCF_001870805.1,GCF_001870805.1
14;1;0;0;0;0;0;0;0;0;2;1;0;0;0;0;0;0;0;0,Ralstonia solanacearum,FJAT_1458,GCF_001887535.1_ASM188753v1_genomic.fna,GCF_001887535.1,GCF_001887535.1
14;1;0;0;0;0;0;0;0;0;2;0;0;4;4;0;0;0;0;0,Ralstonia solanacearum,Pe_13,GCF_012062595.1_ASM1206259v1_genomic.fna,GCF_012062595.1,GCF_012062595.1

The key columns are:

  • ident, containing identifiers matching the database sketches

  • lin, containing the species information.

Now, let’s look at the lingroups file

head -n5 databases/ralstonia.lingroups.csv

You should see:

name,lin
Phyl II,14;1;0;0;0;3;0
Phyl IIA,14;1;0;0;0;3;0;1;0;0
Phyl IIB,14;1;0;0;0;3;0;0
Phyl IIB seq1 and seq2,14;1;0;0;0;3;0;0;0;0;1;0;0;0;0

Here, we have two columns:

  • name - the name for each lingroup.

  • lin - the LIN prefix corresponding to each group.

Now, run sourmash tax metagenome to integrate taxonomic information into gather results

Using the gather output we generated above, we can integrate taxonomic information and summarize up “ranks” (lin positions). We can produce several different types of outputs, including a lingroup report.

lingroup format summarizes the taxonomic information at each lingroup, and produces a report with 4 columns:

  • name (from lingroups file)

  • lin (from lingroups file)

  • percent_containment - total % of the file matched to this lingroup

  • num_bp_contained - estimated number of bp matched to this lingroup

Since sourmash assigns all k-mers to individual genomes, no reads/base pairs are “assigned” to higher taxonomic ranks or lingroups (as with Kraken-style LCA). Here, “percent_containment” and “num_bp_contained” is calculated by summarizing the assignments made to all genomes in a lingroup. This is akin to the “contained” information in Kraken-style reports.

Run tax metagenome:

gather_csv_output="barcode1_22141.k51.gather.csv"
taxonomy_csv="databases/ralstonia-lin.taxonomy.GCA-GCF.csv"
lingroups_csv="databases/ralstonia.lingroups.csv"

sourmash tax metagenome -g $gather_csv_output -t $taxonomy_csv \
                        --lins --lingroup $lingroups_csv

You should see:

loaded 1 gather results from 'barcode1_22141.k51.gather.csv'.
loaded results for 1 queries from 1 gather CSVs
Starting summarization up rank(s): 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 
Read 11 lingroup rows and found 11 distinct lingroup prefixes.

and the results:

name	lin	percent_containment	num_bp_contained
Phyl II	14;1;0;0;0;3;0	0.02	108000
Phyl IIB	14;1;0;0;0;3;0;0	0.02	108000
Phyl IIB seq1 and seq2	14;1;0;0;0;3;0;0;0;0;1;0;0;0;0	0.02	108000
IIB seq1	14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;0;0	0.02	108000

Here, the most specific lingroup we assign to is Phyl IIB seq1, which is the pathogenic lingroup that was spiked in, yay! Note that the other groups in the output all contain this group.

Now output the lingroup report to a file (instead of to the terminal)

use -o to provide an output basename for taxonomic output.

gather_csv_output="barcode1_22141.k51.gather.csv"
taxonomy_csv="databases/ralstonia-lin.taxonomy.GCA-GCF.csv"
lingroups_csv="databases/ralstonia.lingroups.csv"

sourmash tax metagenome -g $gather_csv_output -t $taxonomy_csv \
                        --lins --lingroup $lingroups_csv \
                        -o "barcode1"

You should see saving 'lingroup' output to 'barcode1.lingroup.tsv' in the output.

Optionally, write multiple output formats

You can use -F to specify additional output formats. Here, I’ve added csv_summary. Note that while the lingroup format will be generated automatically if you specify the --lingroup file, you can also specify it with -F lingroup if you want, as I’ve done here.

Run:

gather_csv_output="barcode1_22141.k51.gather.csv"
taxonomy_csv="databases/ralstonia-lin.taxonomy.GCA-GCF.csv"
lingroups_csv="databases/ralstonia.lingroups.csv"

sourmash tax metagenome -g $gather_csv_output -t $taxonomy_csv \
                        --lins --lingroup $lingroups_csv \
                        -F lingroup csv_summary -o "barcode1"

You should see the following in the output:

saving 'csv_summary' output to 'barcode1.summarized.csv'.
saving 'lingroup' output to 'barcode1.lingroup.txt'.

The csv_summary format is the full summary of this sample, e.g. the summary at each taxonomic rank (LIN position). It also includes an entry with the unclassified portion at each rank.

Note: Multiple output formats require the -o --output-base to be specified, as each must be written to a file.

Here’s an abbreviated version of the gather results for barcode1, with lingroup information added:

ksize

scaled

best overlap

gather match(es)

lingroup

lin

bc1

51

1000

105 kb

GCA_002251655.1

IIB seq1

14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;0;0

bc1

31

1000

173 kb

GCA_002251655.1

IIB seq1

14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;0;0

Now run with barcode3 sample

sourmash gather

Run:

query="inputs/barcode3_31543.sig.zip"
database="databases/ralstonia.zip"

gather_csv_output="barcode3_31543.dna.k51.gather.csv"

sourmash gather $query $database -k 51 -o $gather_csv_output

You should see:

selecting specified query k=51
loaded query: barcode3_31543... (k=51, DNA)
loading from 'databases/ralstonia.zip'...
loaded 81 total signatures from 1 locations.
after selecting signatures compatible with search, 27 remain.
Starting prefetch sweep across databases.
Found 0 signatures via prefetch; now doing gather.
found less than 50.0 kbp in common. => exiting


found 0 matches total;
the recovered matches hit 0.0% of the query k-mers (unweighted).

gather found no sequence matches! But, we can lower the detection threshold:

query="inputs/barcode3_31543.sig.zip"
database="databases/ralstonia.zip"
gather_csv_output="barcode3_31543.k51.gather.csv"

# use a 10kb detection threshold
sourmash gather $query $database -k 51 --threshold-bp 10000 -o $gather_csv_output

This time, you should see:

selecting specified query k=51
loaded query: barcode3_31543... (k=51, DNA)
loading from 'databases/ralstonia.zip'...
loaded 81 total signatures from 1 locations.
after selecting signatures compatible with search, 27 remain.
Starting prefetch sweep across databases.
Found 6 signatures via prefetch; now doing gather.


overlap     p_query p_match avg_abund
---------   ------- ------- ---------
12.0 kbp       0.0%    0.2%       1.0    GCA_000750575.1 Ralstonia solanacear...

found 1 matches total;
the recovered matches hit 0.0% of the abundance-weighted query.
the recovered matches hit 0.0% of the query k-mers (unweighted).

You’ll notice that while we have an estimated ~12kbp overlap, the matched genome (GCA_000750575.1) is different from the one matched above for barcode5. If you run sourmash tax metagenome on this output, you’ll see that this genome belongs to Phyl IIB seq 2 group, which is a sister group to the correct Phyl IIB seq 1 group that we expected. So we have a match but it’s not the right one – why not?

What happened? Use prefetch to investigate

sourmash gather has two steps: first, it runs a prefetch to find ALL genome matches, and then uses a greedy approach to select the smallest set of genomes that contain (‘cover’) all known sequence content. Let’s run prefetch independently so we can look at the results of the first step. Here, let’s use --threshold-bp 0 to get all possible matches.

Run:

query="inputs/barcode3_31543.sig.zip"
prefetch_csv_output="barcode3_31543.k51.prefetch.csv"
database="databases/ralstonia.zip"

sourmash prefetch $query $database -k 51 --threshold-bp 0 -o $prefetch_csv_output 

You should see:

selecting specified query k=51
loaded query: barcode3_31543... (k=51, DNA)
query sketch has scaled=1000; will be dynamically downsampled as needed.
--tal of 10 matching signatures so far.tonia.zip'
loaded 81 total signatures from 1 locations.
after selecting signatures compatible with search, 27 remain.
--
total of 15 matching signatures.
saved 15 matches to CSV file 'barcode3_31543.k51.prefetch.csv'
of 487043 distinct query hashes, 12 were found in matches above threshold.
a total of 487031 query hashes remain unmatched.
final scaled value (max across query and all matches) is 1000

Here, the output is telling us we found matches to 15 of the 27 Ralstonia genomes. But only 12 k-mers were shared between the metagenome sample and the genomes. Remember that sourmash uses a representative subsample of all k-mers, so here these 12 k-mers represent ~ 12kb of sequence (12 * scaled). We’ve found that this is sufficient to detect presence of an organism, but at this low level, it can be hard to distinguish between closely-related genomes. Let’s open the prefetch output to see how those 12 k-mers matched between different genomes.

Look at the barcode3_31543.k51.prefetch.csv file

Use a spreadsheet program on your computer or use less -S barcode3_31543.k51.prefetch.csv to see the file on the terminal. If using less, hit q when you want to exit and return to your terminal prompt.

The first column contains the estimated number of base pairs matched between our query and each matching reference genome. You’ll notice there are four genomes that match 12kb of sequence, one of which is the “correct” genome (GCA_002251655.1, which is in the IIB seq1 lingroup).

What is happening here?

When faced with equally good matches, sourmash gather makes a random choice about which genome to assign these k-mers to. This happens primarily with highly similar genomes and/or very small sequence matches. If this happens and you need to distinguish between these genomes, we recommend trying a lower scaled value (higher resolution). “scaled” refers to the systematic downsampling: we keep rougly 1/scaled k-mers (scaled=1000 keeps ~1 of every 1000 unique k-mers). scaled=1 keeps all k-mers, but our signature storage is not optimized for this use case.

To see if we could robustly assign the correct sequevar for barcode3 using a higher resolution sketch, I also ran gather using scaled=100.

Here’s an abbreviated version of the gather results for barcode3, with lingroup information added:

ksize

scaled

best overlap

gather match(es)

lingroup

lin

bc3

51

1000

12kb

GCA_000750575.1

IIB seq2

14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;1;0

bc3

31

1000

28 kb

GCA_002251655.1

IIB seq1

14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;0;0

bc3

51

100

14.8 kb

GCA_002251655.1

IIB seq1

14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;0;0

bc3

31

100

21.1 kb

GCA_002251655.1

IIB seq1

14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;0;0

We typically use k=51 for strain-level matching and k=31 for species-level matching. Notice that running at k=31 with scaled 1000 found the right match. However, if you run prefetch for k=31, you see there are three matches with 28kb overlap, so we just got lucky that gather selected the right one for this test case.

In contrast, by sketching the Ralstonia genomes and metagenome at higher resolution (scaled=100), we had sufficient information to correctly assign the sequence to the IIB seq1 lingroup at either ksize.

Now try the barcode5 sample

You can also run the barcode5 file using the same commands as above:

query="inputs/barcode5_36481.sig.zip"
database="databases/ralstonia.zip"

gather_csv_output="barcode5_36481.dna.k51.gather.csv"

sourmash gather $query $database -k 51 -o $gather_csv_output

You should see:

selecting specified query k=51
loaded query: barcode5_36481... (k=51, DNA)
--
loaded 81 total signatures from 1 locations.
after selecting signatures compatible with search, 27 remain.

Starting prefetch sweep across databases.
Found 0 signatures via prefetch; now doing gather.
found less than 50.0 kbp in common. => exiting

found 0 matches total;
the recovered matches hit 0.0% of the query k-mers (unweighted).

No matches are found. If you drop the threshold-bp to 0 (--threshold-bp 0), you can find ~1kbp overlap (a single k-mer match!). Note, we do not recommend trusting/using results with fewer than 3 k-mer matches (3kbp at scaled=1000). Especially in larger databases (e.g. NCBI/GTDB), a single k-mer match might actually be from contamination in the reference genome rather than true genome content, so you may end up assigning the wrong lineage. Requiring 3 k-mers (representing ~3kb of matching sequence) makes it more likely your matches represent true genome content.

I then ran this file at higher resolution to see how the results changed. In each case, very few k-mers matched and we could not robustly identify a specific Ralstonia genome or lingroup. As it turns out, barcode5 does not have a Ralstonia spike-in, so this is a good thing!

Here’s an abbreviated version of the gather results for barcode5, with lingroup information added in cases with a single gather match:

ksize

scaled

best overlap

gather match(es)

lingroup

lin

bc5

51

1000

1 kbp

GCA_000750575.1

IIB seq2

14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;1;0

bc5

31

1000

0

N/A

bc5

51

100

300bp

all

bc5

31

100

1.2 kb

all

bc5

51

10

120 bp

all

bc5

31

10

670 bp

all

bc5

51

5

150 bp

all

bc5

31

5

500 bp

all

Again, while I’ve used a threshold-bp of 0 to get the gather match at scaled=1000, we do not typically trust gather matches with less than 3*scaled overlap (< 3 k-mers matched). Even at very high resolution (scaled=5), we matched nearly all Ralstonia genomes and could not distinguish a single lingroup.

We typically recommend running at scaled=1000 (our default), as this works for most microbial use cases. You can run at higher resolution (lower scaled) if you need to, but higher resolution signatures are larger and can take significantly longer to build and search - use at your own risk :).

Summary and concluding thoughts

The LIN taxonomic framework may be useful distinguishing groups below the species level. We can now use LINs and lingroups with sourmash tax metagenome. For low level matches, the gather greedy approach can struggle. We are working on ways to better warn users about this behavior and welcome feedback and suggestions on our issue tracker.