Analyzing Metagenome Composition using the LIN taxonomic framework¶
Tessa Pierce Ward
March 2023
requires sourmash v4.8+
This tutorial uses the sourmash taxonomy
module, which was introduced via blog post
and was recently shown to perfom well for taxonomic profiling of long (and short) reads in Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets, Portik et al., 2022.
In this tutorial, we’ll use sourmash gather to analyze metagenomes using the LIN taxonomic framework. Specifically, we will analyze plant metagenomes with a low-level pathogen spike-in. The goal is to see if we can correctly assign the pathogen sequence to its LINgroup, which includes all known pathogenic strains.
barcode1
- highest spike-in (75 picogram/microliter pathogen DNA)barcode3
- lower spike-in (7.5 picogram/microliter pathogen DNA)barcode5
- no spike-in
The pathogen is Ralstonia solanacearum
in the Phylum IIB sequevar 1
group.
This data is courtesy of The Laboratory of Plant & Atmospheric Microbiology & (Meta)Genomics in collaboration with USDA APHIS.
Install sourmash¶
First, we need to install the software! We’ll use conda/mamba to do this.
The below command installs sourmash.
Install the software:
# create a new environment
mamba create -n smash -y -c conda-forge -c bioconda sourmash
then activate the conda environment:
conda activate smash
Victory conditions: your prompt should start with
(smash)
and you should now be able to runsourmash
and have it output usage information!!
Create a working subdirectory¶
Make a directory named smash_lin
, change into it:
mkdir -p ~/smash_lin
cd ~/smash_lin
Now make a couple useful folders:
mkdir -p inputs
mkdir -p databases
Download relevant data¶
First, download a database and taxonomic information¶
Here, we know the spike-in is a pathogenic seqevar of Ralstonia. We will download a database containing signatures of 27 Ralstonia genomes (pathogenic and not) and the corresponding taxonomic and lingroup information.
# database
curl -JLO https://osf.io/vxsta/download
mv ralstonia*.zip ./databases/ralstonia.zip
# taxonomy csv
curl -JLO https://raw.githubusercontent.com/bluegenes/2023-demo-sourmash-LIN/main/databases/ralstonia-lin.taxonomy.GCA-GCF.csv
mv ralstonia-lin.taxonomy.GCA-GCF.csv ./databases
# lingroup csv
curl -JLO https://raw.githubusercontent.com/bluegenes/2023-demo-sourmash-LIN/main/inputs/ralstonia.lingroups.csv
mv ralstonia.lingroups.csv ./databases
ls databases # look at the database files
Next, download pre-made sourmash signatures made from the input metagenomes¶
# download barcode 1 sig
curl -JLO https://osf.io/ujntr/download
mv barcode1_22142.sig.zip ./inputs/
# download barcode 3 signature
curl -JLO https://osf.io/2h9wx/download
mv barcode3_31543.sig.zip ./inputs
# download barcode 5 signature
curl -JLO https://osf.io/k8nw5/download
mv barcode5_36481.sig.zip ./inputs
# look at available input files
ls inputs
Look at the signatures¶
Let’s start with the barcode1
(highest spike-in) sample
First, let’s look at the metagenome signature.¶
By running sourmash sig fileinfo
, we can see information on the signatures available within the zip file.
Here, you can see I’ve generated the metagenome signature with scaled=1000
and built two ksizes, k=31
and k=51
Run:
sourmash sig fileinfo ./inputs/barcode1_22142.sig.zip
In the output, you should see:
** loading from './inputs/barcode1_22142.sig.zip'
path filetype: ZipFileLinearIndex
location: /home/jovyan/smash_lin/inputs/barcode1_22142.sig.zip
is database? yes
has manifest? yes
num signatures: 2
total hashes: 914328
summary of sketches:
1 sketches with DNA, k=31, scaled=1000, abund 426673 total hashes
1 sketches with DNA, k=51, scaled=1000, abund 487655 total hashes
We can also look at the database¶
Here, you can see I’ve generated the database with scaled=1000
and built three ksizes, k=21
, k=31
and k=51
Run:
sourmash sig fileinfo ./databases/ralstonia.zip
In the output, you should see:
** loading from './databases/ralstonia.zip'
path filetype: ZipFileLinearIndex
location: /home/jovyan/databases/ralstonia.zip
is database? yes
has manifest? yes
num signatures: 81
** examining manifest...
total hashes: 445041
summary of sketches:
27 sketches with DNA, k=21, scaled=1000, abund 148324 total hashes
27 sketches with DNA, k=31, scaled=1000, abund 148111 total hashes
27 sketches with DNA, k=51, scaled=1000, abund 148606 total hashes
There’s a lot of things to digest in this output but the two main ones are:
there are 27 genomes represented in this database, each of which are sketched at k=21,k=31,k=51
this database represents ~445 million k-mers (multiply number of hashes by the scaled number)
Run sourmash gather using ksize 51¶
Now let’s run sourmash gather
to find the closest reference genome(s) in the database.
If you want to read more about what sourmash is doing, please see Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers, Irber et al., 2022.
Run:
query="inputs/barcode1_22142.sig.zip"
database="databases/ralstonia.zip"
gather_csv_output="barcode1_22141.k51.gather.csv"
sourmash gather $query $database -k 51 -o $gather_csv_output
You should see the following output:
selecting specified query k=51
loaded query: barcode1_22142... (k=51, DNA)
--ading from 'databases/ralstonia.zip'...
loaded 81 total signatures from 1 locations.
after selecting signatures compatible with search, 27 remain.
Starting prefetch sweep across databases.
Found 7 signatures via prefetch; now doing gather.
overlap p_query p_match avg_abund
--------- ------- ------- ---------
105.0 kbp 0.0% 2.0% 1.0 GCA_002251655.1 Ralstonia solanacear...
found less than 50.0 kbp in common. => exiting
found 1 matches total;
the recovered matches hit 0.0% of the abundance-weighted query.
the recovered matches hit 0.0% of the query k-mers (unweighted).
The first step of gather found all potential matches (7), and the greedy algorithm narrowed this to a single best match, GCA_002251655.1
which shared an estimated 105 kbp with the metagenome (a very small percentage of the total dataset.) This is expected, though, since the dataset is a plant metagenome with a small Ralstonia
spike-in.
Add taxonomic information and summarize up lingroups¶
sourmash gather
finds the smallest set of reference genomes that contains all the known information (k-mers) in the metagenome. In most cases, gather
will find many metagenome matches. Here, we’re only looking for Ralstonia
matches and we only have a single gather result. Regardless, let’s use sourmash tax metagenome
to add taxonomic information and see if we’ve correctly assigned the pathogenic sequence.
First, let’s look at the relevant taxonomy files.¶
These commands will show the first few lines of each file. If you prefer, you can look at a more human-friendly view by opening the files in a spreadsheet program.
taxonomy_csv:
databases/ralstonia-lin.taxonomy.GCA-GCF.csv
the essential columns are
lin
(14;1;0;...
) andident
(GCF_00
…)
lingroups information:
databases/ralstonia.lingroups.csv
both columns are essential (
name
,lin
)
Look at the taxonomy file:
head -n 5 databases/ralstonia-lin.taxonomy.GCA-GCF.csv
You should see:
lin,species,strain,filename,accession,ident
14;1;0;0;0;0;0;0;0;0;6;0;1;0;1;0;0;0;0;0,Ralstonia solanacearum,OE1_1,GCF_001879565.1_ASM187956v1_genomic.fna,GCF_001879565.1,GCF_001879565.1
14;1;0;0;0;0;0;0;0;0;6;0;1;0;0;0;0;0;0;0,Ralstonia solanacearum,PSS1308,GCF_001870805.1_ASM187080v1_genomic.fna,GCF_001870805.1,GCF_001870805.1
14;1;0;0;0;0;0;0;0;0;2;1;0;0;0;0;0;0;0;0,Ralstonia solanacearum,FJAT_1458,GCF_001887535.1_ASM188753v1_genomic.fna,GCF_001887535.1,GCF_001887535.1
14;1;0;0;0;0;0;0;0;0;2;0;0;4;4;0;0;0;0;0,Ralstonia solanacearum,Pe_13,GCF_012062595.1_ASM1206259v1_genomic.fna,GCF_012062595.1,GCF_012062595.1
The key columns are:
ident
, containing identifiers matching the database sketches
lin
, containing the species information.
Now, let’s look at the lingroups file
head -n5 databases/ralstonia.lingroups.csv
You should see:
name,lin
Phyl II,14;1;0;0;0;3;0
Phyl IIA,14;1;0;0;0;3;0;1;0;0
Phyl IIB,14;1;0;0;0;3;0;0
Phyl IIB seq1 and seq2,14;1;0;0;0;3;0;0;0;0;1;0;0;0;0
Here, we have two columns:
name
- the name for each lingroup.
lin
- the LIN prefix corresponding to each group.
Now, run sourmash tax metagenome
to integrate taxonomic information into gather
results¶
Using the gather
output we generated above, we can integrate taxonomic information and summarize up “ranks” (lin positions). We can produce several different types of outputs, including a lingroup
report.
lingroup
format summarizes the taxonomic information at each lingroup
, and produces a report with 4 columns:
name
(from lingroups file)lin
(from lingroups file)percent_containment
- total % of the file matched to this lingroupnum_bp_contained
- estimated number of bp matched to this lingroup
Since sourmash assigns all k-mers to individual genomes, no reads/base pairs are “assigned” to higher taxonomic ranks or lingroups (as with Kraken-style LCA). Here, “percent_containment” and “num_bp_contained” is calculated by summarizing the assignments made to all genomes in a lingroup. This is akin to the “contained” information in Kraken-style reports.
Run tax metagenome
:
gather_csv_output="barcode1_22141.k51.gather.csv"
taxonomy_csv="databases/ralstonia-lin.taxonomy.GCA-GCF.csv"
lingroups_csv="databases/ralstonia.lingroups.csv"
sourmash tax metagenome -g $gather_csv_output -t $taxonomy_csv \
--lins --lingroup $lingroups_csv
You should see:
loaded 1 gather results from 'barcode1_22141.k51.gather.csv'.
loaded results for 1 queries from 1 gather CSVs
Starting summarization up rank(s): 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
Read 11 lingroup rows and found 11 distinct lingroup prefixes.
and the results:
name lin percent_containment num_bp_contained
Phyl II 14;1;0;0;0;3;0 0.02 108000
Phyl IIB 14;1;0;0;0;3;0;0 0.02 108000
Phyl IIB seq1 and seq2 14;1;0;0;0;3;0;0;0;0;1;0;0;0;0 0.02 108000
IIB seq1 14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;0;0 0.02 108000
Here, the most specific lingroup we assign to is Phyl IIB seq1
, which is the pathogenic lingroup that was spiked in, yay! Note that the other groups in the output all contain this group.
Now output the lingroup report to a file (instead of to the terminal)¶
use -o
to provide an output basename for taxonomic output.
gather_csv_output="barcode1_22141.k51.gather.csv"
taxonomy_csv="databases/ralstonia-lin.taxonomy.GCA-GCF.csv"
lingroups_csv="databases/ralstonia.lingroups.csv"
sourmash tax metagenome -g $gather_csv_output -t $taxonomy_csv \
--lins --lingroup $lingroups_csv \
-o "barcode1"
You should see
saving 'lingroup' output to 'barcode1.lingroup.tsv'
in the output.
Optionally, write multiple output formats¶
You can use -F
to specify additional output formats. Here, I’ve added csv_summary
. Note that while the lingroup
format will be generated automatically if you specify the --lingroup
file, you can also specify it with -F lingroup
if you want, as I’ve done here.
Run:
gather_csv_output="barcode1_22141.k51.gather.csv"
taxonomy_csv="databases/ralstonia-lin.taxonomy.GCA-GCF.csv"
lingroups_csv="databases/ralstonia.lingroups.csv"
sourmash tax metagenome -g $gather_csv_output -t $taxonomy_csv \
--lins --lingroup $lingroups_csv \
-F lingroup csv_summary -o "barcode1"
You should see the following in the output:
saving 'csv_summary' output to 'barcode1.summarized.csv'.
saving 'lingroup' output to 'barcode1.lingroup.txt'.
The csv_summary
format is the full summary of this sample, e.g. the summary at each taxonomic rank (LIN position). It also includes an entry with the unclassified
portion at each rank.
Note: Multiple output formats require the
-o
--output-base
to be specified, as each must be written to a file.
Here’s an abbreviated version of the gather
results for barcode1
, with lingroup information added:
ksize |
scaled |
best overlap |
gather match(es) |
lingroup |
lin |
|
---|---|---|---|---|---|---|
bc1 |
51 |
1000 |
105 kb |
GCA_002251655.1 |
IIB seq1 |
14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;0;0 |
bc1 |
31 |
1000 |
173 kb |
GCA_002251655.1 |
IIB seq1 |
14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;0;0 |
Now run with barcode3
sample¶
sourmash gather¶
Run:
query="inputs/barcode3_31543.sig.zip"
database="databases/ralstonia.zip"
gather_csv_output="barcode3_31543.dna.k51.gather.csv"
sourmash gather $query $database -k 51 -o $gather_csv_output
You should see:
selecting specified query k=51
loaded query: barcode3_31543... (k=51, DNA)
loading from 'databases/ralstonia.zip'...
loaded 81 total signatures from 1 locations.
after selecting signatures compatible with search, 27 remain.
Starting prefetch sweep across databases.
Found 0 signatures via prefetch; now doing gather.
found less than 50.0 kbp in common. => exiting
found 0 matches total;
the recovered matches hit 0.0% of the query k-mers (unweighted).
gather found no sequence matches! But, we can lower the detection threshold:¶
query="inputs/barcode3_31543.sig.zip"
database="databases/ralstonia.zip"
gather_csv_output="barcode3_31543.k51.gather.csv"
# use a 10kb detection threshold
sourmash gather $query $database -k 51 --threshold-bp 10000 -o $gather_csv_output
This time, you should see:
selecting specified query k=51
loaded query: barcode3_31543... (k=51, DNA)
loading from 'databases/ralstonia.zip'...
loaded 81 total signatures from 1 locations.
after selecting signatures compatible with search, 27 remain.
Starting prefetch sweep across databases.
Found 6 signatures via prefetch; now doing gather.
overlap p_query p_match avg_abund
--------- ------- ------- ---------
12.0 kbp 0.0% 0.2% 1.0 GCA_000750575.1 Ralstonia solanacear...
found 1 matches total;
the recovered matches hit 0.0% of the abundance-weighted query.
the recovered matches hit 0.0% of the query k-mers (unweighted).
You’ll notice that while we have an estimated ~12kbp overlap, the matched genome (GCA_000750575.1
) is different from the one matched above for barcode5
. If you run sourmash tax metagenome
on this output, you’ll see that this genome belongs to Phyl IIB seq 2
group, which is a sister group to the correct Phyl IIB seq 1
group that we expected. So we have a match but it’s not the right one – why not?
What happened? Use prefetch
to investigate¶
sourmash gather
has two steps: first, it runs a prefetch
to find ALL genome matches, and then uses a greedy approach to select the smallest set of genomes that contain (‘cover’) all known sequence content. Let’s run prefetch
independently so we can look at the results of the first step. Here, let’s use --threshold-bp 0
to get all possible matches.
Run:
query="inputs/barcode3_31543.sig.zip"
prefetch_csv_output="barcode3_31543.k51.prefetch.csv"
database="databases/ralstonia.zip"
sourmash prefetch $query $database -k 51 --threshold-bp 0 -o $prefetch_csv_output
You should see:
selecting specified query k=51
loaded query: barcode3_31543... (k=51, DNA)
query sketch has scaled=1000; will be dynamically downsampled as needed.
--tal of 10 matching signatures so far.tonia.zip'
loaded 81 total signatures from 1 locations.
after selecting signatures compatible with search, 27 remain.
--
total of 15 matching signatures.
saved 15 matches to CSV file 'barcode3_31543.k51.prefetch.csv'
of 487043 distinct query hashes, 12 were found in matches above threshold.
a total of 487031 query hashes remain unmatched.
final scaled value (max across query and all matches) is 1000
Here, the output is telling us we found matches to 15 of the 27 Ralstonia genomes. But only 12 k-mers were shared between the metagenome sample and the genomes. Remember that sourmash uses a representative subsample of all k-mers, so here these 12 k-mers represent ~ 12kb of sequence (12 * scaled). We’ve found that this is sufficient to detect presence of an organism, but at this low level, it can be hard to distinguish between closely-related genomes. Let’s open the prefetch output to see how those 12 k-mers matched between different genomes.
Look at the barcode3_31543.k51.prefetch.csv
file¶
Use a spreadsheet program on your computer or use
less -S barcode3_31543.k51.prefetch.csv
to see the file on the terminal. If usingless
, hitq
when you want to exit and return to your terminal prompt.
The first column contains the estimated number of base pairs matched between our query and each matching reference genome. You’ll notice there are four genomes that match 12kb of sequence, one of which is the “correct” genome (GCA_002251655.1
, which is in the IIB seq1
lingroup).
What is happening here?
When faced with equally good matches, sourmash gather
makes a random choice about which genome to assign these k-mers to. This happens primarily with highly similar genomes and/or very small sequence matches. If this happens and you need to distinguish between these genomes, we recommend trying a lower scaled value (higher resolution). “scaled” refers to the systematic downsampling: we keep rougly 1/scaled k-mers (scaled=1000
keeps ~1 of every 1000 unique k-mers). scaled=1
keeps all k-mers, but our signature storage is not optimized for this use case.
To see if we could robustly assign the correct sequevar for barcode3
using a higher resolution sketch, I also ran gather
using scaled=100
.
Here’s an abbreviated version of the gather
results for barcode3
, with lingroup information added:
ksize |
scaled |
best overlap |
gather match(es) |
lingroup |
lin |
|
---|---|---|---|---|---|---|
bc3 |
51 |
1000 |
12kb |
GCA_000750575.1 |
IIB seq2 |
14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;1;0 |
bc3 |
31 |
1000 |
28 kb |
GCA_002251655.1 |
IIB seq1 |
14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;0;0 |
bc3 |
51 |
100 |
14.8 kb |
GCA_002251655.1 |
IIB seq1 |
14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;0;0 |
bc3 |
31 |
100 |
21.1 kb |
GCA_002251655.1 |
IIB seq1 |
14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;0;0 |
We typically use k=51 for strain-level matching and k=31 for species-level matching. Notice that running at k=31 with scaled 1000 found the right match. However, if you run prefetch for k=31
, you see there are three matches with 28kb
overlap, so we just got lucky that gather
selected the right one for this test case.
In contrast, by sketching the Ralstonia
genomes and metagenome at higher resolution (scaled=100
), we had sufficient information to correctly assign the sequence to the IIB seq1
lingroup at either ksize.
Now try the barcode5
sample¶
You can also run the barcode5
file using the same commands as above:
query="inputs/barcode5_36481.sig.zip"
database="databases/ralstonia.zip"
gather_csv_output="barcode5_36481.dna.k51.gather.csv"
sourmash gather $query $database -k 51 -o $gather_csv_output
You should see:
selecting specified query k=51
loaded query: barcode5_36481... (k=51, DNA)
--
loaded 81 total signatures from 1 locations.
after selecting signatures compatible with search, 27 remain.
Starting prefetch sweep across databases.
Found 0 signatures via prefetch; now doing gather.
found less than 50.0 kbp in common. => exiting
found 0 matches total;
the recovered matches hit 0.0% of the query k-mers (unweighted).
No matches are found. If you drop the threshold-bp to 0 (--threshold-bp 0
), you can find ~1kbp overlap (a single k-mer match!). Note, we do not recommend trusting/using results with fewer than 3 k-mer matches (3kbp at scaled=1000). Especially in larger databases (e.g. NCBI/GTDB), a single k-mer match might actually be from contamination in the reference genome rather than true genome content, so you may end up assigning the wrong lineage. Requiring 3 k-mers (representing ~3kb of matching sequence) makes it more likely your matches represent true genome content.
I then ran this file at higher resolution to see how the results changed. In each case, very few k-mers matched and we could not robustly identify a specific Ralstonia
genome or lingroup. As it turns out, barcode5
does not have a Ralstonia
spike-in, so this is a good thing!
Here’s an abbreviated version of the gather
results for barcode5
, with lingroup information added in cases with a single gather match:
ksize |
scaled |
best overlap |
gather match(es) |
lingroup |
lin |
|
---|---|---|---|---|---|---|
bc5 |
51 |
1000 |
1 kbp |
GCA_000750575.1 |
IIB seq2 |
14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;1;0 |
bc5 |
31 |
1000 |
0 |
N/A |
||
bc5 |
51 |
100 |
300bp |
all |
||
bc5 |
31 |
100 |
1.2 kb |
all |
||
bc5 |
51 |
10 |
120 bp |
all |
||
bc5 |
31 |
10 |
670 bp |
all |
||
bc5 |
51 |
5 |
150 bp |
all |
||
bc5 |
31 |
5 |
500 bp |
all |
Again, while I’ve used a threshold-bp of 0 to get the gather match at scaled=1000, we do not typically trust gather matches with less than 3*scaled
overlap (< 3 k-mers matched). Even at very high resolution (scaled=5), we matched nearly all Ralstonia genomes and could not distinguish a single lingroup.
We typically recommend running at scaled=1000
(our default), as this works for most microbial use cases. You can run at higher resolution (lower scaled) if you need to, but higher resolution signatures are larger and can take significantly longer to build and search - use at your own risk :).
Summary and concluding thoughts¶
The LIN taxonomic framework may be useful distinguishing groups below the species level.
We can now use LINs and lingroups with sourmash tax metagenome
. For low level matches, the gather greedy
approach can struggle. We are working on ways to better warn users about this behavior and welcome
feedback and suggestions on our issue tracker.