Quickstart¶
Minimal example¶
Say we have the following:
seqs.fasta
: The multiply aligned reference sequences in FASTA format.tree.nwk
: The tree built from reference alignment, in Newick format.tree_stats.txt
: The log file from FastTree or the statistics file from RAxML/phyml.
Build a reference package my.refpkg
for a locus locus_name
(e.g. 16s_rRNA
) in a single command as follows:
taxit create -l 16s_rRNA -P my.refpkg \
--aln-fasta seqs.fasta \
--tree-stats tree_stats.txt \
--tree-file tree.nwk
Taxonomically equipped reference package example¶
The fun really begins when we incorporate taxonomic information. Say in addition to the above files you also have:
tax_ids.txt
: A list of all the tax_ids that the sequences in your refpkg use, one per line in a text file. Blank lines and Python style comments in this file will be ignored, and repeated entries are only considered once.seq_info.csv
: A CSV file describing each of the aligned sequences. It begins with one line giving the field names:"seqname","accession","tax_id","species_name","is_type"
This is followed by one line for each sequence in your multiple alignment. Only
seqname
andtax_id
are required.seqname
should match the ID of the sequence in the FASTA file and must be unique.accession
provides an additional identifier for the sequence, which can (but need not) be the same asseqname
. In our work,seqname
is often an RDP accession number andaccession
is the NCBI accession number corresponding to that RDP entry (note that an NCBI accession number may not uniquely identify a sequence, as in tha case of a genomic sequence with multiple 16S rRNA gene loci).``tax_id`` is the entry in the taxonomy this sequence is mapped to, andspecies_name
is the name associated with that entry.is_type
indicates whether this sequence is from a type strain or not (should beTRUE
orFALSE
). If you aren’t dealing with typestrains, just set it toFALSE
for all your sequences.
Now there are a few additional steps to construct a taxonomically-annotated refpkg.
First we have to represent the tax_ids as a taxonomy. You will need a copy of the NCBI taxonomy in a format taxtastic understands in order to extract the minimal taxonomy containing the tax_ids in tax_ids.txt
. Run:
taxit new_database taxonomy.db
This will download zipped files containing the NCBI taxonomy into the current directory and load the data into an sqlite databse named taxonomy.db
. This takes a while! Plan to go get a cup of coffee. Then run:
taxit taxtable taxonomy.db -f tax_ids.txt -o taxa.csv
The output, taxa.csv
, is a CSV file containing the minimum subtaxonomy of taxonomy.db
which contains all the entries in tax_ids.txt
.
Now we are ready to build the refpkg:
taxit create -l locus_name -P my.refpkg \
--taxonomy taxa.csv \
--aln-fasta seqs.fasta \
--seq-info seq_info.csv \
--tree-stats tree_stats.txt \
--tree-file tree.nwk
You may also want to add some metadata fields using the options --author
, --description
, and --package-version
, but these are optional. In addition, you may package in profile information and a Stockholm alignment:
--aln-sto seqs.sto \
--profile align_profile
where align_profile
is a profile that can be used to align query sequences using HMMER3 or Infernal, and seqs.sto
is the reference alignment in Stockholm format. This is handy if you plan to use cmalign
(in the Infernal suite of tools) or hmmalign
(in the HMMER3 suite of tools) to perform your alignments.