Quickstart

Minimal example

Say we have the following:

  • seqs.fasta: The multiply aligned reference sequences in FASTA format.

  • tree.nwk: The tree built from the reference alignment, in Newick format.

  • tree_stats.txt: The log file from FastTree or the statistics file from RAxML/phyml.

Build a reference package my.refpkg for a locus locus_name (e.g. 16s_rRNA) in a single command as follows:

taxit create -l 16s_rRNA -P my.refpkg \
    --aln-fasta seqs.fasta \
    --tree-stats tree_stats.txt \
    --tree-file tree.nwk
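
Because seqs.fasta is expected to be a multiple alignment, it can be worth confirming that every sequence has the same length before building the package. The following is a minimal sketch of such a pre-flight check, not part of taxtastic; the helper names read_fasta and check_aligned are hypothetical, and the parser assumes the sequence ID is the first whitespace-delimited token of each header line.

```python
from pathlib import Path


def read_fasta(path):
    """Return {sequence_id: sequence} for a FASTA file.

    Assumption: the ID is the first whitespace-delimited token after '>'.
    """
    seqs = {}
    current = None
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("header line without a sequence ID")
            current = parts[0]
            if current in seqs:
                raise ValueError(f"duplicate sequence ID: {current}")
            seqs[current] = ""
        elif current is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            # Sequences may be wrapped over several lines; join them.
            seqs[current] += line
    return seqs


def check_aligned(seqs):
    """Return the common alignment width, or raise ValueError if lengths differ."""
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) > 1:
        raise ValueError(f"sequences differ in length: {sorted(lengths)}")
    return lengths.pop() if lengths else 0
```

Calling check_aligned(read_fasta("seqs.fasta")) returns the alignment width when the file is consistent and raises ValueError otherwise.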

Taxonomically equipped reference package example

The fun really begins when we incorporate taxonomic information. Say in addition to the above files you also have:

  • tax_ids.txt: A text file listing all the tax_ids that the sequences in your refpkg use, one per line. Blank lines and Python-style comments in this file are ignored, and repeated entries are counted only once.

  • seq_info.csv: A CSV file describing each of the aligned sequences. It begins with one line giving the field names:

    "seqname","accession","tax_id","species_name","is_type"
    

    This is followed by one line for each sequence in your multiple alignment. Only seqname and tax_id are required. seqname should match the ID of the sequence in the FASTA file and must be unique. accession provides an additional identifier for the sequence, which can (but need not) be the same as seqname. In our work, seqname is often an RDP accession number and accession is the NCBI accession number corresponding to that RDP entry (note that an NCBI accession number may not uniquely identify a sequence, as in the case of a genomic sequence with multiple 16S rRNA gene loci). tax_id is the entry in the taxonomy that this sequence is mapped to, and species_name is the name associated with that entry. is_type indicates whether this sequence is from a type strain (it should be TRUE or FALSE). If you aren’t dealing with type strains, just set it to FALSE for all your sequences.
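
The rules for these two files can be checked before running taxit. The sketch below is an illustration under stated assumptions, not taxtastic's own parser: read_tax_ids and check_seq_info are hypothetical helper names, and a comment is assumed to run from "#" to the end of the line.

```python
import csv


def read_tax_ids(path):
    """Return the unique tax_ids in file order.

    Assumption: a comment runs from '#' to the end of the line.
    Blank lines are skipped and repeated entries are kept once.
    """
    seen = {}
    with open(path) as fh:
        for line in fh:
            tax_id = line.split("#", 1)[0].strip()
            if tax_id:
                seen.setdefault(tax_id, None)
    return list(seen)


def check_seq_info(path):
    """Return a list of human-readable problems found in seq_info.csv."""
    problems = []
    names = set()
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        missing = {"seqname", "tax_id"} - set(reader.fieldnames or [])
        if missing:
            return [f"missing required column(s): {sorted(missing)}"]
        for row in reader:
            where = f"line {reader.line_num}"
            name = (row["seqname"] or "").strip()
            if not name:
                problems.append(f"{where}: empty seqname")
            elif name in names:
                problems.append(f"{where}: duplicate seqname {name!r}")
            names.add(name)
            if not (row["tax_id"] or "").strip():
                problems.append(f"{where}: empty tax_id")
            # is_type is optional, but when present should be TRUE or FALSE.
            is_type = row.get("is_type")
            if is_type is not None and is_type not in ("TRUE", "FALSE"):
                problems.append(f"{where}: is_type is {is_type!r}")
    return problems
```

An empty list from check_seq_info("seq_info.csv") means none of these particular problems were found; it does not confirm that each seqname actually occurs in the FASTA file.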

Now there are a few additional steps to construct a taxonomically-annotated refpkg.

First we have to represent the tax_ids as a taxonomy. You will need a copy of the NCBI taxonomy in a format taxtastic understands in order to extract the minimal taxonomy containing the tax_ids in tax_ids.txt. Run:

taxit new_database taxonomy.db

This will download zipped files containing the NCBI taxonomy into the current directory and load the data into an SQLite database named taxonomy.db. This takes a while! Plan to go get a cup of coffee. Then run:

taxit taxtable taxonomy.db -f tax_ids.txt -o taxa.csv

The output, taxa.csv, is a CSV file containing the minimal subtaxonomy of taxonomy.db that contains all the entries in tax_ids.txt.
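
A quick way to confirm that taxa.csv covers every requested entry is to compare its identifiers against your list. The sketch below assumes that taxa.csv has a header row with a column named tax_id; that column name is an assumption here, so check it against your own output. The function name missing_from_taxtable is hypothetical.

```python
import csv


def missing_from_taxtable(tax_ids, taxa_csv, id_column="tax_id"):
    """Return the tax_ids that do not appear in the taxonomy table.

    Assumption: taxa_csv has a header row containing `id_column`.
    """
    with open(taxa_csv, newline="") as fh:
        reader = csv.DictReader(fh)
        if id_column not in (reader.fieldnames or []):
            raise ValueError(f"{taxa_csv} has no {id_column!r} column")
        present = {row[id_column] for row in reader}
    return sorted(set(tax_ids) - present)
```

An empty list means every requested tax_id was found in the table.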

Now we are ready to build the refpkg:

taxit create -l locus_name -P my.refpkg \
    --taxonomy taxa.csv \
    --aln-fasta seqs.fasta \
    --seq-info seq_info.csv \
    --tree-stats tree_stats.txt \
    --tree-file tree.nwk

You may also want to add some metadata fields using the options --author, --description, and --package-version, but these are optional. In addition, you may package in profile information and a Stockholm alignment:

--aln-sto seqs.sto \
--profile align_profile

where align_profile is a profile that can be used to align query sequences using HMMER3 or Infernal, and seqs.sto is the reference alignment in Stockholm format. This is handy if you plan to use cmalign (in the Infernal suite of tools) or hmmalign (in the HMMER3 suite of tools) to perform your alignments.
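
When the command is generated from a script, it can help to assemble the argument list in one place so that optional inputs are only included when supplied. The following is a minimal sketch that uses only the options named in this section; the function taxit_create_args is a hypothetical helper, and it builds the command without running it.

```python
import shlex


def taxit_create_args(locus, package, *, aln_fasta, tree_file, tree_stats,
                      taxonomy=None, seq_info=None, aln_sto=None,
                      profile=None, author=None, description=None,
                      package_version=None):
    """Build the argument list for `taxit create` from the given inputs."""
    args = ["taxit", "create", "-l", locus, "-P", package,
            "--aln-fasta", aln_fasta,
            "--tree-stats", tree_stats,
            "--tree-file", tree_file]
    optional = {
        "--taxonomy": taxonomy,
        "--seq-info": seq_info,
        "--aln-sto": aln_sto,
        "--profile": profile,
        "--author": author,
        "--description": description,
        "--package-version": package_version,
    }
    # Append only the options that were actually supplied.
    for flag, value in optional.items():
        if value is not None:
            args += [flag, str(value)]
    return args


if __name__ == "__main__":
    args = taxit_create_args(
        "locus_name", "my.refpkg",
        aln_fasta="seqs.fasta", tree_file="tree.nwk",
        tree_stats="tree_stats.txt",
        taxonomy="taxa.csv", seq_info="seq_info.csv",
        aln_sto="seqs.sto", profile="align_profile",
    )
    # Print a shell-quoted command line for inspection.
    print(shlex.join(args))
```

The returned list can be passed to subprocess.run(args, check=True) once the printed command looks right.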