Quickstart
==========


Minimal example
---------------

Say we have the following:

* ``seqs.fasta``: The multiply aligned reference sequences in FASTA format.
* ``tree.nwk``: The tree built from the reference alignment, in Newick format.
* ``tree_stats.txt``: The log file from FastTree or the statistics file from RAxML/PhyML.

Build a reference package ``my.refpkg`` for a locus ``locus_name`` (e.g. ``16s_rRNA``) in a single command as follows::

    taxit create -l 16s_rRNA -P my.refpkg \
        --aln-fasta seqs.fasta \
        --tree-stats tree_stats.txt \
        --tree-file tree.nwk
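
Because ``seqs.fasta`` must be a multiple alignment, every record in it should have the same length. The following is a minimal Python sketch of a pre-flight check, not part of taxtastic itself; the helper names ``read_fasta`` and ``check_alignment`` are hypothetical, and the parser assumes plain FASTA with the sequence ID as the first token of each header line::

    def read_fasta(lines):
        """Yield (seq_id, sequence) pairs from an iterable of FASTA lines."""
        seq_id, chunks = None, []
        for line in lines:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if seq_id is not None:
                    yield seq_id, "".join(chunks)
                # The ID is the first whitespace-delimited token after ">".
                header = line[1:].split()
                seq_id, chunks = (header[0] if header else ""), []
            else:
                chunks.append(line)
        if seq_id is not None:
            yield seq_id, "".join(chunks)


    def check_alignment(lines):
        """Return the shared alignment width, or raise ValueError."""
        lengths = {}
        for seq_id, seq in read_fasta(lines):
            if seq_id in lengths:
                raise ValueError("duplicate sequence ID: " + seq_id)
            lengths[seq_id] = len(seq)
        if not lengths:
            raise ValueError("no sequences found")
        widths = set(lengths.values())
        if len(widths) != 1:
            raise ValueError("sequences differ in length: %s" % sorted(widths))
        return widths.pop()


    # Illustrative in-memory data; both records are five columns wide.
    demo = [">seq1 example", "AC-GT", ">seq2", "ACG", "GT"]
    print(check_alignment(demo))  # prints 5

To check a real file, pass an open handle, for example ``check_alignment(open("seqs.fasta"))``.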


Taxonomically equipped reference package example
------------------------------------------------

The fun really begins when we incorporate taxonomic information.
Say in addition to the above files you also have:

* ``tax_ids.txt``: A text file listing all the tax_ids that the sequences in your refpkg use, one per line.  Blank lines and Python-style comments in this file are ignored, and repeated entries are counted only once.
* ``seq_info.csv``: A CSV file describing each of the aligned sequences.  It begins with one line giving the field names::

      "seqname","accession","tax_id","species_name","is_type"

  This is followed by one line for each sequence in your multiple alignment. Only ``seqname`` and ``tax_id`` are required. ``seqname`` should match the ID of the sequence in the FASTA file and must be unique. ``accession`` provides an additional identifier for the sequence, which can (but need not) be the same as ``seqname``.  In our work, ``seqname`` is often an RDP accession number and ``accession`` is the NCBI accession number corresponding to that RDP entry (note that an NCBI accession number may not uniquely identify a sequence, as in the case of a genomic sequence with multiple 16S rRNA gene loci). ``tax_id`` is the entry in the taxonomy that this sequence is mapped to, and ``species_name`` is the name associated with that entry. ``is_type`` indicates whether this sequence is from a type strain (it should be ``TRUE`` or ``FALSE``).  If you aren't dealing with type strains, just set it to ``FALSE`` for all your sequences.
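
The constraints on ``seq_info.csv`` described above (required ``seqname`` and ``tax_id`` columns, unique ``seqname`` values that match the FASTA IDs) can be checked before building the package. The following is a rough Python sketch using only the standard library; the function names are hypothetical helpers, not taxtastic APIs, and the sample rows are made up for illustration. It also shows one way to derive the contents of ``tax_ids.txt`` from the ``tax_id`` column::

    import csv
    import io
    from collections import Counter

    REQUIRED = {"seqname", "tax_id"}


    def load_seq_info(handle):
        """Read seq_info rows, checking required columns and unique seqnames."""
        reader = csv.DictReader(handle)
        missing = REQUIRED - set(reader.fieldnames or [])
        if missing:
            raise ValueError("missing columns: " + ", ".join(sorted(missing)))
        rows = list(reader)
        counts = Counter(row["seqname"] for row in rows)
        dupes = sorted(name for name, n in counts.items() if n > 1)
        if dupes:
            raise ValueError("duplicate seqname values: " + ", ".join(dupes))
        return rows


    def compare_with_fasta(rows, fasta_ids):
        """Return (IDs only in the FASTA, seqnames only in seq_info)."""
        info_ids = {row["seqname"] for row in rows}
        fasta_ids = set(fasta_ids)
        return sorted(fasta_ids - info_ids), sorted(info_ids - fasta_ids)


    def unique_tax_ids(rows):
        """Return tax_ids in first-seen order, each listed once."""
        return list(dict.fromkeys(row["tax_id"] for row in rows))


    # Illustrative data only; a real file would be opened with
    # open("seq_info.csv", newline="").
    demo = io.StringIO(
        '"seqname","accession","tax_id","species_name","is_type"\n'
        '"s1","A1","562","Escherichia coli","FALSE"\n'
        '"s2","A2","1280","Staphylococcus aureus","TRUE"\n'
    )
    rows = load_seq_info(demo)
    print(compare_with_fasta(rows, ["s1", "s3"]))
    print("\n".join(unique_tax_ids(rows)))

Both lists returned by ``compare_with_fasta`` should be empty when the CSV and the alignment agree. The lines printed last are suitable for writing to ``tax_ids.txt``.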

A few additional steps are needed to construct a taxonomically annotated refpkg.

First we have to represent the tax_ids as a taxonomy.  You will need a copy of the NCBI taxonomy in a format taxtastic understands in order to extract the minimal taxonomy containing the tax_ids in ``tax_ids.txt``.  Run::

    taxit new_database taxonomy.db

This will download zipped files containing the NCBI taxonomy into the current directory and load the data into an SQLite database named ``taxonomy.db``.  This takes a while!  Plan to go get a cup of coffee.  Then run::

    taxit taxtable taxonomy.db -f tax_ids.txt -o taxa.csv

The output, ``taxa.csv``, is a CSV file containing the minimal subtaxonomy of ``taxonomy.db`` that contains all the entries in ``tax_ids.txt``.
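
As a sanity check, you may want to confirm that every entry in ``tax_ids.txt`` actually appears in ``taxa.csv``. The sketch below is a hedged illustration rather than taxtastic code: ``read_tax_ids`` mirrors the parsing rules described earlier (blank lines and ``#`` comments skipped, repeats dropped), and ``missing_from_taxtable`` assumes that ``taxa.csv`` has a header row with a ``tax_id`` column, which you should verify against your own output. Both function names and the sample data are hypothetical::

    import csv
    import io


    def read_tax_ids(lines):
        """Return unique tax_ids, skipping blank lines and '#' comments."""
        ids = []
        for line in lines:
            entry = line.split("#", 1)[0].strip()
            if entry and entry not in ids:
                ids.append(entry)
        return ids


    def missing_from_taxtable(tax_ids, taxtable_handle, column="tax_id"):
        """Return the tax_ids that have no row in the taxonomy table."""
        reader = csv.DictReader(taxtable_handle)
        # Assumption: the table names its ID column "tax_id".
        if column not in (reader.fieldnames or []):
            raise ValueError("no %r column in taxonomy table" % column)
        present = {row[column] for row in reader}
        return [tax_id for tax_id in tax_ids if tax_id not in present]


    # Illustrative in-memory stand-ins for tax_ids.txt and taxa.csv.
    ids = read_tax_ids(["# reference set\n", "562\n", "\n", "1280\n", "562\n"])
    table = io.StringIO('"tax_id","rank"\n"562","species"\n')
    print(missing_from_taxtable(ids, table))

An empty list means every requested tax_id was found. A non-empty list does not necessarily indicate an error in your input, so treat it as a prompt to inspect those IDs by hand.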

Now we are ready to build the refpkg::

    taxit create -l locus_name -P my.refpkg \
        --taxonomy taxa.csv \
        --aln-fasta seqs.fasta \
        --seq-info seq_info.csv \
        --tree-stats tree_stats.txt \
        --tree-file tree.nwk

You may also want to add metadata fields using the optional ``--author``, ``--description``, and ``--package-version`` options. In addition, you can include profile information and a Stockholm alignment in the package by appending the following options to the ``taxit create`` command::

        --aln-sto seqs.sto \
        --profile align_profile

where ``align_profile`` is a profile that can be used to align query sequences using HMMER3 or Infernal, and ``seqs.sto`` is the reference alignment in Stockholm format. This is handy if you plan to use ``cmalign`` (in the Infernal suite of tools) or ``hmmalign`` (in the HMMER3 suite of tools) to perform your alignments.