Detailed taxit documentation
This section gives the detailed documentation on taxit’s subcommands, organized alphabetically.
add_nodes
usage: taxit add_nodes [-h] [--schema SCHEMA] [--source-name SOURCE_NAME]
[url] FILE
Add nodes and names to a database
The input file specifies new nodes (type: node) and names (type: name)
in yaml format (see
http://fhcrc.github.io/taxtastic/commands.html#add-nodes).
positional arguments:
url Database string URI or filename. If no database scheme
specified "sqlite:///" will be prepended.
[sqlite:///ncbi_taxonomy.db]
FILE yaml file specifying new nodes
options:
-h, --help show this help message and exit
--source-name SOURCE_NAME
Provides the default source name for new nodes. The
value is overridden by "source_name" in the input
file. If not provided, "source_name" is required in
each node or name definition. This source name is
created if it does not exist.
database options:
--schema SCHEMA Name of SQL schema in database to query (if database
flavor supports this).
Add nodes or names to the taxonomy in the specified database. new_nodes (the FILE argument) should be a YAML-format file containing one or more records, each of which specifies a new node or name.
For a new node the following are required:
type
The value must be “node”
tax_id
The tax_id for this new node in the taxonomy, which must not conflict with an existing tax_id. NCBI’s tax_ids are all integers, so it works well to choose an alphabetic prefix for the tax_ids for all new nodes (e.g., name them AB1, AB2, AB3, etc.).
rank
The name of the rank at which this node falls in the taxonomy. Choose from among the ranks specified in table ranks.
parent_id
The tax_id which will be set as the parent of this node in the taxonomy.
names
One or more names to associate with the node. Minimally, must define a single taxonomic name. See description of a name record below.
Required unless a default source name is specified using the --source-name option:
source_name
A string describing the origin of these taxa so that it is easy to find them in the database.
Any combination of the following optional fields may be specified:
children
A list of tax_ids which should be detached from their current parents and attached to this node as its children.
A minimal example of a record specifying a node (assuming --source-name is provided on the command line):
---
type: node
tax_id: "newid"
parent_id: "1279"
rank: species_group
names:
- tax_name: between genus and species
A record providing source_name, multiple taxonomic names, plus child nodes:
---
type: node
tax_id: "newid"
parent_id: "1279"
rank: species_group
names:
- tax_name: between genus and species
is_primary: true
- tax_name: another name
source_name: someplace
children:
- "1280" # Staphylococcus aureus
- "1281" # Staphylococcus carnosus
A record specifying names to be added to existing nodes has the following required fields:
type
The value must be “name”
tax_id
The tax_id to add names to.
names
A list of taxonomic names. If a single name is provided, only tax_name is required; if more than one, the primary name must be indicated (see examples below).
A minimal example (again, assuming source_name is defined on the command line):
---
type: name
tax_id: bar
names:
- tax_name: a new name for bar
If there are multiple names:
---
type: name
tax_id: bar
names:
- tax_name: a new name for bar
is_primary: true
- tax_name: another name
Multiple records are delimited by --- and may contain any combination of names and nodes:
---
type: node
tax_id: "newid"
parent_id: "1279"
rank: species_group
names:
- tax_name: between genus and species
---
type: name
tax_id: bar
names:
- tax_name: a new name for bar
Note that the nodes and names are added to the database in the order specified; be sure to add parent nodes before children.
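Because records are applied in file order, it can help to generate the file programmatically when several new nodes depend on one another. The following is a minimal sketch, not part of taxtastic: the helper names `order_nodes` and `render`, the tax_ids `AB1` and `AB2`, and the taxonomic names are all hypothetical. It assumes plain names that need no YAML quoting and that `--source-name` is supplied on the command line.

```python
def order_nodes(nodes):
    """Return the new nodes so that any new parent precedes its children."""
    by_id = {n["tax_id"]: n for n in nodes}
    ordered, seen = [], set()

    def visit(node):
        if node["tax_id"] in seen:
            return
        seen.add(node["tax_id"])
        parent = by_id.get(node["parent_id"])
        if parent is not None:  # the parent is also a new node: emit it first
            visit(parent)
        ordered.append(node)

    for node in nodes:
        visit(node)
    return ordered


def render(nodes):
    """Render node records in the layout used by the examples above."""
    chunks = []
    for n in order_nodes(nodes):
        chunks.append(
            "---\n"
            "type: node\n"
            f'tax_id: "{n["tax_id"]}"\n'
            f'parent_id: "{n["parent_id"]}"\n'
            f'rank: {n["rank"]}\n'
            "names:\n"
            f'  - tax_name: {n["tax_name"]}\n'
        )
    return "".join(chunks)


# Hypothetical input: the child is listed before its parent on purpose.
nodes = [
    {"tax_id": "AB2", "parent_id": "AB1", "rank": "species",
     "tax_name": "a new species"},
    {"tax_id": "AB1", "parent_id": "1279", "rank": "species_group",
     "tax_name": "between genus and species"},
]
print(render(nodes))
```

The printed text can be saved to a file and passed as FILE to taxit add_nodes.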
add_to_taxtable
usage: taxit add_to_taxtable [-h] [-o CSV] CSV CSV
Add nodes to an existing taxtable csv
positional arguments:
CSV A taxtable to augment
CSV A CSV file containing nodes to add to taxtable. Must
contain columns 'tax_id', 'tax_name', 'rank', and
'parent_id'. Each record must have a parent_id already in
the taxtable, or defined on an earlier row.
options:
-h, --help show this help message and exit
-o CSV, --out CSV Destination for output taxtable [default: stdout]
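The requirements on the second CSV (four required columns, and every parent_id either already in the taxtable or defined on an earlier row) can be checked before running the command. This is an illustrative sketch only; `check_new_nodes` and the toy data are hypothetical and the function is not part of taxtastic.

```python
import csv
import io

REQUIRED = {"tax_id", "tax_name", "rank", "parent_id"}


def check_new_nodes(taxtable_file, nodes_file):
    """Return a list of problems; an empty list means the file looks usable."""
    known = {row["tax_id"] for row in csv.DictReader(taxtable_file)}
    reader = csv.DictReader(nodes_file)
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    problems = []
    for lineno, row in enumerate(reader, start=2):  # line 1 is the header
        if row["parent_id"] not in known:
            problems.append(
                f"line {lineno}: parent_id {row['parent_id']} is not defined yet"
            )
        known.add(row["tax_id"])  # later rows may use this node as a parent
    return problems


# Toy data with made-up tax_ids.
TAXTABLE = "tax_id,parent_id,rank,tax_name\n1,1,root,root\n10,1,genus,Some genus\n"
GOOD = ("tax_id,tax_name,rank,parent_id\n"
        "AB1,new group,species_group,10\n"
        "AB2,new species,species,AB1\n")
BAD = ("tax_id,tax_name,rank,parent_id\n"
       "AB2,new species,species,AB1\n"
       "AB1,new group,species_group,10\n")

print(check_new_nodes(io.StringIO(TAXTABLE), io.StringIO(GOOD)))
print(check_new_nodes(io.StringIO(TAXTABLE), io.StringIO(BAD)))
```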
check
usage: taxit check [-h] REFPKG
Validate a reference package
Checks whether ``REFPKG`` is a valid input for ``pplacer``, that is,
does it have a FASTA file of the reference sequences; a Stockholm file
of their multiple alignment; a Newick formatted tree built from the
aligned sequences; and all the necessary auxiliary information.
positional arguments:
REFPKG Path to Refpkg to check
options:
-h, --help show this help message and exit
composition
usage: taxit composition [-h] [-t csv file] [-i csv file] [-r RANK] [-o OUT]
[refpkg]
Show taxonomic composition of a reference package
positional arguments:
refpkg the reference package to operate on
options:
-h, --help show this help message and exit
-t csv file, --taxonomy csv file
Path to taxtable (ignored if refpkg is provided,
required otherwise)
-i csv file, --seq_info csv file
Path to seq_info (ignored if refpkg is provided,
required otherwise)
-r RANK, --rank RANK rank at which to show composition. Use --rank=tax_id
to show original classifications [species]
-o OUT, --out OUT output file [stdout]
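As a rough illustration of what a composition at a given rank means, the sketch below tallies sequences per taxon using a taxtable that has one column per rank holding the tax_id at that rank (the layout described for taxit create --taxonomy). The function name `composition`, the toy tax_ids, and the "(undefined)" label are assumptions for illustration; the real command's output format may differ.

```python
import csv
import io
from collections import Counter


def composition(taxtable_file, seq_info_file, rank="species"):
    """Count sequences per taxon name at the requested rank."""
    taxa = {row["tax_id"]: row for row in csv.DictReader(taxtable_file)}
    counts = Counter()
    for seq in csv.DictReader(seq_info_file):
        # The rank column holds the tax_id of the ancestor at that rank,
        # or is empty when the sequence is classified above that rank.
        at_rank = taxa[seq["tax_id"]].get(rank)
        name = taxa[at_rank]["tax_name"] if at_rank else "(undefined)"
        counts[name] += 1
    return counts


# Toy inputs with made-up tax_ids.
TAXTABLE = (
    "tax_id,parent_id,rank,tax_name,root,genus,species\n"
    "1,1,root,root,1,,\n"
    "G,1,genus,Genus one,1,G,\n"
    "S1,G,species,Species one,1,G,S1\n"
    "S2,G,species,Species two,1,G,S2\n"
)
SEQ_INFO = "seqname,tax_id\na,S1\nb,S1\nc,S2\nd,G\n"

by_species = composition(io.StringIO(TAXTABLE), io.StringIO(SEQ_INFO))
by_genus = composition(io.StringIO(TAXTABLE), io.StringIO(SEQ_INFO), rank="genus")
print(by_species, by_genus)
```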
create
usage: taxit create [-h] [-c] -P PATH -l LOCUS [-a NAME] [-d TEXT]
[-r VERSION] [-f FILE] [-i file] [-m FILE] [-M FILE]
[-p FILE] [-R FILE] [-s FILE] [-S FILE] [-t FILE]
[-T FILE] [--stats-type {PhyML,FastTree,RAxML}]
[--frequency-type {empirical,model}] [--no-reroot]
[--rppr RPPR]
Create a reference package
Create a new refpkg at the location specified by the argument to
``-P`` with locus name ``-l``. All other fields are used to specify
initial metadata and files to add to the refpkg. If there is already
a refpkg at ``refpkg``, this command will fail unless you specify
``-c`` or ``--clobber``.
options:
-h, --help show this help message and exit
-c, --clobber Delete an existing reference package.
Required arguments:
-P PATH, --package-name PATH
Name of refpkg to create
-l LOCUS, --locus LOCUS
The locus described by the reference package
Package Metadata:
-a NAME, --author NAME
Person who created the reference package
-d TEXT, --description TEXT
An arbitrary description field
-r VERSION, --package-version VERSION
Release version for the reference package
Input files:
-f FILE, --aln-fasta FILE
Multiple alignment in fasta format
-i file, --seq-info file
CSV format file describing the aligned reference
sequences, minimally containing the fields "seqname"
and "tax_id"
-m FILE, --mask FILE Text file containing a mask
-M FILE, --model-file FILE
File containing model information usually the
.bestModel file
-p FILE, --profile FILE
Alignment profile
-R FILE, --readme FILE
README file describing the reference package
-s FILE, --tree-stats FILE
File containing tree statistics (for example
"RAxML_info.whatever")
-S FILE, --aln-sto FILE
Multiple alignment in Stockholm format
-t FILE, --tree-file FILE
Phylogenetic tree in newick format
-T FILE, --taxonomy FILE
CSV format file defining the taxonomy. Fields include
"tax_id","parent_id","rank","tax_name" followed by a
column defining tax_id at each rank starting with root
Tree information:
--stats-type {PhyML,FastTree,RAxML}
stats file type [default: attempt to guess from file
contents]
--frequency-type {empirical,model}
Residue frequency type from the model. Required for
PhyML Amino Acid alignments.
Taxonomic Rerooting:
--no-reroot Do not reroot the reference package using `rppr
reroot`. [default: reroot if `rppr` is available and a
taxonomy file is specified]
--rppr RPPR Name of the rppr executable. [default: rppr]
Input files
Input files are identified in the refpkg using the following labels (see, for example, taxit rp):
Option | File key | Description |
---|---|---|
--aln-fasta | aln_fasta | Reference sequences in FASTA format |
--seq-info | seq_info | CSV describing aligned sequences |
--mask | mask | Text file containing sequence mask |
--profile | profile | Multiple alignment profile |
--readme | readme | A README file for the refpkg |
--tree-stats | tree_stats | Typically written by the tree builder |
--aln-sto | aln_sto | Stockholm file of reference sequences |
--tree-file | tree | Phylogenetic tree in Newick format |
--taxonomy | taxonomy | CSV file specifying taxonomy |
Examples:
# Create a minimal refpkg
taxit create -P my_refpkg -l "Some locus name"
# Create a refpkg with lots of files in it
taxit create -P another_refpkg -l "Another locus" \
--author "Boris the mad baboon" --package-version 0.3.1 \
--aln-fasta seqs.fasta --aln-sto seqs.sto \
--tree-file seqs.newick --seq-info seqs.csv \
--profile cmalign.profile --tree-stats RAxML.info \
--taxonomy taxtable.csv
findcompany
usage: taxit findcompany [-h] [-c] [-i INPUT] [-o OUT] taxdb [tax_ids ...]
Find company for lonely nodes
A command meant to follow ``lonelynodes``. Given a list of tax_ids
produced by ``taxit lonelynodes``, produces another list of species
tax_ids that can be added to the taxtable that would render those
tax_ids no longer lonely.
positional arguments:
taxdb Taxonomy database to work from
tax_ids Tax IDs to look up
options:
-h, --help show this help message and exit
-c, --cut Produce only one output tax_id per input tax_id,
whether or not the output species would themselves be
lonely.
-i INPUT, --input INPUT
Text file to read Tax IDs from, one per line
-o OUT, --out OUT Output file for new taxids
Examples:
taxit findcompany taxonomy.db -i taxids.txt -o newtaxids.txt
taxit findcompany taxonomy.db 31661 5213 564
info
usage: taxit info [-h] [-n] [-t] [-l] refpkg
Show information about reference packages.
positional arguments:
refpkg the reference package to operate on
options:
-h, --help show this help message and exit
-n, --seq-names print a list of sequence names
-t, --tally print a tally of sequences representing each taxon at rank
RANK
-l, --lengths print sequence lengths
lineage_table
usage: taxit lineage_table [-h] [--seqname-col NAME] [--tax-id-col NAME]
[-c FILE] [-t FILE]
FILE FILE
Create a table of lineages as taxonomic names for a collection of sequences
Minimal inputs are a taxtable and a file providing a mapping of
sequence names to tax_ids. Outputs are one or more of:
* a table of taxonomic lineages in csv format
* a "taxonomy" file formatted for MOTHUR
(https://mothur.org/wiki/Taxonomy_File). Ranks are limited to the
following, with corresponding abbreviations:
('species', 's'),
('genus', 'g'),
('family', 'f'),
('order', 'o'),
('class', 'c'),
('phylum', 'p'),
('superkingdom', 'k'),
Lineages are truncated to either the most specific defined rank or
species, and missing tax_names at a given rank are replaced with the
tax_name of the parent, e.g.
"...;f__something;g__<None>;s__whatever"
would become
"...;f__something;g__something_unclassified;s__whatever"
options:
-h, --help show this help message and exit
input options:
FILE output of "taxit taxtable" containing all tax_ids
represented in "seq_info"
FILE csv file providing a mapping of sequence names to
tax_ids
--seqname-col NAME name of column in "seq_info" containing sequence names
--tax-id-col NAME name of column in "seq_info" containing tax_ids
Output options:
-c FILE, --csv-table FILE
Output file containing lineages for each sequence name
in csv format
-t FILE, --taxonomy-table FILE
"taxonomy" file formatted for MOTHUR
Examples:
taxit taxtable taxonomy.db -i seq_info.csv -o taxtable.csv
taxit lineage_table taxtable.csv seq_info.csv \
--csv-table taxonomy.csv --taxonomy-table taxonomy.txt
taxonomy.txt looks like this:
s1 "pk__Bacteria";"ph__Firmicutes";"cl__Bacilli";"or__Bacillales";"fa__Staphylococcaceae";"ge__Staphylococcus";"sp__Staphylococcus aureus"
s2 "pk__Bacteria";"ph__Firmicutes";"cl__Bacilli";"or__Bacillales";"fa__Staphylococcaceae";"ge__Staphylococcus";"sp__Staphylococcus equorum"
s3 "pk__Bacteria";"ph__Firmicutes";"cl__Bacilli";"or__Bacillales";"fa__Staphylococcaceae";"ge__Staphylococcus";"sp__Staphylococcus equorum"
s4 "pk__Bacteria";"ph__Firmicutes";"cl__Bacilli";"or__Bacillales";"fa__Staphylococcaceae";"ge__Staphylococcus";"sp__unclassified"
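The rule for missing names described in the help text (a missing tax_name is replaced with the parent's name plus "_unclassified") can be sketched as follows. This is a hypothetical helper, not taxtastic's code. It assumes the lineage is ordered from the highest rank down and that, when several consecutive ranks are missing, the nearest named ancestor is used; the tool's behavior in that case is not stated here.

```python
def fill_missing(lineage):
    """lineage: list of (prefix, name or None), ordered from root toward species."""
    filled, last_named = [], None
    for prefix, name in lineage:
        if name is None:
            if last_named is None:
                raise ValueError("the first rank in the lineage must be named")
            name = f"{last_named}_unclassified"
        else:
            last_named = name
        filled.append(f"{prefix}__{name}")
    return ";".join(filled)


# Mirrors the example in the help text above.
print(fill_missing([("f", "something"), ("g", None), ("s", "whatever")]))
```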
lonelynodes
usage: taxit lonelynodes [-h] [-o OUT] [-r RANKS] taxtable_or_refpkg
Extracts tax ids of all lonely nodes in a taxtable
Find nodes in ``target`` (which can be a CSV file extracted by ``taxit
taxtable`` or a RefPkg containing such a file) which are lonely; that
is, whose parents have only one child. Print them, one per line, to
``stdout`` or to the file specified by the ``-o`` option.
positional arguments:
taxtable_or_refpkg A taxtable or a refpkg containing a taxtable
options:
-h, --help show this help message and exit
-o OUT, --out OUT Write output to given file [default: stdout]
-r RANKS, --ranks RANKS
Comma separated list of ranks to consider [default:
all ranks]
Examples:
# Find lonely nodes in RefPkg mypkg-0.1.refpkg
taxit lonelynodes mypkg-0.1.refpkg
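The definition of a lonely node (a node whose parent has only one child) can be illustrated on a taxtable CSV with a few lines of Python. This is a sketch rather than the command's implementation: `lonely_nodes` and the toy tax_ids are hypothetical, and it assumes the root row either has an empty parent_id or lists itself as its parent.

```python
import csv
import io
from collections import defaultdict


def lonely_nodes(taxtable_file):
    """Return the tax_ids whose parent has exactly one child, sorted."""
    children = defaultdict(list)
    for row in csv.DictReader(taxtable_file):
        tax_id, parent_id = row["tax_id"], row["parent_id"]
        if parent_id and parent_id != tax_id:  # skip the root row
            children[parent_id].append(tax_id)
    return sorted(kids[0] for kids in children.values() if len(kids) == 1)


# Toy taxtable: A and B share the parent 1, while C is the only child of A.
TAXTABLE = (
    "tax_id,parent_id,rank,tax_name\n"
    "1,1,root,root\n"
    "A,1,genus,Genus A\n"
    "B,1,genus,Genus B\n"
    "C,A,species,Species C\n"
)
print(lonely_nodes(io.StringIO(TAXTABLE)))
```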
new_database
usage: taxit new_database [-h] [--schema SCHEMA] [--no-clobber] [-n]
[-z FILE.zip] [-u URL] [-p PATH] [--out OUT]
[url]
Download NCBI taxonomy and create a database
Download the current version of the NCBI taxonomy and load it into
``database_file`` as an SQLite3 database. If ``database_file``
already exists it will be overwritten unless you specify ``--no-clobber``.
By default the NCBI taxonomy is downloaded into the directory in which
``database_file`` is created; specify ``-p`` or ``--download-dir`` to
download it elsewhere.
positional arguments:
url Database string URI or filename. If no database scheme
specified "sqlite:///" will be prepended.
[sqlite:///ncbi_taxonomy.db]
options:
-h, --help show this help message and exit
--no-clobber If database exists keep current data and append new
data. [False]
-n, --no-load Create schema and exit
--out OUT table sql
database options:
--schema SCHEMA Name of SQL schema in database to query (if database
flavor supports this).
download options:
-z FILE.zip, --taxdump-file FILE.zip
Location of zipped taxdump file [taxdmp.zip]
-u URL, --taxdump-url URL
Url to taxdump file
[https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip]
-p PATH, --download-dir PATH
Name of the directory into which to download the zip
archive. [default is the same directory as the
database file]
Examples:
Download the NCBI taxonomy and create taxonomy.db if it does not exist:
taxit new_database taxonomy.db
Force the creation of taxonomy.db in the parent directory, putting the downloaded NCBI data in /tmp/ncbi:
taxit new_database ../taxonomy.db -p /tmp/ncbi
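The handling of the url argument ("sqlite:///" is prepended when no database scheme is given) amounts to something like the sketch below. `normalize_db_url` is a hypothetical name, and detecting a scheme by looking for "://" is an assumption about how the check is done.

```python
def normalize_db_url(url="ncbi_taxonomy.db"):
    """Prepend the SQLite scheme when the argument is a bare filename."""
    return url if "://" in url else "sqlite:///" + url


print(normalize_db_url("taxonomy.db"))
print(normalize_db_url("postgresql://host/taxonomy"))
```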
refpkg_intersection
usage: taxit refpkg_intersection [-h] -c REFPKG -r RANKS [--all-ranks]
[-o OUT]
infile
Find the intersection of a taxtable and a refpkg's taxonomy.
positional arguments:
infile taxtable to compare against
options:
-h, --help show this help message and exit
-c REFPKG, --refpkg REFPKG
refpkg to insert into
-r RANKS, --ranks RANKS
ranks to list in the output
--all-ranks don't filter by the lowest rank; list all
intersections
-o OUT, --out OUT output file in csv format (default is stdout)
reroot
usage: taxit reroot [-h] [--rppr RPPR] [-p] refpkg
Taxonomically reroots a reference package
Calls ``rppr reroot`` to generate a rerooted tree from the tree in
``refpkg`` and writes it back to the refpkg. The refpkg ``refpkg``
must contain the necessary inputs for ``pplacer`` for this to work.
positional arguments:
refpkg the reference package to operate on
options:
-h, --help show this help message and exit
--rppr RPPR specify the rppr binary to call to perform the rerooting
-p, --pretend don't save the rerooted tree; just attempt the rerooting.
Examples:
Reroot the tree in my_refpkg:
taxit reroot my_refpkg
Try running reroot without modifying the refpkg, using a particular version of rppr:
taxit reroot --rppr ~/local/bin/rppr -p my_refpkg
rollback
usage: taxit rollback [-h] [-n int] refpkg
Undo an operation performed on a refpkg
Rollback ``N`` operations on ``refpkg`` (default to 1 operation if
``-n`` is omitted). This is equivalent to calling the ``rollback()``
method of ``taxtastic.refpkg.Refpkg``. If there are not at least
``N`` operations that can be rolled back, an error is returned and no
changes are made to the refpkg.
positional arguments:
refpkg the reference package to operate on
options:
-h, --help show this help message and exit
-n int Number of operations to roll back
Examples:
Update the author on my_refpkg, then revert the change:
taxit update my_refpkg --metadata 'author=Boris the mad baboon'
taxit rollback my_refpkg
Roll back the last 3 operations on my_refpkg:
taxit rollback -n 3 my_refpkg
rollforward
usage: taxit rollforward [-h] [-n int] refpkg
Restore a change to a refpkg immediately after being reverted
Restore the last ``N`` rolled back operations on ``refpkg``, or the
last operation if ``-n`` is omitted. If there are not at least ``N``
operations that can be rolled forward on this refpkg, then an error is
returned and no changes are made to the refpkg.
Note that operations can only be rolled forward immediately after
being rolled back. If any operation besides a rollback occurs, all
roll forward information is removed.
positional arguments:
refpkg the reference package to operate on
options:
-h, --help show this help message and exit
-n int Number of operations to roll forward
Examples:
Roll back the last operation on my_refpkg, then restore it:
taxit rollback my_refpkg
taxit rollforward my_refpkg
Roll forward the last 3 rollbacks on my_refpkg:
taxit rollforward -n 3 my_refpkg
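The rollback and rollforward semantics described above (all-or-nothing when fewer than N operations are available, and loss of rollforward information after any other operation) behave like a pair of undo and redo stacks. The toy model below only illustrates those rules; the `History` class is hypothetical and is not how ``taxtastic.refpkg.Refpkg`` stores its state.

```python
class History:
    """Toy undo/redo model of a refpkg's operation history."""

    def __init__(self, state):
        self.state = state
        self._undo = []  # earlier states, oldest first
        self._redo = []  # rolled-back states, most recently undone last

    def apply(self, new_state):
        """Any ordinary operation discards the rollforward information."""
        self._undo.append(self.state)
        self.state = new_state
        self._redo.clear()

    def rollback(self, n=1):
        if n > len(self._undo):  # check first so that nothing changes on error
            raise ValueError("not enough operations to roll back")
        for _ in range(n):
            self._redo.append(self.state)
            self.state = self._undo.pop()

    def rollforward(self, n=1):
        if n > len(self._redo):
            raise ValueError("not enough operations to roll forward")
        for _ in range(n):
            self._undo.append(self.state)
            self.state = self._redo.pop()


h = History("v0")
h.apply("v1")
h.apply("v2")
h.rollback(2)     # back at v0
h.rollforward()   # forward to v1
h.apply("v3")     # v2 can no longer be restored
```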
rp (resolve path)
usage: taxit rp [-h] refpkg KEY
Resolve path; get the path to a file in the reference package
See online documentation for ``taxit create`` for a list of
permissible values for ``KEY``
For example, write the absolute path to the file containing the
phylogenetic tree in ``my.refpkg`` to stdout::
taxit rp my.refpkg tree
Examine the contents of the seq_info file::
less $(taxit rp my.refpkg seq_info)
positional arguments:
refpkg the reference package to operate on
KEY show the path for file identified by KEY
options:
-h, --help show this help message and exit
strip
usage: taxit strip [-h] refpkg
Remove rollback and rollforward information from a refpkg
Delete everything in the refpkg not relevant to the current state,
including all files no longer referred to, as well as all rollback and
rollforward information. The log is preserved, with a new entry
entered indicating that ``refpkg`` was stripped.
positional arguments:
refpkg the reference package to operate on
options:
-h, --help show this help message and exit
Examples:
Perform an update:
taxit update my_refpkg hilda=file1
After this, file1 is still in the refpkg, but not referred to except by the rollback information:
taxit update my_refpkg hilda=file2
Now strip deletes file1, and the rollback and rollforward information:
taxit strip my_refpkg
taxids
usage: taxit taxids [-h] [--schema SCHEMA] [-f FILE | -n NAMES] [-o FILE]
[url]
Convert a list of taxonomic names into a recursive list of species
level tax_ids.
The names to convert can be specified in a text file with one name
per line (the ``-f`` or ``--name-file`` options) or on the command
line as a comma delimited list (the ``-n`` or ``--name`` options).
positional arguments:
url Database string URI or filename. If no database scheme
specified "sqlite:///" will be prepended.
[sqlite:///ncbi_taxonomy.db]
options:
-h, --help show this help message and exit
database options:
--schema SCHEMA Name of SQL schema in database to query (if database
flavor supports this).
Input options:
-f FILE, --name-file FILE
file containing a list of taxonomic names, one per
line
-n NAMES, --name NAMES
list of taxonomic names provided as a comma-delimited
list on the command line
Output options:
-o FILE, --out FILE output file
Examples:
Look up two species and print their tax_ids to stdout, one per line:
taxit taxids ncbi_database.db -n "Lactobacillus crispatus,Lactobacillus helveticus"
Read the species from some_names.txt and write their tax_ids to some_taxids.txt:
taxit taxids ncbi_database.db -f some_names.txt -o some_taxids.txt
taxtable
usage: taxit taxtable [-h] [--schema SCHEMA] [-t TAX_IDS [TAX_IDS ...]]
[-f FILE] [-i SEQ_INFO] [-a {error,warn}] [-o FILE]
[url]
Create a tabular representation of taxonomic lineages
Write a CSV file containing the minimal subset of the taxonomy in
``database_file`` representing all of the lineages specified by the
provided tax_ids. Duplicate tax_ids are ignored.
By default the CSV is written to ``stdout``, unless a file is
specified with ``-o/--outfile``.
positional arguments:
url Database string URI or filename. If no database scheme
specified "sqlite:///" will be prepended.
[sqlite:///ncbi_taxonomy.db]
options:
-h, --help show this help message and exit
-a {error,warn}, --unknown-action {error,warn}
action to perform for tax_ids not present in database
[error]
database options:
--schema SCHEMA Name of SQL schema in database to query (if database
flavor supports this).
input options:
-t TAX_IDS [TAX_IDS ...], --tax-ids TAX_IDS [TAX_IDS ...]
one or more space-delimited tax_ids (eg "-t 47770
33945")
-f FILE, --tax-id-file FILE
File containing a whitespace-delimited list of tax_ids
(i.e., separated by tabs, spaces, or newlines).
-i SEQ_INFO, --seq-info SEQ_INFO
Read tax_ids from sequence info file, minimally
containing a column named "tax_id"
Output options:
-o FILE, --outfile FILE
Output file containing lineages for the specified taxa
in csv format; writes to stdout if unspecified
Examples:
Extract tax_ids 47770 and 33945 and all nodes connecting them to the root:
taxit taxtable taxonomy.db -t 47770 33945
The same as above, but write the output to subtax.csv instead of stdout:
taxit taxtable taxonomy.db -t 47770 33945 -o subtax.csv
Extract the same tax_ids, plus the tax_ids listed in taxids.txt:
taxit taxtable taxonomy.db -t 47770 33945 -f taxids.txt -o taxonomy_from_both.csv
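The "minimal subset of the taxonomy representing all of the lineages" can be pictured as the union of the paths from each requested tax_id up to the root. The sketch below shows that idea on an in-memory parent map; `lineage_subset` and the toy tax_ids are hypothetical, and the real command reads the taxonomy from the database. Unknown tax_ids raise an error here, loosely mirroring the default of --unknown-action.

```python
def lineage_subset(parents, tax_ids):
    """parents maps tax_id -> parent tax_id; the root maps to itself."""
    keep = set()
    for tax_id in tax_ids:
        # Walk toward the root, stopping once we reach a node already kept.
        while tax_id not in keep:
            if tax_id not in parents:
                raise KeyError(f"unknown tax_id: {tax_id}")
            keep.add(tax_id)
            tax_id = parents[tax_id]
    return keep


# Toy taxonomy: B and C are children of A; A and D are children of the root.
parents = {"1": "1", "A": "1", "B": "A", "C": "A", "D": "1"}
subset = lineage_subset(parents, ["B", "C", "B"])  # the duplicate is harmless
print(sorted(subset))
```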
update
usage: taxit update [-h] [--metadata]
[--stats-type {PhyML,FastTree,RAxML,IQTREE}]
[--frequency-type {empirical,model}]
refpkg [key=value ...]
Add or modify files or metadata in a refpkg
Update ``refpkg`` to set ``key`` to ``some value``. If ``--metadata``
is specified, the update is done to the metadata. Otherwise ``some
value`` is treated as the path to a file, and that file is updated in
``refpkg``. An arbitrary number of "key=value" pairs can be specified on the
command line. If the same key is specified twice, the later
occurrence dominates.
All updates specified to an instance of this command are run as a
single operation, and will all be undone by a single rollback.
For example::
taxit update my-refpkg meep=../otherdir/boris hilda=abcd
If a file already exists under a given key, it is overwritten.
The --metadata option causes a change to the metadata instead of
files. For example, to set the author field to "Genghis Khan" and the
version to 0.4.3::
taxit update my-refpkg --metadata "author=Genghis Khan" version=0.4.3
Other examples:
Set the author in my_refpkg::
taxit update my_refpkg --metadata "author=Boris the mad baboon"
Set the author and version at once::
taxit update my_refpkg --metadata "author=Bill" "package_version=1.7.2"
Insert a file into the refpkg::
taxit update my_refpkg "aln_fasta=/path/to/a/file.fasta"
positional arguments:
refpkg the reference package to operate on
key=value keys to update, in key=some_file format
options:
-h, --help show this help message and exit
--metadata Update metadata instead of files
Tree inference log file parsing (for updating `tree_stats`):
--stats-type {PhyML,FastTree,RAxML,IQTREE}
stats file type [default: attempt to guess from file
contents]
--frequency-type {empirical,model}
Residue frequency type from the model. Required for
PhyML Amino Acid alignments.
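The rule that a later occurrence of a key dominates is what falls out of collecting the pairs into a dictionary in order. A minimal sketch, assuming each argument is split at its first "=" (`parse_pairs` is a hypothetical helper, not taxtastic's parser):

```python
def parse_pairs(pairs):
    """Turn ["key=value", ...] into a dict; later duplicates win."""
    updates = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        updates[key] = value
    return updates


print(parse_pairs(["hilda=file1", "meep=../otherdir/boris", "hilda=file2"]))
```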
update_taxids
usage: taxit update_taxids [-h] [--schema SCHEMA] [--delimiter]
[--taxid-column] [--unknowns]
[-a {drop,ignore,error}] [-o]
infile [url]
Update obsolete tax_ids
Replaces tax_ids as specified in table 'merged' in the taxonomy
database. Use in preparation for ``taxit taxtable``. Takes sequence
info file as passed to ``taxit create --seq-info``.
positional arguments:
infile Input file with taxids. Use "-" for stdin.
url Database string URI or filename. If no database scheme
specified "sqlite:///" will be prepended.
[sqlite:///ncbi_taxonomy.db]
options:
-h, --help show this help message and exit
--delimiter Infile columns delimiter [,]
--taxid-column name of column or index if headerless containing
tax_ids to be replaced [tax_id]
--unknowns optional output file containing rows with unknown
tax_ids having no replacements in merged table
-a {drop,ignore,error}, --unknown-action {drop,ignore,error}
action to perform for tax_ids with no replacement in
merged table [error]
-o , --outfile Modified version of input file [stdout]
database options:
--schema SCHEMA Name of SQL schema in database to query (if database
flavor supports this).
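To make the replacement logic concrete, here is a sketch of updating tax_ids from a merged mapping. Everything in it is illustrative: `update_tax_ids`, the `known` set, and the toy ids are hypothetical, and the meanings given to the three --unknown-action choices (drop removes the row, ignore keeps it unchanged, error aborts) are inferred from their names rather than taken from the implementation.

```python
def update_tax_ids(rows, known, merged, unknown_action="error"):
    """rows: list of dicts with a "tax_id" key; merged: old tax_id -> new tax_id."""
    out = []
    for row in rows:
        tax_id = row["tax_id"]
        if tax_id in known:            # still current: keep as-is
            out.append(row)
        elif tax_id in merged:         # obsolete: substitute the replacement
            out.append({**row, "tax_id": merged[tax_id]})
        elif unknown_action == "drop":
            continue
        elif unknown_action == "ignore":
            out.append(row)
        else:
            raise ValueError(f"no replacement for tax_id {tax_id}")
    return out


# Toy data with made-up ids.
rows = [
    {"seqname": "s1", "tax_id": "cur1"},
    {"seqname": "s2", "tax_id": "old1"},
    {"seqname": "s3", "tax_id": "gone"},
]
updated = update_tax_ids(rows, known={"cur1", "new1"},
                         merged={"old1": "new1"}, unknown_action="drop")
print(updated)
```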