The Python Refpkg API
Taxtastic provides a Python API for creating and manipulating refpkgs without having to resort to the command line. It is built entirely around the Refpkg class in taxtastic.refpkg. Thus all scripts dealing with refpkgs will include code that looks something like:
from taxtastic.refpkg import Refpkg
r = Refpkg('/path/to/refpkg')
The constructor takes no other arguments. If the refpkg at /path/to/refpkg already exists, r is attached to it. If /path/to/refpkg does not exist, a new, empty refpkg is created. If there is already something at /path/to/refpkg, but it is not a refpkg, Refpkg raises a ValueError.
The rest of the API is implemented as methods of the Refpkg class.
Note: This library is not thread safe!
Accessing refpkg contents
Refpkgs are primarily key-value stores for files and metadata. To find all the keys referring to files or to metadata in a given refpkg, use the methods file_keys and metadata_keys.
For example:
r = Refpkg('/path/to/new/refpkg')
r.metadata_keys()
--> ['create_date', 'format_version']
r.update_file('file_key', '/path/to/some/file')
r.file_keys()
--> ['file_key']
Call the metadata method to retrieve the value of a particular metadata field. For instance, on the same refpkg as above:
r.metadata('format_version')
--> '1.1'
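The key listing and the per-key lookup combine naturally into a dump of all metadata. The sketch below is only illustrative: metadata_snapshot is a hypothetical helper name, and it relies only on the metadata_keys and metadata methods described above.

```python
def metadata_snapshot(refpkg):
    """Return a {key: value} dict covering every metadata field.

    `refpkg` is assumed to be a taxtastic Refpkg instance (or any object
    offering metadata_keys() and metadata(key)).
    """
    return {key: refpkg.metadata(key) for key in refpkg.metadata_keys()}
```

With a real package this would be called as metadata_snapshot(Refpkg('/path/to/refpkg')).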
There are several methods for working with the resources in a reference package.
- Refpkg.open_resource(resource, *mode)
Return an open file object for a particular named resource in this reference package.
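Since open_resource returns an open file object, it can be used in a with block so the handle is closed afterwards. A minimal sketch, where read_resource is a hypothetical helper and no mode argument is passed:

```python
def read_resource(refpkg, resource):
    """Read the whole content of a named resource and close the handle.

    `refpkg` is assumed to be a taxtastic Refpkg; `resource` is the name
    of a resource stored in it.
    """
    with refpkg.open_resource(resource) as handle:
        return handle.read()
```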
Checking refpkg integrity
There are two methods for checking a refpkg. The is_invalid method enforces only that the refpkg is sane: there is a CONTENTS.json file, the MD5 sums listed match the actual files, and other such basics. The is_ill_formed method is much stronger. It enforces the structure a refpkg must have to be fed to pplacer.
- Refpkg.is_invalid()
Check if this Refpkg is invalid.
Valid means that it contains a properly named manifest, and each of the files described in the manifest exists and has the proper MD5 hashsum.
If the Refpkg is valid, is_invalid returns False. Otherwise it returns a nonempty string describing the error.
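Because is_invalid returns False on success and an error string on failure, the result can be tested for truthiness. The following sketch turns a failed check into an exception; require_valid is a hypothetical helper name, and raising ValueError is a choice made for this example.

```python
def require_valid(refpkg):
    """Return `refpkg` unchanged if it is valid, otherwise raise.

    is_invalid() yields False for a valid refpkg and a nonempty error
    string otherwise, so any truthy result signals a problem.
    """
    problem = refpkg.is_invalid()
    if problem:
        raise ValueError('refpkg failed validation: {}'.format(problem))
    return refpkg
```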
Updating and modifying refpkgs
All updates are made via two methods: update_metadata and update_file. Each takes a key and a new value to set at that key. update_metadata returns the previous value of the key and update_file returns the path to the previous file; both return None if the key was not previously defined in the refpkg.
- Refpkg.update_metadata(key, value)
Set key in the metadata to value.
Returns the previous value of key, or None if the key was not previously set.
- Refpkg.update_file(key, new_path)
Insert file new_path into the refpkg under key.
The filename of new_path will be preserved in the refpkg unless it would conflict with a previously existing file, in which case a suffix is appended which makes it unique. The previous file, if there was one, is left in the refpkg. If you wish to delete it, see the strip method.
The full path to the previous file referred to by key is returned, or None if key was not previously defined in the refpkg.
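The return values make it easy to record what an update replaced. A small sketch, assuming a hypothetical helper update_metadata_fields that applies several metadata updates and collects the previous values:

```python
def update_metadata_fields(refpkg, fields):
    """Apply each key/value pair in `fields` via update_metadata.

    Returns a dict mapping each key to its previous value, which is None
    for keys that were not set before.
    """
    previous = {}
    for key, value in fields.items():
        previous[key] = refpkg.update_metadata(key, value)
    return previous
```

Each call here is a separate operation, so each one gets its own log entry.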
Refpkg history, undo, and redo
Each operation performed on a refpkg leaves an entry in a log stored in CONTENTS.json. You can access this log by calling the log method.
- Refpkg.log()
Returns the log of this refpkg.
The log is a list of strings, one per operation, from newest to oldest.
Each operation can also be undone (and redone once undone). The undo stack is arbitrarily deep, so all operations back to the previous call to strip (see below) can be undone. To undo an operation, call rollback. To redo it afterwards, call rollforward. The log is updated to stay in sync with each rollback and rollforward. Note that when you call another operation, all the redo information before that point is removed. You cannot undo an operation, perform another operation, then redo the first operation.
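One use of rollback is to try an update and undo it when a check fails. The sketch below assumes a hypothetical helper update_or_undo and a caller-supplied accept callback; only update_metadata and rollback come from the API described here.

```python
def update_or_undo(refpkg, key, value, accept):
    """Set a metadata key, then undo the change if `accept` rejects it.

    `accept` is a caller-supplied function taking the refpkg and
    returning True to keep the update. Returns True if the update was
    kept and False if it was rolled back.
    """
    refpkg.update_metadata(key, value)
    if accept(refpkg):
        return True
    # Undo the single operation performed above.
    refpkg.rollback()
    return False
```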
After performing a lot of operations on a refpkg, there will often be a long undo history, and files no longer referred to in the refpkg’s current state. To remove everything not relevant to the refpkg’s current state other than the log, call the strip method.
- Refpkg.strip()
Remove rollbacks, rollforwards, and all non-current files.
When distributing a refpkg, you probably want to distribute as small a one as possible. strip removes everything from the refpkg which is not relevant to its current state.
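A pre-distribution step might validate the refpkg before discarding its undo history, since stripped operations can no longer be rolled back. This is a sketch under that assumption; strip_for_release is a hypothetical helper name.

```python
def strip_for_release(refpkg):
    """Validate, strip, and return the log of a refpkg about to be shared.

    Validation happens first because the undo history removed by strip()
    cannot be recovered afterwards.
    """
    problem = refpkg.is_invalid()
    if problem:
        raise ValueError('not stripping an invalid refpkg: {}'.format(problem))
    refpkg.strip()
    # The log is kept by strip(), so it can still be reported.
    return refpkg.log()
```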
You can force a series of operations to be recorded as a single operation for rollback and have a single log entry by calling start_transaction before them, and commit_transaction with the log entry to record when they are done.
- Refpkg.start_transaction()
Begin a transaction to group operations on the refpkg.
All the operations until the next call to commit_transaction will be recorded as a single operation for rollback and rollforward, and recorded with a single line in the log.
For example,
from taxtastic.refpkg import Refpkg
r = Refpkg('/path/to/refpkg')
r.start_transaction()
r.update_metadata('author', 'Boris the mad baboon')
r.update_file('boris_signature', '/path/to/some/file')
r.commit_transaction("Left Boris's mark!")
would result in a single operation that could be rolled back as one, and leaves the log entry "Left Boris's mark!".
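The start/commit pairing can be wrapped in a context manager so the commit is not forgotten. This is a sketch, not part of taxtastic: transaction is a hypothetical helper, and it commits only when the body finishes without an exception. How to clean up a transaction left open by a failure is not covered in the text above and is left to the caller.

```python
from contextlib import contextmanager


@contextmanager
def transaction(refpkg, message):
    """Group the operations in a with block under one log entry.

    Assumes `refpkg` offers start_transaction() and
    commit_transaction(message) as described above.
    """
    refpkg.start_transaction()
    yield refpkg
    # Reached only if the with block raised no exception.
    refpkg.commit_transaction(message)
```

The earlier example would then read: with transaction(r, "Left Boris's mark!"): followed by the two update calls in the indented body.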
pplacer-specific commands
Finally, the API has three commands which are specific to creating inputs for pplacer. One of these is is_ill_formed, the check described above. The other two are:
- Refpkg.reroot(rppr=None, pretend=False)
Reroot the phylogenetic tree.
This operation calls rppr reroot to generate the rerooted tree, so you must have pplacer and its auxiliary tools rppr and guppy installed for it to work. You can specify the path to rppr by giving it as the rppr argument.
If pretend is True, the rerooting is run, but the refpkg is not actually updated.
- Refpkg.update_phylo_model(stats_type, stats_file, frequency_type=None)
Parse a stats log and use it to update phylo_model.
pplacer expects its input to include the details of the phylogenetic model used for creating a tree in JSON format under the key phylo_model, but no program actually outputs that format.
This function takes a log generated by RAxML, FastTree, PhyML or IQTREE, parses it, and inserts an appropriate JSON file into the refpkg. The first parameter must be 'RAxML', 'PhyML', 'IQTREE' or 'FastTree', depending on which program generated the log. It may also be None to attempt to guess which program generated the log.
- Parameters:
stats_type – Statistics file type. One of 'RAxML', 'FastTree', 'PhyML', 'IQTREE', or None to guess.
stats_file – Path to the statistics/log/iqtree file.
frequency_type – For stats_type == 'PhyML', amino acid alignments only: was the alignment inferred with model or empirical frequencies?
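Since stats_type and frequency_type accept only a few values, a thin wrapper can reject typos before the log is parsed. The sketch below is an assumption-laden illustration: set_phylo_model and the two constant tuples are hypothetical names, and the allowed values are copied from the parameter descriptions above.

```python
# Allowed values as listed in the parameter descriptions above.
KNOWN_STATS_TYPES = ('RAxML', 'FastTree', 'PhyML', 'IQTREE')
KNOWN_FREQUENCY_TYPES = ('model', 'empirical')


def set_phylo_model(refpkg, stats_file, stats_type=None, frequency_type=None):
    """Validate the arguments, then call refpkg.update_phylo_model.

    stats_type=None asks the library to guess which program wrote the log.
    """
    if stats_type is not None and stats_type not in KNOWN_STATS_TYPES:
        raise ValueError('unknown stats_type: {!r}'.format(stats_type))
    if frequency_type is not None and frequency_type not in KNOWN_FREQUENCY_TYPES:
        raise ValueError('unknown frequency_type: {!r}'.format(frequency_type))
    refpkg.update_phylo_model(stats_type, stats_file,
                              frequency_type=frequency_type)
```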