The Python Refpkg API

Taxtastic provides a Python API for creating and manipulating refpkgs without having to resort to the command line. It is built entirely around the Refpkg class in taxtastic.refpkg, so every script dealing with refpkgs will include code that looks something like:

from taxtastic.refpkg import Refpkg

r = Refpkg('/path/to/refpkg')

The constructor takes no other arguments. If a refpkg already exists at /path/to/refpkg, r is attached to it. If nothing exists at /path/to/refpkg, a new, empty refpkg is created there. If something exists at /path/to/refpkg but it is not a refpkg, Refpkg raises a ValueError.

The rest of the API is implemented as methods of the Refpkg class.

Note: This library is not thread safe!

Accessing refpkg contents

Refpkgs are primarily key-value stores for files and metadata. To list the keys that refer to files or to metadata in a given refpkg, use the following two methods.

Refpkg.file_keys()

Return a list of all the keys referring to files in this refpkg.

Refpkg.metadata_keys()

Return a list of all the keys referring to metadata in this refpkg.

For example:

r = Refpkg('/path/to/new/refpkg')
r.metadata_keys()
 --> ['create_date', 'format_version']
r.update_file('file_key', '/path/to/some/file')
r.file_keys()
 --> ['file_key']

Call the metadata method to retrieve the value of a particular metadata field.

Refpkg.metadata(key)

Return the metadata value associated to key.

For instance, on the same refpkg as above:

r.metadata('format_version')
 --> '1.1'
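
The listing and lookup methods above combine naturally into a small summary helper. The following is a minimal sketch: summarize_refpkg is a hypothetical name, not part of Taxtastic, and it relies only on file_keys, metadata_keys, and metadata as described above. It accepts an already constructed Refpkg rather than creating one.

```python
def summarize_refpkg(refpkg):
    """Collect a refpkg's metadata values and file keys into one dict.

    `refpkg` is expected to be a taxtastic.refpkg.Refpkg instance.
    This helper is illustrative only and not part of Taxtastic.
    """
    # Look up the value stored under every metadata key.
    metadata = {key: refpkg.metadata(key) for key in refpkg.metadata_keys()}
    # Sort the file keys so the summary is stable between runs.
    return {"metadata": metadata, "files": sorted(refpkg.file_keys())}
```

With a real package this would be called as summarize_refpkg(Refpkg('/path/to/refpkg')).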

There are several methods for working with the resources in a reference package.

Refpkg.open_resource(resource, *mode)

Return an open file object for a particular named resource in this reference package.

Refpkg.resource_name(resource)

Return the name of the file within the reference package for a particular named resource.

Refpkg.resource_md5(resource)

Return the stored MD5 sum for a particular named resource.

Refpkg.resource_path(resource)

Return the path to the file within the reference package for a particular named resource.

Do not use this method if it is at all possible to use the open_resource method instead.
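
As an illustration of preferring open_resource over resource_path, the sketch below recomputes a resource's checksum and compares it with the stored one. It rests on two assumptions that the text above does not confirm: that the extra mode argument is passed through as it would be to the built-in open (so 'rb' gives a binary handle), and that resource_md5 returns a hexadecimal digest string. The name resource_matches_md5 is hypothetical.

```python
import hashlib


def resource_matches_md5(refpkg, resource, chunk_size=65536):
    """Return True if the file behind `resource` hashes to its stored MD5.

    Assumes open_resource accepts 'rb' like the built-in open(), and that
    resource_md5 returns a hex digest string. Illustrative helper only.
    """
    digest = hashlib.md5()
    handle = refpkg.open_resource(resource, 'rb')
    try:
        # Read in chunks so large alignments are not loaded all at once.
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    finally:
        handle.close()
    return digest.hexdigest() == refpkg.resource_md5(resource)
```

Because the helper never asks for the file's path, it keeps working however the refpkg chooses to name the file internally.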

Checking refpkg integrity

There are two methods for checking a refpkg. The is_invalid method enforces only that the refpkg is sane: there is a CONTENTS.json file, the MD5 sums listed match the actual files, and other such basics. The is_ill_formed method is much stronger. It enforces the necessary structure of a refpkg to be fed to pplacer.

Refpkg.is_invalid()

Check if this Refpkg is invalid.

Valid means that it contains a properly named manifest, and each of the files described in the manifest exists and has the proper MD5 hashsum.

If the Refpkg is valid, is_invalid returns False. Otherwise it returns a nonempty string describing the error.

Refpkg.is_ill_formed()

Stronger set of checks than is_invalid for Refpkg.

Checks that FASTA, Stockholm, JSON, and CSV files under known keys are all valid as well as calling is_invalid. Returns either False or a string describing the error.
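
Both checks return False on success and an error string on failure, so a truthy result means something is wrong. The sketch below wraps that convention in a helper; find_problem and its for_pplacer flag are hypothetical names introduced here for illustration.

```python
def find_problem(refpkg, for_pplacer=True):
    """Return a description of what is wrong with `refpkg`, or None.

    With for_pplacer=True the stronger is_ill_formed check is used, which
    (per the description above) also runs is_invalid. Illustrative only.
    """
    check = refpkg.is_ill_formed if for_pplacer else refpkg.is_invalid
    problem = check()
    # Both checks return False when the refpkg passes.
    return problem if problem else None
```

A caller can then write problem = find_problem(r) and report the string if it is not None.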

Updating and modifying refpkgs

All updates are made via two methods: update_metadata and update_file. Each takes a key and a new value to set at that key. update_metadata returns the previous value of the key and update_file returns the path to the previous file; both return None if the key was not previously defined in the refpkg.

Refpkg.update_metadata(key, value)

Set key in the metadata to value.

Returns the previous value of key, or None if the key was not previously set.

Refpkg.update_file(key, new_path)

Insert file new_path into the refpkg under key.

The filename of new_path will be preserved in the refpkg unless it would conflict with a previously existing file, in which case a suffix is appended which makes it unique. The previous file, if there was one, is left in the refpkg. If you wish to delete it, see the strip method.

The full path to the previous file referred to by key is returned, or None if key was not previously defined in the refpkg.
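
The return values make it easy to record what an update replaced. As a minimal sketch, the hypothetical helper below applies several metadata updates and collects the previous values; keys that were not set before map to None.

```python
def set_metadata_fields(refpkg, **fields):
    """Apply each key=value pair with update_metadata.

    Returns a dict mapping every key to its previous value, or None if
    the key was not set before. Illustrative helper, not part of Taxtastic.
    """
    previous = {}
    for key, value in fields.items():
        # update_metadata returns whatever was stored under key before.
        previous[key] = refpkg.update_metadata(key, value)
    return previous
```

For example, set_metadata_fields(r, author='Boris the mad baboon') returns {'author': None} on a refpkg that had no author yet.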

Refpkg history, undo, and redo

Each operation performed on a refpkg leaves an entry in a log stored in CONTENTS.json. You can access this log by calling the log method.

Refpkg.log()

Returns the log of this refpkg.

The log is a list of strings, one per operation, from newest to oldest.

Each operation can also be undone (and redone once undone). The undo stack is arbitrarily deep, so all operations back to the previous call to strip (see below) can be undone. To undo an operation, call rollback. To redo it afterwards, call rollforward. The log is updated to stay in sync with the rollback and rollforward of operations. Note that when you perform another operation, all the redo information before that point is removed: you cannot undo an operation, perform another operation, and then redo the first operation.

Refpkg.rollback()

Revert the previous modification to the refpkg.

Refpkg.rollforward()

Restore a reverted modification to the refpkg.
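
One use of rollback is to back out an update that leaves the refpkg in a bad state. The sketch below is one possible pattern, not something Taxtastic provides: the hypothetical update_file_checked inserts a file, runs is_invalid, and reverts the update if the check reports a problem.

```python
def update_file_checked(refpkg, key, new_path):
    """Insert new_path under key, undoing the update if the refpkg breaks.

    Returns the path of the previous file (or None), as update_file does.
    Raises ValueError with the is_invalid message after rolling back.
    Illustrative helper only.
    """
    previous = refpkg.update_file(key, new_path)
    problem = refpkg.is_invalid()
    if problem:
        # Revert the update we just made before reporting the failure.
        refpkg.rollback()
        raise ValueError("update of {!r} reverted: {}".format(key, problem))
    return previous
```

Since the failed update is rolled back before the exception is raised, the caller is left with the refpkg as it was before the call.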

After performing a lot of operations on a refpkg, there will often be a long undo history, and files no longer referred to in the refpkg’s current state. To remove everything not relevant to the refpkg’s current state other than the log, call the strip method.

Refpkg.strip()

Remove rollbacks, rollforwards, and all non-current files.

When distributing a refpkg, you probably want it to be as small as possible. strip removes everything from the refpkg which is not relevant to its current state.

You can force a series of operations to be recorded as a single operation for rollback and have a single log entry by calling start_transaction before them, and commit_transaction with the log entry to record when they are done.

Refpkg.start_transaction()

Begin a transaction to group operations on the refpkg.

All the operations until the next call to commit_transaction will be recorded as a single operation for rollback and rollforward, and recorded with a single line in the log.

Refpkg.commit_transaction(log=None)

Commit a transaction, with log as the log entry.

For example,

from taxtastic.refpkg import Refpkg

r = Refpkg('/path/to/refpkg')

r.start_transaction()
r.update_metadata('author', 'Boris the mad baboon')
r.update_file('boris_signature', '/path/to/some/file')
r.commit_transaction("Left Boris's mark!")

would result in a single operation that could be rolled back as one, and leaves the log entry "Left Boris's mark!".

pplacer specific commands

Finally, the API has three commands which are specific to creating inputs for pplacer. One of these is is_ill_formed, which was described above. The other two are:

Refpkg.reroot(rppr=None, pretend=False)

Reroot the phylogenetic tree.

This operation calls rppr reroot to generate the rerooted tree, so you must have pplacer and its auxiliary tools rppr and guppy installed for it to work. You can specify the path to rppr by giving it as the rppr argument.

If pretend is True, the rerooting is run, but the refpkg is not actually updated.
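
Because reroot depends on an external program, it can be useful to locate rppr and try the operation with pretend=True first. The following is a minimal sketch under that idea: reroot_dry_run is a hypothetical helper, and looking rppr up on the PATH with shutil.which is a choice made here, not something reroot is documented to do.

```python
import shutil


def reroot_dry_run(refpkg, rppr=None):
    """Run reroot with pretend=True, leaving the refpkg unchanged.

    If no rppr path is given, look for an executable named 'rppr' on the
    PATH. Returns the rppr path that was used. Illustrative helper only.
    """
    rppr = rppr or shutil.which("rppr")
    if rppr is None:
        raise FileNotFoundError(
            "rppr not found on PATH; install pplacer or pass rppr explicitly")
    refpkg.reroot(rppr=rppr, pretend=True)
    return rppr
```

If the dry run completes without an exception, the same call with pretend=False performs the actual update.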

Refpkg.update_phylo_model(stats_type, stats_file, frequency_type=None)

Parse a stats log and use it to update phylo_model.

pplacer expects its input to include the details of the phylogenetic model used for creating a tree in JSON format under the key phylo_model, but no program actually outputs that format.

This function takes a log generated by RAxML, FastTree, or PhyML, parses it, and inserts an appropriate JSON file into the refpkg. The first parameter must be ‘RAxML’, ‘PhyML’ or ‘FastTree’, depending on which program generated the log. It may also be None, in which case the function attempts to guess which program generated the log.

Parameters:
  • stats_type – Statistics file type. One of ‘RAxML’, ‘FastTree’, ‘PhyML’

  • stats_file – path to statistics/log file

  • frequency_type – For stats_type == 'PhyML', amino acid alignments only: was the alignment inferred with model or empirical frequencies?
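
The stats_type strings are case-sensitive as listed, so a thin wrapper that rejects anything else before calling update_phylo_model can catch typos early. The sketch below is an assumption-laden convenience, not part of Taxtastic: record_phylo_model and KNOWN_STATS_TYPES are hypothetical names, and the early validation is added here rather than taken from the library.

```python
KNOWN_STATS_TYPES = ("RAxML", "FastTree", "PhyML")


def record_phylo_model(refpkg, stats_file, stats_type=None, frequency_type=None):
    """Call update_phylo_model after checking stats_type.

    stats_type may be None to let update_phylo_model guess the program.
    frequency_type is passed through unchanged; per the parameter list it
    applies to PhyML logs of amino acid alignments. Illustrative only.
    """
    if stats_type is not None and stats_type not in KNOWN_STATS_TYPES:
        raise ValueError(
            "stats_type must be None or one of {}".format(KNOWN_STATS_TYPES))
    refpkg.update_phylo_model(stats_type, stats_file,
                              frequency_type=frequency_type)
```

For a FastTree log this would be called as record_phylo_model(r, '/path/to/some/file', stats_type='FastTree').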