API Documentation

class versioned_hdf5.VersionedHDF5File(f)

A Versioned HDF5 File

This is the main entry point of the library. To use a versioned HDF5 file, pass an h5py file to the constructor. The methods on the resulting object can be used to view and create versions.

Note that versioned HDF5 files have a special structure and should not be modified directly. Also note that once a version is created in the file, it should be treated as read-only. Some protections are in place to prevent accidental modification, but it is not possible in the HDF5 layer to make a dataset or group read-only, so modifications made outside of this library can corrupt the versioned data.

>>> import h5py
>>> f = h5py.File('file.h5', 'a')
>>> from versioned_hdf5 import VersionedHDF5File
>>> file = VersionedHDF5File(f)

Access versions using indexing

>>> version1 = file['version1']

This returns a group containing the datasets for that version.

To create a new version, use stage_version().

>>> with file.stage_version('version2') as group:
...     group['dataset'] = ... # Modify the group
...

When the context manager exits, the version will be written to the file.

Finally, use

>>> file.close()

to close the VersionedHDF5File object (note that the h5py file object must be closed separately).

close()

Make sure the VersionedHDF5File object is no longer reachable.

property current_version

The current version.

The current version is used as the default previous version to stage_version(), and is also used for negative integer version indexing (the current version is self[0]).

property data_version_identifier

Return the data version identifier.

Different versions of versioned-hdf5 handle data slightly differently. This string determines whether a given version of versioned-hdf5 is compatible with the file.

If no data version attribute is found, it is assumed to be 1.

Returns

str

The data version identifier string

get_diff(name, version1, version2)

Compute the difference between two versions of a dataset.

Parameters

name : str

    Name of the dataset

version1 : str

    First version to compare

version2 : str

    Second version to compare

Returns

dict[tuple[slice, …], tuple[np.ndarray, …]]

A dictionary where the keys are slices that changed from version1 to version2, and the values are tuples containing

(data_in_version1, data_in_version2)

rebuild_hashtables()

Delete and rebuild all existing hashtables for the raw datasets.

rebuild_object_dtype_hashtables()

Find all dtype='O' data groups and rebuild their hashtables.

stage_version(version_name, prev_version=None, make_current=True, timestamp=None)

Return a context manager to stage a new version

The context manager returns a group, which should be modified in-place to build the new version. When the context manager exits, the new version will be written into the file.

version_name should be the name for the version.

prev_version should be the previous version which this version is based on. The group returned by the context manager will mirror this previous version. If it is None (the default), the previous version will be the current version. If it is '', there will be no previous version.

If make_current is True (the default), the new version will be set as the current version. The current version is used as the default prev_version for any future stage_version call.

timestamp may be a datetime.datetime or np.datetime64 timestamp for the version. Note that datetime.datetime timestamps must be in the UTC timezone (np.datetime64 timestamps are not timezone aware and are assumed to be UTC). If timestamp is None (the default) the current time when the context manager exits is used. When passing in a manual timestamp, be aware that no consistency checks are made to ensure that version timestamps are linear or not duplicated.

property versions

Return the names of the version groups in the file.

This should return the same as calling versioned_hdf5.versions.all_versions(self.f, include_first=False).

Returns

list[str]

The names of versions in the file; order is arbitrary

versioned_hdf5.delete_version(f, versions_to_delete)

Completely delete the given versions from a file

This function should be used instead of deleting the version group directly, because deleting the group directly does not delete the underlying data that is unique to the version.

versioned_hdf5.delete_versions(f, versions_to_delete)

Completely delete the given versions from a file

This function should be used instead of deleting the version group directly, because deleting the group directly does not delete the underlying data that is unique to the version.

versioned_hdf5.modify_metadata(f, dataset_name, *, chunks=None, dtype=None, fillvalue=None, compression=Default.DEFAULT, compression_opts=Default.DEFAULT, scaleoffset=Default.DEFAULT, shuffle=Default.DEFAULT, fletcher32=Default.DEFAULT)

Modify metadata for a versioned dataset in-place.

The metadata is modified for all versions containing the dataset.

f should be the h5py file or versioned_hdf5 VersionedHDF5File object.

dataset_name is the name of the dataset in the version group(s).

Metadata that may be modified are

  • chunks: must be compatible with the dataset shape

  • dtype: all data in the dataset is cast to the new dtype

  • fillvalue: see the note below

  • Filter settings (see h5py.Group.create_dataset()):

      - compression

      - compression_opts

      - scaleoffset

      - shuffle

      - fletcher32

Any metadata argument that is omitted is left unmodified.

Notes

For fillvalue, all values equal to the old fillvalue are updated to be the new fillvalue, regardless of whether they are explicitly stored or represented sparsely in the underlying HDF5 dataset. Also note that datasets without an explicitly set fillvalue have a default fillvalue equal to the default value of the dtype (e.g., 0. for float dtypes).

For filters, passing a value of None is not the same as omitting the argument. For example, compression=None will decompress a dataset if it was compressed, and compression_opts=None will revert to the default options for the compression plugin, whereas omitting them will retain the previous preferences.