API Documentation


Public API functions

Everything outside of this file is considered internal API and is subject to change.

class versioned_hdf5.api.VersionedHDF5File(f)

A Versioned HDF5 File

This is the main entry-point of the library. To use a versioned HDF5 file, pass an h5py file to the constructor. The methods on the resulting object can be used to view and create versions.

Note that versioned HDF5 files have a special structure and should not be modified directly. Also note that once a version is created in the file, it should be treated as read-only. Some protections are in place to prevent accidental modification, but the HDF5 layer cannot make a dataset or group read-only, so modifications made outside of this library could break the file's versioned structure.

>>> import h5py
>>> f = h5py.File('file.h5', 'a')  # open writable; h5py 3 and later default to read-only
>>> from versioned_hdf5 import VersionedHDF5File
>>> file = VersionedHDF5File(f)

Access versions using indexing

>>> version1 = file['version1']

This returns a group containing the datasets for that version.

To create a new version, use stage_version().

>>> with file.stage_version('version2') as group:
...     group['dataset'] = ... # Modify the group

When the context manager exits, the version will be written to the file.

Finally, use

>>> file.close()

to close the VersionedHDF5File object. Note that the h5py file object should be closed separately.


close()

Make sure the VersionedHDF5File object is no longer reachable.

property current_version

The current version.

The current version is used as the default previous version to stage_version(), and is also used for negative integer version indexing (the current version is self[0]).

stage_version(version_name: str, prev_version=None, make_current=True, timestamp=None)

Return a context manager to stage a new version

The context manager returns a group, which should be modified in-place to build the new version. When the context manager exits, the new version will be written into the file.

version_name should be the name for the version.

prev_version should be the previous version which this version is based on. The group returned by the context manager will mirror this previous version. If it is None (the default), the previous version will be the current version. If it is '', there will be no previous version.

If make_current is True (the default), the new version will be set as the current version. The current version is used as the default prev_version for any future stage_version call.

timestamp may be a datetime.datetime or np.datetime64 timestamp for the version. Note that datetime.datetime timestamps must be in the UTC timezone (np.datetime64 timestamps are not timezone aware and are assumed to be UTC). If timestamp is None (the default), the current time when the context manager exits is used. When passing in a manual timestamp, be aware that no consistency checks are made to ensure that version timestamps are in chronological order or free of duplicates.
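Because datetime.datetime timestamps must be in UTC, a caller holding a timezone-aware value in another zone would need to convert it first. The sketch below shows one way to do that with the standard library; to_utc is a hypothetical helper and is not part of versioned_hdf5.

```python
import datetime

import numpy as np


def to_utc(dt):
    """Hypothetical helper: convert an aware datetime to UTC."""
    if dt.tzinfo is None:
        raise ValueError("naive datetime: attach a timezone before converting")
    return dt.astimezone(datetime.timezone.utc)


# A timestamp recorded in a UTC+02:00 zone.
local = datetime.datetime(
    2020, 1, 1, 12, 0,
    tzinfo=datetime.timezone(datetime.timedelta(hours=2)),
)
utc_stamp = to_utc(local)

# The same instant as a NumPy timestamp, which carries no timezone
# and is assumed to be UTC.
np_stamp = np.datetime64("2020-01-01T10:00:00")

# Either value could then be passed when staging, for example:
#   file.stage_version("version3", timestamp=utc_stamp)
```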

Version Replaying

The functions in this module allow replaying versions in a file in-place, in order to globally modify metadata across all versions that otherwise cannot be changed across versions, such as the dtype of a dataset. This also allows editing data in old versions, and deleting datasets or versions.

versioned_hdf5.replay.delete_version(f, versions_to_delete)

Completely delete the versions listed in versions_to_delete from the versioned file f.

This function should be used instead of deleting the version group directly, because deleting the group directly does not delete the underlying data that is unique to the version.

versioned_hdf5.replay.delete_versions(f, versions_to_delete)

Completely delete the versions listed in versions_to_delete from the versioned file f.

This function should be used instead of deleting the version group directly, because deleting the group directly does not delete the underlying data that is unique to the version.

versioned_hdf5.replay.modify_metadata(f, dataset_name, *, chunks=None, compression=None, compression_opts=None, dtype=None, fillvalue=None)

Modify metadata for a versioned dataset in-place.

The metadata is modified for all versions containing the dataset.

f should be the h5py file or versioned_hdf5 VersionedHDF5File object.

dataset_name is the name of the dataset in the version group(s).

Metadata that may be modified are

  • chunks: must be compatible with the dataset shape

  • compression: see h5py.Group.create_dataset()

  • compression_opts: see h5py.Group.create_dataset()

  • dtype: all data in the dataset is cast to the new dtype

  • fillvalue: see the note below

If set to None (the default), the given metadata is not modified.

Note that for fillvalue, all values equal to the old fillvalue are updated to be the new fillvalue, regardless of whether they are explicitly stored or represented sparsely in the underlying HDF5 dataset. Also note that datasets without an explicitly set fillvalue have a default fillvalue equal to the default value of the dtype (e.g., 0. for float dtypes).

versioned_hdf5.replay.recreate_dataset(f, name, newf, callback=None)

Recreate dataset from all versions into newf

newf should be a versioned hdf5 file/group that is already initialized (it may or may not be in the same physical file as f). Typically newf should be tmp_group(f) (see tmp_group()).

callback should be a function with the signature

callback(dataset, version_name)

It will be called on every dataset in every version. It should return the dataset to be used for the new version. The dataset and its containing group should not be modified in-place. If a new copy of a dataset is to be used, it should be one of the dataset classes in versioned_hdf5.wrappers, and should be placed in a temporary group, which you may delete after recreate_dataset() is done. The callback may also return None, in which case the dataset is deleted for the given version.

Note: this function is only for advanced usage. Typical use-cases should use delete_version() or modify_metadata().

versioned_hdf5.replay.swap(old, new)

Swap every dataset in old with the corresponding one in new

Datasets in old that aren’t in new are ignored.


versioned_hdf5.replay.tmp_group(f)

Create a temporary group in f for use with recreate_dataset().