API Documentation

VersionedHDF5File

Public API functions

Everything outside of this file is considered internal API and is subject to change.

class versioned_hdf5.api.VersionedHDF5File(f)

A Versioned HDF5 File

This is the main entry-point of the library. To use a versioned HDF5 file, pass an h5py file to the constructor. The methods on the resulting object can be used to view and create versions.

Note that versioned HDF5 files have a special structure and should not be modified directly. Also note that once a version is created in the file, it should be treated as read-only. Some protections are in place to prevent accidental modification, but it is not possible in the HDF5 layer to make a dataset or group read-only, so modifications made outside of this library could result in breaking things.

>>> import h5py
>>> f = h5py.File('file.h5', 'w')
>>> from versioned_hdf5 import VersionedHDF5File
>>> file = VersionedHDF5File(f) 

Access versions using indexing

>>> version1 = file['version1'] 

This returns a group containing the datasets for that version.

To create a new version, use stage_version().

>>> with file.stage_version('version2') as group: 
...     group['dataset'] = ... # Modify the group
...

When the context manager exits, the version will be written to the file.

Finally, use

>>> file.close() 

to close the VersionedHDF5File object (note that the h5py file object should be closed separately).

close()

Close the VersionedHDF5File object so that it can no longer be used; the underlying h5py file is not closed.

property current_version

The current version.

The current version is used as the default previous version to stage_version(), and is also used for negative integer version indexing (the current version is self[0]).
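
Continuing the example above (where 'version2' was staged after 'version1'), a short sketch of how the current version interacts with integer indexing:

>>> file.current_version
'version2'
>>> current = file[0]    # the current version, same group as file['version2']
>>> previous = file[-1]  # the version before the current one, here 'version1'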

property data_version_identifier: str

Return the data version identifier.

Different versions of versioned-hdf5 handle data slightly differently. This identifier is used to determine whether the installed version of versioned-hdf5 is compatible with the given file.

If no data version attribute is found, it is assumed to be 1.

Returns

str

The data version identifier string

get_diff(name: str, version1: str, version2: str) → Dict[Tuple[slice], Tuple[ndarray]]

Compute the difference between two versions of a dataset.

Parameters

name : str

Name of the dataset

version1 : str

First version to compare

version2 : str

Second version to compare

Returns

Dict[Tuple[slice], Tuple[np.ndarray]]

A dictionary where the keys are the slices that changed from version1 to version2, and the values are tuples containing

(data_in_version1, data_in_version2)
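
As a hedged sketch, using the dataset and version names from the class example above (the actual keys and values depend on what changed between the two versions):

>>> diff = file.get_diff('dataset', 'version1', 'version2')
>>> for changed_slices, (old_data, new_data) in diff.items():
...     print(changed_slices, old_data, new_data)
...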

rebuild_hashtables()

Delete and rebuild all existing hashtables for the raw datasets.

rebuild_object_dtype_hashtables()

Find all dtype='O' data groups and rebuild their hashtables.
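
Both rebuild methods take no arguments and operate on the open file, for example:

>>> file.rebuild_hashtables()                # rebuild hashtables for all raw datasets
>>> file.rebuild_object_dtype_hashtables()   # rebuild only the dtype='O' hashtables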

stage_version(version_name: str, prev_version=None, make_current=True, timestamp=None)

Return a context manager to stage a new version

The context manager returns a group, which should be modified in-place to build the new version. When the context manager exits, the new version will be written into the file.

version_name should be the name for the version.

prev_version should be the previous version which this version is based on. The group returned by the context manager will mirror this previous version. If it is None (the default), the previous version will be the current version. If it is '', there will be no previous version.

If make_current is True (the default), the new version will be set as the current version. The current version is used as the default prev_version for any future stage_version call.

timestamp may be a datetime.datetime or np.datetime64 timestamp for the version. Note that datetime.datetime timestamps must be in the UTC timezone (np.datetime64 timestamps are not timezone aware and are assumed to be UTC). If timestamp is None (the default) the current time when the context manager exits is used. When passing in a manual timestamp, be aware that no consistency checks are made to ensure that version timestamps are linear or not duplicated.
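
A sketch combining these parameters (the dataset contents, version names, and UTC timestamp below are illustrative):

>>> import numpy as np
>>> from datetime import datetime, timezone
>>> ts = datetime(2024, 1, 1, tzinfo=timezone.utc)          # datetime timestamps must be UTC
>>> with file.stage_version('version3', prev_version='version1',
...                         make_current=False, timestamp=ts) as group:
...     group['new_dataset'] = np.arange(10)                # add data on top of version1
...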

property versions: List[str]

Return the names of the version groups in the file.

This should return the same as calling versioned_hdf5.versions.all_versions(self.f, include_first=False).

Returns

List[str]

The names of versions in the file; order is arbitrary
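
For example, assuming only the two versions from the class example above have been staged (the property returns them in arbitrary order, so they are sorted here):

>>> sorted(file.versions)
['version1', 'version2']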

Version Replaying

The functions in this module allow replaying the versions in a file in-place, in order to globally modify metadata that otherwise cannot be changed across versions, such as the dtype of a dataset. This also allows editing data in old versions, and deleting datasets or versions.

versioned_hdf5.replay.delete_version(f: VersionedHDF5File | h5py.File, versions_to_delete: str | Iterable[str])

Completely delete the given versions from a file

This function should be used instead of deleting the version group directly, since doing so would not delete the underlying data that is unique to the version.

versioned_hdf5.replay.delete_versions(f: VersionedHDF5File | h5py.File, versions_to_delete: str | Iterable[str])

Completely delete the given versions from a file

This function should be used instead of deleting the version group directly, since doing so would not delete the underlying data that is unique to the version.
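
A minimal sketch, assuming f is an open writable h5py file (or a VersionedHDF5File) and the version name is illustrative:

>>> from versioned_hdf5.replay import delete_versions
>>> delete_versions(f, ['version1'])   # also deletes raw data unique to this version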

versioned_hdf5.replay.modify_metadata(f, dataset_name, *, chunks=None, compression=None, compression_opts=None, dtype=None, fillvalue=None)

Modify metadata for a versioned dataset in-place.

The metadata is modified for all versions containing the dataset.

f should be the h5py file or versioned_hdf5 VersionedHDF5File object.

dataset_name is the name of the dataset in the version group(s).

The following metadata may be modified:

  • chunks: must be compatible with the dataset shape

  • compression: see h5py.Group.create_dataset()

  • compression_opts: see h5py.Group.create_dataset()

  • dtype: all data in the dataset is cast to the new dtype

  • fillvalue: see the note below

Any metadata parameter left as None (the default) is not modified.

Note that for fillvalue, all values equal to the old fillvalue are updated to the new fillvalue, regardless of whether they are explicitly stored or represented sparsely in the underlying HDF5 dataset. Also note that datasets without an explicitly set fillvalue have a default fillvalue equal to the default value of the dtype (e.g., 0.0 for float dtypes).
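
A minimal sketch, assuming a versioned dataset named 'dataset' exists in the file (the new dtype and fillvalue are illustrative):

>>> import numpy as np
>>> from versioned_hdf5.replay import modify_metadata
>>> modify_metadata(f, 'dataset', dtype=np.float64, fillvalue=-1.0)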

versioned_hdf5.replay.recreate_dataset(f, name, newf, callback=None)

Recreate dataset from all versions into newf

newf should be a versioned hdf5 file/group that is already initialized (it may or may not be in the same physical file as f). Typically newf should be tmp_group(f) (see tmp_group()).

callback should be a function with the signature

callback(dataset, version_name)

It will be called on every dataset in every version. It should return the dataset to be used for the new version. The dataset and its containing group should not be modified in-place. If a new copy of a dataset is to be used, it should be one of the dataset classes in versioned_hdf5.wrappers, and should be placed in a temporary group, which you may delete after recreate_dataset() is done. The callback may also return None, in which case the dataset is deleted for the given version.

Note: this function is only for advanced usage. Typical use-cases should use delete_version() or modify_metadata().
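
A heavily simplified sketch: a callback that drops a hypothetical dataset named 'dataset' from one particular version and leaves it untouched everywhere else. The temporary group comes from tmp_group() described below; swapping the recreated data back into the file and deleting the temporary group is left to the caller (see swap() and tmp_group() below). For typical needs, prefer delete_versions() or modify_metadata().

>>> from versioned_hdf5.replay import recreate_dataset, tmp_group
>>> def callback(dataset, version_name):
...     if version_name == 'version1':   # illustrative version name
...         return None                  # delete the dataset in this version
...     return dataset                   # keep the dataset unchanged elsewhere
...
>>> newf = tmp_group(f)
>>> recreate_dataset(f, 'dataset', newf, callback=callback)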

versioned_hdf5.replay.swap(old, new)

Swap every dataset in old with the corresponding one in new

Datasets in old that aren’t in new are ignored.

versioned_hdf5.replay.tmp_group(f)

Create a temporary group in f for use with recreate_dataset().