Quickstart Guide

Let’s say you have an HDF5 file with contents that might change over time. You may add or remove datasets or change the data or the metadata, and you would like to keep a record of which changes occurred when, along with a way to recover previous versions of the file. Versioned HDF5 allows you to do that by building a versioning API on top of h5py.

First, open an .h5 file in write mode to create an h5py File object:

>>> import h5py
>>> fileobject = h5py.File('filename.h5', 'w')

Now, you can use the VersionedHDF5File constructor on this file object to create a versioned HDF5 file object:

>>> from versioned_hdf5 import VersionedHDF5File
>>> versioned_file = VersionedHDF5File(fileobject)

You can see that this versioned_file object has the following attributes:

  • f: the original h5py File Object;

  • current_version: at this point, it should return __first_version__, as we haven’t created any additional versions.

To create a new version, use the stage_version function as a context manager. For example:

>>> import numpy as np
>>> with versioned_file.stage_version('version2') as group:
...     group['mydataset'] = np.ones(10000)

The context manager returns an h5py group object, which should be modified in-place to build the new version. When the context manager exits, the version is written to the file. This has two effects. First, the h5py file object fileobject now has metadata associated with versions:

>>> fileobject.keys()
<KeysViewHDF5 ['_version_data']>

All the data from the versioned HDF5 file is stored in the _version_data group on the file, but this group should not be accessed directly: any interaction with the versioning should happen through the API. Second, versioned_file can now be used to expose versioned data by version name:

>>> v2 = versioned_file['version2']
>>> v2
<Committed InMemoryGroup "/_version_data/versions/version2">
>>> v2['mydataset']
<InMemoryArrayDataset "mydataset": shape (10000,), type "<f8">

To access the actual data stored in version version2, we use the same syntax as h5py:

>>> dataset = v2['mydataset']
>>> dataset[()]
array([1., 1., 1., ..., 1., 1., 1.])

Note

Versioned HDF5 files have a special structure and should not be modified directly. Also note that once a version is created in the file, it should be treated as read-only. Some protections are in place to prevent accidental modification, but it is not possible in the HDF5 layer to make a dataset or group read-only, so modifications made outside of this library could corrupt the versioned data.

When you are done manipulating data, both the h5py and VersionedHDF5File objects must be closed to make sure the HDF5 file is written properly to disk (including data about versions). This can be achieved with:

>>> fileobject.close()
>>> versioned_file.close()

Other Options

When a version is committed to a VersionedHDF5File, a timestamp is automatically added to it. The timestamp for each version can be retrieved via the version’s attrs:

>>> versioned_file['version2'].attrs['timestamp']

Since the HDF5 specification does not currently support writing datetime.datetime or numpy.datetime64 objects to HDF5 files, these timestamps are stored as strings, using the following format:

"%Y-%m-%d %H:%M:%S.%f%z"

The timestamps are registered in UTC. For more details on the format string above, see the datetime.datetime.strftime function documentation.
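To see what a timestamp string in this format looks like, the format can be tried with the standard library alone. The sketch below does not use Versioned HDF5 at all; the name TIMESTAMP_FMT is a local variable chosen for this example.

```python
import datetime

# Local name for the format string quoted above.
TIMESTAMP_FMT = "%Y-%m-%d %H:%M:%S.%f%z"

ts = datetime.datetime(
    2020, 6, 29, 23, 58, 21, 116470, tzinfo=datetime.timezone.utc
)

# datetime -> string, in the form the timestamps are stored.
as_text = ts.strftime(TIMESTAMP_FMT)
print(as_text)  # 2020-06-29 23:58:21.116470+0000

# string -> timezone-aware datetime; the round trip loses nothing.
parsed = datetime.datetime.strptime(as_text, TIMESTAMP_FMT)
print(parsed == ts)  # True
```

The same strptime call can turn a timestamp string read from a version's attrs back into a timezone-aware datetime.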

The timestamp can also be used as an index to retrieve a chosen version from the file. In this case, either a datetime.datetime or a numpy.datetime64 object must be used as a key. For example, if

>>> import datetime
>>> t = datetime.datetime.now(datetime.timezone.utc)

then using

>>> versioned_file[t]

returns the version with timestamp equal to t (converted to a string according to the format mentioned above).

It is also possible to assign a timestamp manually to a version. Again, this requires using either a datetime.datetime or a numpy.datetime64 object as the timestamp:

>>> ts = datetime.datetime(2020, 6, 29, 23, 58, 21, 116470, tzinfo=datetime.timezone.utc)
>>> with versioned_file.stage_version('version1', timestamp=ts) as group:
...     group['mydataset'] = data

Now:

>>> versioned_file[ts]

returns the same as versioned_file['version1'].