Quickstart Guide
================

Let's say you have an HDF5 file whose contents might change over time. You may
add or remove datasets, or change the data or the metadata, and you would like
to keep a record of which changes occurred when, along with a way to recover
previous versions of the file. Versioned HDF5 allows you to do that by building
a versioning API on top of h5py.

First, open an ``.h5`` file and create an h5py File object in write mode::

    >>> import h5py
    >>> fileobject = h5py.File('filename.h5', 'w')

Now, you can use the :any:`VersionedHDF5File` constructor on this file object
to create a versioned HDF5 file object::

    >>> from versioned_hdf5 import VersionedHDF5File
    >>> versioned_file = VersionedHDF5File(fileobject)

This ``versioned_file`` object has the following attributes:

- ``f``: the original ``h5py`` File object;
- ``current_version``: at this point, it should return ``__first_version__``,
  as we haven't created any additional versions.

To create a new version, use the :any:`stage_version` function. For example,
if we do

.. code::

    >>> import numpy as np
    >>> with versioned_file.stage_version('version2') as group:
    ...     group['mydataset'] = np.ones(10000)

The context manager returns an h5py *group* object, which should be modified
in place to build the new version. When the context manager exits, the version
is written to the file. This has two effects. First, the h5py file object
``fileobject`` now has metadata associated with versions::

    >>> fileobject.keys()

All the data from the versioned HDF5 file is stored in the ``_version_data``
group on the file, but this group should not be accessed directly: any
interaction with the versioning should happen through the API.

Second, ``versioned_file`` can now be used to expose versioned data by version
name::

    >>> v2 = versioned_file['version2']
    >>> v2
    >>> v2['mydataset']

To access the actual data stored in version ``version2``, we use the same
syntax as ``h5py``::

    >>> dataset = v2['mydataset']
    >>> dataset[()]
    array([1., 1., 1., ..., 1., 1., 1.])

.. note::

   Versioned HDF5 files have a special structure and should not be modified
   directly. Also note that once a version is created in the file, it should
   be treated as read-only. Some protections are in place to prevent
   accidental modification, but it is not possible in the HDF5 layer to make a
   dataset or group read-only, so modifications made outside of this library
   can break the versioned file.

When you are done manipulating data, both the ``h5py`` and
``VersionedHDF5File`` objects must be closed to make sure the HDF5 file is
written properly to disk (including data about versions). This can be achieved
by

.. code::

    >>> fileobject.close()
    >>> versioned_file.close()

Other Options
-------------

When a version is committed to a VersionedHDF5File, a timestamp is
automatically added to it. The timestamp for each version can be retrieved via
the version's ``attrs``::

    >>> versioned_file['version2'].attrs['timestamp']

Since the HDF5 specification does not currently support writing
``datetime.datetime`` or ``numpy.datetime64`` objects to HDF5 files, these
timestamps are stored as strings, using the following format::

    "%Y-%m-%d %H:%M:%S.%f%z"

The timestamps are recorded in UTC. For more details on the format string
above, see the ``datetime.datetime.strftime`` function documentation.

The timestamp can also be used as an index to retrieve a chosen version from
the file. In this case, either a ``datetime.datetime`` or a
``numpy.datetime64`` object must be used as the key. For example, if

.. code::

    >>> import datetime
    >>> t = datetime.datetime.now(datetime.timezone.utc)

then using

.. code::

    >>> versioned_file[t]

returns the version with timestamp equal to ``t`` (converted to a string
according to the format mentioned above).

It is also possible to assign a timestamp manually to a version. Again, this
requires using either a ``datetime.datetime`` or a ``numpy.datetime64`` object
as the timestamp::

    >>> ts = datetime.datetime(2020, 6, 29, 23, 58, 21, 116470, tzinfo=datetime.timezone.utc)
    >>> with versioned_file.stage_version('version1', timestamp=ts) as group:
    ...     group['mydataset'] = data

Now::

    >>> versioned_file[ts]

returns the same as ``versioned_file['version1']``.
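
To see what a string in the timestamp format quoted earlier in this section looks like, the following minimal sketch round-trips the example timestamp through that format using only the standard library. It illustrates the format string only. It does not show how Versioned HDF5 performs the conversion internally, and the names ``TIMESTAMP_FMT``, ``as_string`` and ``parsed`` are illustrative, not part of the library.

```python
import datetime

# Format string quoted in this section (the constant name is illustrative).
TIMESTAMP_FMT = "%Y-%m-%d %H:%M:%S.%f%z"

# The timezone-aware UTC timestamp used in the manual-timestamp example.
ts = datetime.datetime(2020, 6, 29, 23, 58, 21, 116470,
                       tzinfo=datetime.timezone.utc)

# Serialize the datetime to a string in the documented format.
as_string = ts.strftime(TIMESTAMP_FMT)
print(as_string)  # 2020-06-29 23:58:21.116470+0000

# Parse the string back; %z restores the UTC offset, so the result is
# again timezone-aware and compares equal to the original.
parsed = datetime.datetime.strptime(as_string, TIMESTAMP_FMT)
print(parsed == ts)  # True
```

Note that a naive ``datetime.datetime`` (one without ``tzinfo``) formats ``%z`` as an empty string, so the sketch uses a timezone-aware UTC value, matching the examples in this section.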