API Documentation
VersionedHDF5File
Public API functions
Everything outside of this file is considered internal API and is subject to change.
- class versioned_hdf5.api.VersionedHDF5File(f)
A Versioned HDF5 File
This is the main entry-point of the library. To use a versioned HDF5 file, pass an h5py file to the constructor. The methods on the resulting object can be used to view and create versions.
Note that versioned HDF5 files have a special structure and should not be modified directly. Also note that once a version is created in the file, it should be treated as read-only. Some protections are in place to prevent accidental modification, but it is not possible in the HDF5 layer to make a dataset or group read-only, so modifications made outside of this library could result in breaking things.
>>> import h5py
>>> f = h5py.File('file.h5')
>>> from versioned_hdf5 import VersionedHDF5File
>>> file = VersionedHDF5File(f)
Access versions using indexing
>>> version1 = file['version1']
This returns a group containing the datasets for that version.
To create a new version, use stage_version().
>>> with file.stage_version('version2') as group:
...     group['dataset'] = ... # Modify the group
...
When the context manager exits, the version will be written to the file.
Finally, use
>>> file.close()
to close the VersionedHDF5File object (note that the h5py file object should be closed separately).
- close()
Make sure the VersionedHDF5File object is no longer reachable.
- property current_version
The current version.
The current version is used as the default previous version to stage_version(), and is also used for negative integer version indexing (the current version is self[0]).
- property data_version_identifier: str
Return the data version identifier.
Different versions of versioned-hdf5 handle data slightly differently. This string affects whether the version of versioned-hdf5 is compatible with the given file.
If no data version attribute is found, it is assumed to be 1.
Returns
- str
The data version identifier string
- get_diff(name: str, version1: str, version2: str) -> Dict[Tuple[slice], Tuple[ndarray]]
Compute the difference between two versions of a dataset.
Parameters
- name : str
Name of the dataset
- version1 : str
First version to compare
- version2 : str
Second version to compare
Returns
- Dict[Tuple[slice], Tuple[np.ndarray]]
A dictionary where the keys are slices that changed from version1 to version2, and the values are tuples containing (data_in_version1, data_in_version2)
- rebuild_hashtables()
Delete and rebuild all existing hashtables for the raw datasets.
- rebuild_object_dtype_hashtables()
Find all dtype='O' data groups and rebuild their hashtables.
- stage_version(version_name: str, prev_version=None, make_current=True, timestamp=None)
Return a context manager to stage a new version
The context manager returns a group, which should be modified in-place to build the new version. When the context manager exits, the new version will be written into the file.
version_name should be the name for the version.
prev_version should be the previous version which this version is based on. The group returned by the context manager will mirror this previous version. If it is None (the default), the previous version will be the current version. If it is '', there will be no previous version.
If make_current is True (the default), the new version will be set as the current version. The current version is used as the default prev_version for any future stage_version call.
timestamp may be a datetime.datetime or np.datetime64 timestamp for the version. Note that datetime.datetime timestamps must be in the UTC timezone (np.datetime64 timestamps are not timezone aware and are assumed to be UTC). If timestamp is None (the default), the current time when the context manager exits is used. When passing in a manual timestamp, be aware that no consistency checks are made to ensure that version timestamps are linear or not duplicated.
Version Replaying
The functions in this module allow replaying versions in a file in-place, in order to globally modify metadata across all versions that otherwise cannot be changed across versions, such as the dtype of a dataset. This also allows editing data in old versions, and deleting datasets or versions.
- versioned_hdf5.replay.delete_version(f: VersionedHDF5File | h5py.File, versions_to_delete: str | Iterable[str])
Completely delete the given versions from a file
This function should be used instead of deleting the version group directly, because deleting the group directly will not delete the underlying data that is unique to the version.
- versioned_hdf5.replay.delete_versions(f: VersionedHDF5File | h5py.File, versions_to_delete: str | Iterable[str])
Completely delete the given versions from a file
This function should be used instead of deleting the version group directly, because deleting the group directly will not delete the underlying data that is unique to the version.
- versioned_hdf5.replay.modify_metadata(f, dataset_name, *, chunks=None, compression=None, compression_opts=None, dtype=None, fillvalue=None)
Modify metadata for a versioned dataset in-place.
The metadata is modified for all versions containing a dataset.
f should be the h5py file or versioned_hdf5 VersionedHDF5File object.
dataset_name is the name of the dataset in the version group(s).
Metadata that may be modified are
- chunks: must be compatible with the dataset shape
- compression: see h5py.Group.create_dataset()
- compression_opts: see h5py.Group.create_dataset()
- dtype: all data in the dataset is cast to the new dtype
- fillvalue: see the note below
If set to None (the default), the given metadata is not modified.
Note for fillvalue, all values equal to the old fillvalue are updated to be the new fillvalue, regardless of whether they are explicitly stored or represented sparsely in the underlying HDF5 dataset. Also note that datasets without an explicitly set fillvalue have a default fillvalue equal to the default value of the dtype (e.g., 0. for float dtypes).
- versioned_hdf5.replay.recreate_dataset(f, name, newf, callback=None)
Recreate dataset from all versions into newf
newf should be a versioned hdf5 file/group that is already initialized (it may or may not be in the same physical file as f). Typically newf should be tmp_group(f) (see tmp_group()).
callback should be a function with the signature
callback(dataset, version_name)
It will be called on every dataset in every version. It should return the dataset to be used for the new version. The dataset and its containing group should not be modified in-place. If a new copy of a dataset is to be used, it should be one of the dataset classes in versioned_hdf5.wrappers, and should be placed in a temporary group, which you may delete after recreate_dataset() is done. The callback may also return None, in which case the dataset is deleted for the given version.
Note: this function is only for advanced usage. Typical use-cases should use delete_version() or modify_metadata().
- versioned_hdf5.replay.swap(old, new)
Swap every dataset in old with the corresponding one in new
Datasets in old that aren’t in new are ignored.
- versioned_hdf5.replay.tmp_group(f)
Create a temporary group in f for use with recreate_dataset().