# Design

Versioned-hdf5 is built as a wrapper on top of h5py. The basic idea behind the design is that versioned-hdf5 is a [copy-on-write](https://en.wikipedia.org/wiki/Copy-on-write) system, inspired by git as well as modern filesystems such as APFS and Btrfs. Copy-on-write is a good fit whenever data should be completely immutable. In a copy-on-write system, any modification to a piece of data produces a new copy of the data, leaving the original intact. Any references to the original will continue to point to it.

This is implemented using two key HDF5 primitives: chunks and virtual datasets.

In HDF5, datasets are split into multiple chunks. Each chunk is of equal size, which is configurable, although some chunks may not be completely full. A chunk is the smallest part of a dataset that HDF5 operates on. Whenever a subset of a dataset is read, every chunk that contains part of that subset is read into memory in full. Picking an optimal chunk size is a nontrivial task, and depends on things such as the size of your L1 cache and the typical shape of your dataset. Furthermore, in versioned-hdf5 a chunk is the smallest amount of data that is stored only once across versions if it has not changed. If the chunk size is too small, performance suffers, because operations must read and write more chunks. If it is too large, the versioned file becomes unnecessarily large, because changing even a single element of a chunk requires rewriting the entire chunk. Versioned-hdf5 does not presently contain any logic for automatically picking a chunk size. The [pytables documentation](https://www.pytables.org/usersguide/optimization.html) has some tips on picking an optimal chunk size.

[Virtual datasets](http://docs.h5py.org/en/stable/vds.html) are a special kind of dataset that references data from other datasets. The data from each part of a virtual dataset comes from another dataset.
HDF5 does this seamlessly, so that a virtual dataset appears to be a normal dataset.

The basic design of versioned-hdf5 is this: whenever a dataset is created for the first time (the first version containing the dataset), it is split into chunks. The data in each chunk is hashed and stored in a hash table. The unique chunks are then appended to a `raw_data` dataset corresponding to the dataset. Finally, a virtual dataset is made that references the corresponding chunks in the raw dataset to recreate the original dataset. When later versions modify this dataset, each modified chunk is appended to the raw dataset, and a new virtual dataset is created pointing to the corresponding chunks.

For example, say we start with the first version, `version_1`, and create a dataset `my_dataset` with `n` chunks. The dataset chunks will be written into the raw dataset, and the final virtual dataset will point to those chunks.

```{graphviz}
digraph g {
  graph [
    rankdir = "LR"
  ];
  node [
    fontsize = "16"
  ];
  edge [
  ];
  "dataset (version_1)" [
    label = "my_dataset (version_1)|<f0>CHUNK 0|<f1>CHUNK 1|<fdot>...|<fn>CHUNK n"
    shape = "record"
  ];
  "raw_data" [
    label = "raw_data|<f0>CHUNK 0|<f1>CHUNK 1|<fdot>...|<fn>CHUNK n"
    shape = "record"
  ];
  "dataset (version_1)":f0 -> "raw_data":f0 [];
  "dataset (version_1)":f1 -> "raw_data":f1 [];
  "dataset (version_1)":fdot -> "raw_data":fdot [];
  "dataset (version_1)":fn -> "raw_data":fn [];
}
```

If we then create a version `version_2` based off `version_1`, and modify only data contained in CHUNK 1, that new data will be appended to the raw dataset, and the resulting virtual dataset for `version_2` will look like this:

```{graphviz}
digraph g {
  graph [
    rankdir = "LR"
  ];
  node [
    fontsize = "16"
  ];
  edge [
  ];
  "dataset (version_1)" [
    label = "my_dataset (version_1)|<f0>CHUNK 0|<f1>CHUNK 1|<fdot>...|<fn>CHUNK n"
    shape = "record"
  ];
  "dataset (version_2)" [
    label = "my_dataset (version_2)|<f0>CHUNK 0|<f1>CHUNK 1|<fdot>...|<fn>CHUNK n"
    shape = "record"
  ];
  "raw_data" [
    label = "raw_data|<f0>CHUNK 0|<f1>CHUNK 1|<fdot>...|<fn>CHUNK n|<fn1>CHUNK n+1"
    shape = "record"
  ];
  "dataset (version_1)":f0 -> "raw_data":f0 [];
  "dataset (version_1)":f1 -> "raw_data":f1 [];
  "dataset (version_1)":fdot -> "raw_data":fdot [];
  "dataset (version_1)":fn -> "raw_data":fn [];
  "raw_data":f0 -> "dataset (version_2)":f0 [dir=back];
  "dataset (version_2)":fdot -> "raw_data":fdot [];
  "dataset (version_2)":f1 -> "raw_data":fn1 [];
  "dataset (version_2)":fn -> "raw_data":fn [];
}
```

Since both versions 1 and 2 of `my_dataset` have identical data in chunks other than CHUNK 1, they both point to the exact same data in `raw_data`. Thus, the underlying HDF5 file only stores the data in version 1 of `my_dataset` once, and only the modified chunks from `version_2`'s `my_dataset` are stored on top of that.

All extra metadata, such as attributes, is stored on the virtual dataset. Since virtual datasets act exactly like real datasets and operate at the HDF5 level, each version is a real group in the HDF5 file that is exactly that version. However, these groups should be treated as read-only, and you should never access them outside of the versioned-hdf5 API (see below).

## HDF5 File Layout

Inside of the HDF5 file, there is a special `_versioned_data` group that holds all the internal data for versioned-hdf5. This group contains a `versions` group, which contains groups for each version that has been created. It also contains a group for each dataset that exists in a version. These groups each contain two datasets: `hash_table` and `raw_data`.

For example, consider a versioned-hdf5 file that contains two versions, `version1` and `version2`, with datasets `data1` and `data2`. Suppose also that `data1` exists in both versions and `data2` only exists in `version2`.
The HDF5 layout would look like this:

```
/_versioned_data/
├── data1/
│   ├── hash_table
│   └── raw_data
│
├── data2/
│   ├── hash_table
│   └── raw_data
│
└── versions/
    ├── __first_version__/
    │
    ├── version1/
    │   └── data1
    │
    └── version2/
        ├── data1
        └── data2
```

`__first_version__` is an empty group that exists only for internal bookkeeping purposes (see below).

## Submodule Organization

The versioned-hdf5 code is split into four layers: the backend, the versions, the h5py wrappers, and the top-level API.

### Backend

The backend layer is the bottommost layer. It is the only layer that does actual dataset writes to HDF5. It deals with splitting the versioned dataset into chunks and creating the virtual datasets that make up the version groups. The relevant modules are `versioned_hdf5.backend` and `versioned_hdf5.hashtable`.

`versioned_hdf5.backend.write_dataset()` takes a dataset (or array) and writes it to the raw data for the given dataset. The data in each chunk of the dataset is SHA256 hashed, and the hash is looked up in the hashtable dataset. If it already exists in the raw data, that chunk in the raw data is reused. The hashtable maps `SHA256 hash -> (start, stop)` where `(start, stop)` gives a slice range for the chunk in the raw dataset (chunks in the `raw_data` dataset are concatenated along the first axis only). All chunks that do not exist in the hashtable already are appended to the raw dataset and added to the hashtable.

`versioned_hdf5.backend.write_dataset_chunks()` works similarly, except instead of taking a dataset as input, it takes a dictionary of chunks. This allows the higher levels of the API to only pass in the chunks of an existing dataset that have been modified (see below).

`versioned_hdf5.backend.create_virtual_dataset()` creates a virtual dataset in the version group pointing to corresponding chunks in the raw dataset.
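
The hashing and deduplication described above for `write_dataset()` can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's code: `write_chunks`, `CHUNK_SIZE`, the `bytearray` standing in for `raw_data`, and the `dict` standing in for the hashtable are all hypothetical names, and the data is one-dimensional with a length that is a multiple of the chunk size.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical chunk size, in elements (bytes here)


def write_chunks(data, raw_data, hashtable, chunk_size=CHUNK_SIZE):
    """Append unseen chunks of `data` to `raw_data`.

    Returns one (start, stop) range into `raw_data` per chunk of `data`.
    """
    slices = []
    for i in range(0, len(data), chunk_size):
        chunk = bytes(data[i:i + chunk_size])
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in hashtable:
            # New content: append it and remember where it lives.
            start = len(raw_data)
            raw_data.extend(chunk)
            hashtable[digest] = (start, start + len(chunk))
        # Known content is reused without writing anything.
        slices.append(hashtable[digest])
    return slices


raw_data = bytearray()  # stands in for the `raw_data` dataset
hashtable = {}          # stands in for the `hash_table` dataset

version_1 = write_chunks(bytes([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]), raw_data, hashtable)
version_2 = write_chunks(bytes([1, 1, 1, 1, 9, 9, 9, 9, 3, 3, 3, 3]), raw_data, hashtable)

print(version_1)  # [(0, 4), (4, 8), (8, 12)]
print(version_2)  # [(0, 4), (12, 16), (8, 12)]
```

In this sketch the returned `(start, stop)` lists play the role of the virtual dataset: `version_2` reuses two of the chunks stored for `version_1`, and only the changed chunk makes `raw_data` grow.
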
`versioned_hdf5.backend` also has various functions for initializing a dataset the first time it is created in a version.

`versioned_hdf5.hashtable` contains a `Hashtable` object that wraps the hashtable dataset in HDF5 as a dict-like object.

### Versions

Each version is stored as a subgroup of the `_versioned_data/versions/` group. The group contains attributes that reference the previous version, as well as metadata like the timestamp when the version was created. Consequently, the versions form a DAG. However, the reference to the previous version is only used by the top-level API that allows traversing versions. Each version group is self-contained, containing only virtual datasets that point only to the respective raw datasets. Versioned-hdf5 also keeps track of the "current version", which is used only to allow the previous version to be omitted when creating a new version (this information is stored on the attributes of the `_versioned_data/versions` group). If a version does not have a previous version, its previous version is the special empty `__first_version__` version group.

`versioned_hdf5.versions` contains functions to create a version group, commit a version, and access and manipulate versions. The main function here is `versioned_hdf5.versions.commit_version()`, which is called with all the datasets that should be committed to the new version when the `VersionedHDF5File.stage_version()` context manager exits.

### h5py Wrappers

One minor issue with the copy-on-write idea is that HDF5 does not have a native way to make virtual datasets read-only. If you modify a virtual dataset, it will also modify the dataset that it points to. In our design, this would modify all other versions of a dataset pointing to the same raw data chunks. Hence, versioned-hdf5 provides wrappers to the various h5py objects that implement the proper copy-on-write semantics. Versioned HDF5 files should only be interacted with via the versioned-hdf5 library.
Writing to a versioned dataset directly with h5py or another HDF5 wrapper library may lead to data corruption, as common data is shared between versions.

The objects for this layer all live in `versioned_hdf5.wrappers`. The primary objects are

- `InMemoryGroup`: This is the object returned by the `VersionedHDF5File.stage_version()` context manager. It acts like an `h5py.Group` object, but all data is stored in memory. This is done efficiently so that only data that is modified is actually read in from the file. This object is also used for any subgroups of the version group. The primary purpose of this object is to keep track of what has been modified while a version is being staged. Once the `stage_version()` context manager exits, this object is passed to `commit_version()` (see above), which extracts the relevant information about what datasets exist in the new version and how they relate to previous versions, if there are any.

- `InMemoryArrayDataset`: This object acts like an `h5py.Dataset`, but wraps a NumPy array in memory. This object is used whenever a dataset is created for the first time.

- `InMemoryDataset`: This object acts like an `h5py.Dataset`. It is used whenever a dataset in a version already exists from a previous version. This object stores only those chunks of the dataset in memory that are actually read in or modified. This is not only more memory efficient, but it also allows passing only the modified chunks as arrays to the backend. The remaining chunks will then automatically point to the chunks in the raw data that they pointed to in the previous version, without needing to re-hash the data.

One challenge with this design is that `InMemoryDataset` represents a single dataset that is broken up into chunks, which live in the raw dataset and may not be contiguous. The [ndindex](https://quansight.github.io/ndindex/index.html) library is used to manage translation of indices on the dataset to and from the chunked data.
ndindex is also used throughout versioned-hdf5 to store and manipulate slice and other index objects, as it is more convenient than using the raw index types. For example, in the backend, we need to store slices in a dictionary. The built-in Python `slice` object is not hashable (before Python 3.12), which makes this awkward to do. The ndindex index objects are all hashable. The ndindex library was initially created for versioned-hdf5, both to make index manipulation possible and to make code that passes indices around much cleaner.

These wrapper objects all try to emulate the h5py API as closely as possible, so that the user can use them just as they would the real h5py objects. Any discrepancy between h5py and versioned-hdf5 semantics should be considered a bug in versioned-hdf5.

### Top-level API

The top-level API consists of one object, {any}`VersionedHDF5File`. This object allows accessing versions via getitem, like `VersionedHDF5File(f)[version_name]`. The primary use of this object, however, is the `stage_version()` method, which is a context manager that returns a group for a new version. The way to make a new version is

```py
import h5py
from versioned_hdf5 import VersionedHDF5File

f = h5py.File(...)
file = VersionedHDF5File(f)

# new_version and prev_version are strings corresponding to the version names
# for the new and previous versions
with file.stage_version(new_version, prev_version) as g:
    g['dataset'][0] = 1                # Modify a dataset from prev_version
    g['dataset'].resize(...)           # Resize a dataset from prev_version
    g.create_dataset('dataset2', ...)  # Create a new dataset
    g.create_group('new_group')        # Create a new subgroup
```

Inside of the context manager, the group `g` will look exactly like the previous version `prev_version`, but modifications to it will not actually modify `prev_version`. Rather, they will stage changes for the new version `new_version`.
When the context manager exits, the current state of the version group `g` is written as `new_version`. Any data chunks from `prev_version` that were not modified will be reused as described above.

Once a version is committed (after the context manager exits), it should be treated as read-only. The versioned-hdf5 objects have some safeguards to prevent accidentally writing to existing versioned data, but the underlying h5py has no such safeguards, since HDF5 itself has no notion of a read-only dataset, so these safeguards should not be relied on.
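
The staging semantics described above can be imitated with a toy context manager. This is only a sketch of the idea, assuming an in-memory store; `ToyVersionedStore` and its attributes are hypothetical and are not part of versioned-hdf5. Not committing when the `with` block raises is a choice made for this sketch, not a statement about the library's behavior.

```python
from contextlib import contextmanager


class ToyVersionedStore:
    """Hypothetical in-memory stand-in; not the versioned-hdf5 implementation."""

    def __init__(self):
        # Maps version name -> {dataset name: tuple of values}.
        self.versions = {"__first_version__": {}}

    @contextmanager
    def stage_version(self, new_version, prev_version="__first_version__"):
        # Shallow copy: the staged dict is new, but unmodified datasets
        # still refer to the same immutable tuples as prev_version.
        staged = dict(self.versions[prev_version])
        yield staged
        # Reached only if the with-block finished without raising.
        self.versions[new_version] = staged


store = ToyVersionedStore()

with store.stage_version("v1") as g:
    g["dataset"] = (0, 0, 0)
    g["other"] = (5, 6)

with store.stage_version("v2", "v1") as g:
    # Rebinding a name in the staged dict leaves v1 untouched.
    g["dataset"] = (1,) + g["dataset"][1:]

print(store.versions["v1"]["dataset"])  # (0, 0, 0)
print(store.versions["v2"]["dataset"])  # (1, 0, 0)
```

Here `v2` sees everything from `v1`, the modification lands only in `v2`, and the untouched `other` entry is the very same object in both versions, which mirrors how unmodified chunks are shared rather than copied.
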