uzuki2
Recovering R lists faithfully from HDF5 or JSON
The uzuki2 repository describes a language-agnostic file format for serializing basic R lists. List elements may be atomic vectors, NULL, or nested lists of such objects. It also supports missing values in the vectors and per-element names on the vectors or lists. A mechanism is also provided to handle external references to more complex objects (e.g., S4 classes) that cannot be directly saved into the format.
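To make the data model above concrete, here is a minimal C++17 sketch of an in-memory tree that can hold the same kinds of objects. Every type and function name in it (`Node`, `IntegerVector`, `Nothing`, `External`, `count_missing`, and so on) is hypothetical and chosen for illustration; none of them are taken from the uzuki2 specification or library, and only two atomic vector types are shown.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical in-memory model; these names are NOT part of uzuki2.
struct Node;

// Atomic vectors: std::nullopt stands in for a missing (NA) value.
struct IntegerVector {
    std::vector<std::optional<int>> values;
    std::vector<std::string> names; // left empty if the vector is unnamed
};

struct StringVector {
    std::vector<std::optional<std::string>> values;
    std::vector<std::string> names;
};

struct Nothing {}; // stands in for R's NULL

// Placeholder for a complex object (e.g., an S4 instance) stored elsewhere.
struct External {
    std::size_t index;
};

// A list owns its children, so lists can nest to any depth.
struct List {
    std::vector<std::unique_ptr<Node>> elements;
    std::vector<std::string> names; // left empty if the list is unnamed
};

struct Node {
    std::variant<Nothing, IntegerVector, StringVector, External, List> value;
};

// Wrap any one of the alternatives in a heap-allocated node.
template<class T>
std::unique_ptr<Node> make_node(T x) {
    auto n = std::make_unique<Node>();
    n->value.emplace<T>(std::move(x));
    return n;
}

// Count missing entries in a single atomic vector.
template<class Vec>
std::size_t missing_in(const Vec& v) {
    std::size_t n = 0;
    for (const auto& x : v.values) {
        if (!x.has_value()) ++n;
    }
    return n;
}

// Count missing values in all atomic vectors, recursing into nested lists.
std::size_t count_missing(const Node& node) {
    struct Visitor {
        std::size_t operator()(const Nothing&) const { return 0; }
        std::size_t operator()(const External&) const { return 0; }
        std::size_t operator()(const IntegerVector& v) const { return missing_in(v); }
        std::size_t operator()(const StringVector& v) const { return missing_in(v); }
        std::size_t operator()(const List& l) const {
            std::size_t n = 0;
            for (const auto& e : l.elements) n += count_missing(*e);
            return n;
        }
    };
    return std::visit(Visitor{}, node.value);
}

// Roughly list(a = c(1L, NA, 3L), b = NULL, c = list("x")) in R.
Node make_example() {
    IntegerVector a;
    a.values = {1, std::nullopt, 3};

    StringVector x;
    x.values = {std::string("x")};

    List inner;
    inner.elements.push_back(make_node(std::move(x)));

    List top;
    top.names = {"a", "b", "c"};
    top.elements.push_back(make_node(std::move(a)));
    top.elements.push_back(make_node(Nothing{}));
    top.elements.push_back(make_node(std::move(inner)));

    Node root;
    root.value.emplace<List>(std::move(top));
    return root;
}
```

Calling `make_example()` builds a three-element named list, and `count_missing()` walks it recursively. A variant-based tree is only one possible representation; it is used here because it maps each kind of element described above onto exactly one alternative.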
We support serialization in either HDF5 or (possibly Gzip-compressed) JSON. Both of these are widely used formats and have complementary strengths for list representation. HDF5 supports random access into list components, which can provide optimization opportunities when the list is large and/or contains large atomic vectors. In contrast, JSON is easier to parse and has less storage overhead per list element.
Both the HDF5 and JSON specifications have multiple versions. Links to the version-specific HDF5 specifications are listed below, along with the minimum version of the C++ library required to parse them:
Similarly, different versions of the JSON specification are listed below:
A reference implementation of the validator is provided as a header-only C++ library in include/uzuki2. This is useful for portable deployment in different frameworks like R, Python, etc. We can check that a JSON/HDF5 file complies with the uzuki specification:
This will raise an error if any violations of the specification are observed. If the list is expected to refer to a non-zero number of external objects, that expected number can be supplied to the validator as well:
Advanced users can also use the uzuki2 parser to load the list into memory. This is achieved by calling parse() with custom provisioner and external reference classes. For example, tests/src/test_subclass.h defines the DefaultProvisioner and DefaultExternals classes, which can be used to load the HDF5 contents into std::vectors for easier downstream operations.
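The provisioner idea can be illustrated with a small, self-contained sketch: the parser is templated on a class that decides how containers are allocated, so the caller controls the in-memory representation. Everything below is hypothetical (`IntegerSink`, `VectorInteger`, `VectorProvisioner`, `VectorExternals`, `toy_parse`, and the `kMissing` placeholder); it is not uzuki2's actual interface, which is defined in the headers and the reference documentation.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Everything below is hypothetical and only illustrates the pattern;
// none of these names are taken from uzuki2.

// Interface that the toy parser fills in without knowing the storage behind it.
struct IntegerSink {
    virtual ~IntegerSink() = default;
    virtual void set(std::size_t i, int value) = 0;
    virtual void set_missing(std::size_t i) = 0;
};

// Caller-defined storage backed by std::vector.
struct VectorInteger final : IntegerSink {
    explicit VectorInteger(std::size_t n) : values(n, 0), missing(n, false) {}
    void set(std::size_t i, int value) override { values[i] = value; }
    void set_missing(std::size_t i) override { missing[i] = true; }
    std::vector<int> values;
    std::vector<bool> missing;
};

// Provisioner: a factory the parser calls whenever it needs a new container.
struct VectorProvisioner {
    static std::unique_ptr<IntegerSink> new_integer(std::size_t n) {
        return std::make_unique<VectorInteger>(n);
    }
};

// External references: maps an index found in the file to an object that the
// caller keeps elsewhere (a string stands in for a complex object here).
struct VectorExternals {
    std::vector<std::string> objects;
    const std::string& get(std::size_t i) const {
        if (i >= objects.size()) {
            throw std::out_of_range("external index out of range");
        }
        return objects[i];
    }
    std::size_t size() const { return objects.size(); }
};

// Assumed placeholder for a missing integer in the toy input.
constexpr int kMissing = std::numeric_limits<int>::min();

// Toy parser: it knows the input layout, while the Provisioner decides
// what kind of object receives the parsed values.
template<class Provisioner>
std::unique_ptr<IntegerSink> toy_parse(const std::vector<int>& raw) {
    auto out = Provisioner::new_integer(raw.size());
    for (std::size_t i = 0; i < raw.size(); ++i) {
        if (raw[i] == kMissing) {
            out->set_missing(i);
        } else {
            out->set(i, raw[i]);
        }
    }
    return out;
}
```

A caller would invoke `toy_parse<VectorProvisioner>(...)` and downcast the result to `VectorInteger` to reach the underlying `std::vector`s. Swapping in a different provisioner changes the storage without touching the parsing logic, which is the motivation for this kind of design.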
See the reference documentation for more details.
FetchContent
If you're using CMake, you just need to add something like this to your CMakeLists.txt:
Then you can link to uzuki2 to make the headers available during compilation:
find_package()
You can install the library by cloning a suitable version of this repository and running the following commands:
Then you can use find_package() as usual:
If you're not using CMake, the simple approach is to just copy the files in the include/ subdirectory - either directly or with Git submodules - and include their path during compilation with, e.g., GCC's -I. You will also need to link to the dependencies listed in extern/CMakeLists.txt, along with the HDF5 and Zlib libraries.
See here for a list of changes from the original uzuki library.
As with the original uzuki, the name is simply a reference to Uzuki Shimamura.