uzuki2
Recovering R lists faithfully from HDF5 or JSON
The uzuki2 repository describes a language-agnostic file format for serializing basic R lists. List elements may be atomic vectors, NULL, or nested lists of such objects. It also supports missing values in the vectors and per-element names on the vectors or lists. A mechanism is also provided to handle external references to more complex objects (e.g., S4 classes) that cannot be directly saved into the format.
We support serialization in either HDF5 or (possibly Gzip-compressed) JSON. Both of these are widely used formats and have complementary strengths for list representation. HDF5 supports random access into list components, which can provide optimization opportunities when the list is large and/or contains large atomic vectors. In contrast, JSON is easier to parse and has less storage overhead per list element.
The full HDF5 specification is provided here.
The full JSON specification is provided here.
A reference implementation of the validator is provided as a header-only C++ library in include/uzuki2. This is useful for portable deployment in different frameworks like R, Python, etc. We can check that a JSON/HDF5 file complies with the uzuki specification:
This will raise an error if any violations of the specification are observed. If a non-zero number of external objects is expected:
Advanced users can also use the uzuki2 parser to load the list into memory. This is achieved by calling parse() with custom provisioner and external reference classes. For example, tests/src/test_subclass.h defines the DefaultProvisioner and DefaultExternals classes, which can be used to load the HDF5 contents into std::vectors for easier downstream operations.
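To give a rough idea of what an external reference class might look like, the sketch below resolves an index to an opaque pointer and reports how many external objects are available. The names SketchExternals and ExternalObject are hypothetical, and the get()/size() interface is an assumption made for illustration; the interface that parse() actually requires is defined by the reference documentation and tests/src/test_subclass.h, and may differ.

```cpp
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for a complex object (e.g., an S4 instance)
// that cannot be stored directly in the file.
struct ExternalObject {
    int id;
};

// Hypothetical externals class. ASSUMPTION: the parser only needs to ask
// for the i-th external object and for the total number of objects.
class SketchExternals {
public:
    explicit SketchExternals(std::vector<ExternalObject*> objects)
        : objects_(std::move(objects)) {}

    // Resolve the i-th external reference to an opaque pointer.
    void* get(std::size_t i) const {
        if (i >= objects_.size()) {
            throw std::out_of_range("external index out of range");
        }
        return objects_[i];
    }

    // Number of external objects known to this instance.
    std::size_t size() const {
        return objects_.size();
    }

private:
    std::vector<ExternalObject*> objects_;
};
```

Handing out opaque pointers keeps such a class independent of whatever host framework owns the complex objects; the caller casts each pointer back to the appropriate type after loading.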
The parser supports multiple specification versions. Note that the version number of the specification has no direct relationship to the version number of the uzuki2 library.
Library version | HDF5 version | JSON version |
---|---|---|
1.0.x | 1.0 | 1.0 |
1.1.x | 1.0 - 1.1 | 1.0 - 1.1 |
1.2.x | 1.0 - 1.2 | 1.0 - 1.2 |
1.3.x | 1.0 - 1.3 | 1.0 - 1.2 |
Also see the reference documentation for more details.
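As an illustration of how the compatibility table above might be used, the following sketch encodes it as a lookup so that an application can decide whether a file of a given specification version is readable by a given library release. The names Format, max_spec_minor and can_read are hypothetical helpers and are not part of the uzuki2 library; only 1.x versions listed in the table are covered.

```cpp
// Which of the two file formats a specification version refers to.
enum class Format { HDF5, JSON };

// Highest specification minor version (i.e., the y in 1.y) supported by a
// uzuki2 library release 1.<library_minor>.x, taken from the table.
// Returns -1 for library versions that the table does not list.
inline int max_spec_minor(int library_minor, Format format) {
    switch (library_minor) {
        case 0: return 0;
        case 1: return 1;
        case 2: return 2;
        case 3: return format == Format::HDF5 ? 3 : 2;
        default: return -1;
    }
}

// True if a file written against specification 1.<spec_minor> falls inside
// the supported range (1.0 up to the maximum) for the given library release.
inline bool can_read(int library_minor, Format format, int spec_minor) {
    const int max_minor = max_spec_minor(library_minor, format);
    return spec_minor >= 0 && max_minor >= 0 && spec_minor <= max_minor;
}
```

For instance, under this encoding a 1.3.x library accepts HDF5 files at specification 1.3 but JSON files only up to 1.2.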
FetchContent
If you're using CMake, you just need to add something like this to your CMakeLists.txt:
Then you can link to uzuki2 to make the headers available during compilation:
find_package()
You can install the library by cloning a suitable version of this repository and running the following commands:
Then you can use find_package() as usual:
If you're not using CMake, the simple approach is to just copy the files in the include/ subdirectory, either directly or with Git submodules, and include their path during compilation with, e.g., GCC's -I. You will also need to link to the dependencies listed in extern/CMakeLists.txt, along with the HDF5 and Zlib libraries.
See here for a list of changes from the original uzuki library.
Like the original uzuki, the name is a reference to Uzuki Shimamura.