ritsuko
Helper utilities for ArtifactDB C++ code
Assorted functions for handling ritsuko's custom VLS arrays.
Classes

- struct Pointer: Pointer into the VLS heap.
- class Stream1dArray: Stream a 1-dimensional VLS array into memory.

Functions

- H5::DataSet open_pointers(const H5::Group& handle, const char* name, size_t offset_precision, size_t length_precision)
- H5::DataSet open_heap(const H5::Group& handle, const char* name)
- template<typename Offset_, typename Length_> H5::CompType define_pointer_datatype()
- void validate_pointer_datatype(const H5::CompType& type, size_t offset_precision, size_t length_precision)
- template<typename Offset_, typename Length_> void validate_1d_array(const H5::DataSet& handle, hsize_t full_length, hsize_t heap_length, hsize_t buffer_size)
- template<typename Offset_, typename Length_> void validate_nd_array(const H5::DataSet& handle, const std::vector<hsize_t>& dimensions, hsize_t heap_length, hsize_t buffer_size)
Assorted functions for handling ritsuko's custom VLS arrays.
One weakness of HDF5 is its inability to efficiently handle variable length string (VLS) arrays. Storing them as fixed-length strings requires padding all strings to the longest string, causing an inflation in disk usage that cannot be completely negated by compression. On the other hand, HDF5's own VLS datatype does not compress the strings themselves, only the pointers to those strings.
To patch over this weakness in current versions of HDF5, ritsuko introduces its own concept of a VLS array. This is defined by two HDF5 datasets - one storing a VLS heap, and another storing pointers into that heap.
The heap dataset contains the concatenated bytes of all the variable length strings in the array; check out open_heap() for details. The pointer dataset uses a compound datatype (see define_pointer_datatype()), where each entry contains the starting offset and length of a single VLS on the heap; check out open_pointers() for details.

The idea is to read the pointer dataset into an array of Pointer instances, and then use the offset and length of each Pointer to extract a slice of characters from the heap. Each slice defines the VLS corresponding to that Pointer.

Typically, the pointer and heap datasets for a single VLS array will be stored in their own group, where they can be opened by open_pointers() and open_heap(), respectively. See also Stream1dArray to quickly stream the contents of a VLS array.
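To make this concrete, below is a minimal sketch of reading a 1-dimensional VLS array by hand. The dataset names ("pointers" and "heap"), the Pointer<uint64_t, uint32_t> parameterization, and the include path are assumptions for illustration; Stream1dArray packages this pattern more conveniently for the 1-dimensional case.

```cpp
#include "ritsuko/hdf5/vls/vls.hpp" // assumed include path
#include "H5Cpp.h"
#include <cstdint>
#include <string>
#include <vector>

// Read every string of a 1-dimensional VLS array stored in 'handle', assuming
// "pointers" and "heap" dataset names, 64-bit offsets and 32-bit lengths.
std::vector<std::string> read_vls_array(const H5::Group& handle) {
    auto pdset = ritsuko::hdf5::vls::open_pointers(handle, "pointers", 64, 32);
    auto hdset = ritsuko::hdf5::vls::open_heap(handle, "heap");

    // Read the pointer dataset into appropriately parameterized Pointer instances.
    hssize_t num = pdset.getSpace().getSimpleExtentNpoints();
    std::vector<ritsuko::hdf5::vls::Pointer<uint64_t, uint32_t>> ptrs(num);
    pdset.read(ptrs.data(), ritsuko::hdf5::vls::define_pointer_datatype<uint64_t, uint32_t>());

    // Read the heap as unsigned 8-bit integers, then alias the bytes as characters.
    hssize_t heap_len = hdset.getSpace().getSimpleExtentNpoints();
    std::vector<uint8_t> heap(heap_len);
    hdset.read(heap.data(), H5::PredType::NATIVE_UINT8);
    const char* text = reinterpret_cast<const char*>(heap.data());

    // Each (offset, length) slice of the heap defines one variable length string.
    std::vector<std::string> output;
    output.reserve(ptrs.size());
    for (const auto& p : ptrs) {
        output.emplace_back(text + p.offset, p.length);
    }
    return output;
}
```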
template<typename Offset_, typename Length_> H5::CompType ritsuko::hdf5::vls::define_pointer_datatype()

Define a compound datatype for a realization of a Pointer type.
Template parameters:
- Offset_: Unsigned integer type for the starting offset, see Pointer::offset.
- Length_: Unsigned integer type for the string length, see Pointer::length.
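As one illustration, the returned datatype can serve as the in-memory datatype when reading the pointer dataset into Pointer instances. The Pointer<uint64_t, uint32_t> parameterization and the include path below are assumptions.

```cpp
#include "ritsuko/hdf5/vls/vls.hpp" // assumed include path
#include "H5Cpp.h"
#include <cstdint>
#include <vector>

// Read an entire 1-dimensional pointer dataset into Pointer<uint64_t, uint32_t>
// instances, e.g., after opening it with open_pointers(..., 64, 32).
std::vector<ritsuko::hdf5::vls::Pointer<uint64_t, uint32_t>> read_all_pointers(const H5::DataSet& pointers) {
    auto mem_type = ritsuko::hdf5::vls::define_pointer_datatype<uint64_t, uint32_t>();
    hssize_t num = pointers.getSpace().getSimpleExtentNpoints();
    std::vector<ritsuko::hdf5::vls::Pointer<uint64_t, uint32_t>> buffer(num);
    pointers.read(buffer.data(), mem_type); // mem_type describes the in-memory layout
    return buffer;
}
```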
H5::DataSet ritsuko::hdf5::vls::open_heap(const H5::Group& handle, const char* name) [inline]
Open a HDF5 dataset containing the VLS heap. This should be a 1-dimensional dataset of unsigned 8-bit integers, representing the concatenation of bytes from all the variable length strings in the VLS array. Ideally, all bytes are referenced by at least one Pointer in the associated pointer dataset, though this is not required, e.g., if a VLS is replaced with a shorter string.

We use an integer datatype rather than HDF5's own string datatypes to avoid the risk of a naive reader incorrectly interpreting the heap as an array of fixed-width strings.

To read the heap into memory, we first read the bytes into a uint8_t array via H5::DataSet::read, and then access those bytes via an aliased char*. By comparison, directly reading the bytes into a char array is susceptible to overflow problems if HDF5 needs to convert the file's unsigned integers to a possibly-signed char type. Conversely, to write to the heap, we use an unsigned char* to alias the contents of each C-style string (following C++'s strict aliasing rules). We pass this aliasing pointer to H5::DataSet::write, which trivially converts from NATIVE_UCHAR in memory to the file's 8-bit unsigned integer datatype.
Parameters:
- handle: Group containing the heap dataset.
- name: Name of the heap dataset.
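Here is a sketch of both conventions. The helper function names are invented for illustration, and the write assumes the heap dataset already has sufficient extent for the requested slice.

```cpp
#include "H5Cpp.h"
#include <cstdint>
#include <string>
#include <vector>

// Read the entire heap into memory as unsigned 8-bit integers, then alias the
// bytes as characters, as described above.
std::string read_heap_bytes(const H5::DataSet& heap) {
    hssize_t num = heap.getSpace().getSimpleExtentNpoints();
    std::vector<uint8_t> bytes(num);
    heap.read(bytes.data(), H5::PredType::NATIVE_UINT8); // avoids signed-char conversion overflow
    return std::string(reinterpret_cast<const char*>(bytes.data()), bytes.size());
}

// Write 'len' characters of a C-style string into the heap at 'offset', aliasing
// the characters as unsigned bytes.
void write_heap_slice(H5::DataSet& heap, const char* value, hsize_t offset, hsize_t len) {
    H5::DataSpace file_space = heap.getSpace();
    file_space.selectHyperslab(H5S_SELECT_SET, &len, &offset);
    H5::DataSpace mem_space(1, &len);
    auto bytes = reinterpret_cast<const unsigned char*>(value); // allowed by strict aliasing rules
    heap.write(bytes, H5::PredType::NATIVE_UCHAR, mem_space, file_space);
}
```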
H5::DataSet ritsuko::hdf5::vls::open_pointers(const H5::Group& handle, const char* name, size_t offset_precision, size_t length_precision) [inline]
Open a HDF5 dataset containing pointers into the VLS heap, used to define the individual strings in a VLS array.

There are no restrictions on the ordering of entries in the pointer dataset. Slices for consecutive Pointers do not have to be ordered or contiguous, which allows one or more entries in the pointer dataset to be modified without invalidating other entries. Different Pointers can even refer to the same or overlapping slices, which provides some opportunities to improve compression for repeated strings.

An error is raised if the dataset's datatype is not compound or does not meet the expectations defined by validate_pointer_datatype().
Parameters:
- handle: Group containing the dataset of pointers.
- name: Name of the dataset of pointers.
- offset_precision: Maximum number of bits in the integer type used for the start offset, see Pointer::offset.
- length_precision: Maximum number of bits in the integer type used for the string length, see Pointer::length.
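For instance, two pointer entries can legally share heap bytes; the Pointer<uint64_t, uint32_t> parameterization below is an assumption for illustration.

```cpp
#include "ritsuko/hdf5/vls/vls.hpp" // assumed include path
#include <cstdint>

// Heap bytes: { 'a','l','p','h','a','b','e','t' }
// Two pointer entries may reference the same or overlapping slices of the heap.
void overlapping_pointers_example() {
    ritsuko::hdf5::vls::Pointer<uint64_t, uint32_t> first, second;
    first.offset = 0;
    first.length = 8;  // "alphabet"
    second.offset = 0;
    second.length = 5; // "alpha", reusing the first five bytes of the same slice
    (void) first;
    (void) second;
}
```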
template<typename Offset_, typename Length_> void ritsuko::hdf5::vls::validate_1d_array(const H5::DataSet& handle, hsize_t full_length, hsize_t heap_length, hsize_t buffer_size) [inline]
Check that the pointers for a 1-dimensional VLS array are valid. An error is thrown if any pointer is out of range of the associated heap dataset.
Parameters:
- handle: Handle to the pointer dataset for a VLS array, see open_pointers().
- full_length: Length of the dataset as a 1-dimensional vector.
- heap_length: Length of the heap dataset.
- buffer_size: Size of the buffer for reading pointers by block.
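A minimal validation sketch follows; the dataset names, the 64/32-bit precisions, and the buffer size of 10000 are all assumptions.

```cpp
#include "ritsuko/hdf5/vls/vls.hpp" // assumed include path
#include "H5Cpp.h"
#include <cstdint>

// Validate the pointers of a 1-dimensional VLS array stored inside 'handle'.
void check_1d_vls(const H5::Group& handle) {
    auto pdset = ritsuko::hdf5::vls::open_pointers(handle, "pointers", 64, 32);
    auto hdset = ritsuko::hdf5::vls::open_heap(handle, "heap");
    hsize_t full_length = pdset.getSpace().getSimpleExtentNpoints();
    hsize_t heap_length = hdset.getSpace().getSimpleExtentNpoints();
    // Throws if any pointer lies outside the heap.
    ritsuko::hdf5::vls::validate_1d_array<uint64_t, uint32_t>(pdset, full_length, heap_length, 10000);
}
```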
template<typename Offset_, typename Length_> void ritsuko::hdf5::vls::validate_nd_array(const H5::DataSet& handle, const std::vector<hsize_t>& dimensions, hsize_t heap_length, hsize_t buffer_size)
Check that the pointers for an N-dimensional VLS array are valid. An error is thrown if any pointer is out of range of the associated heap dataset.
Parameters:
- handle: Handle to the pointer dataset for a VLS array, see open_pointers().
- dimensions: Dimensions of the dataset.
- heap_length: Length of the heap dataset.
- buffer_size: Size of the buffer for reading pointers by block.
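The N-dimensional variant is similar, except that the full extent is supplied as a vector of dimensions; again, the dataset names, precisions, and buffer size below are assumptions.

```cpp
#include "ritsuko/hdf5/vls/vls.hpp" // assumed include path
#include "H5Cpp.h"
#include <cstdint>
#include <vector>

// Validate the pointers of an N-dimensional VLS array stored inside 'handle'.
void check_nd_vls(const H5::Group& handle) {
    auto pdset = ritsuko::hdf5::vls::open_pointers(handle, "pointers", 64, 32);
    auto hdset = ritsuko::hdf5::vls::open_heap(handle, "heap");

    // Extract the dimension extents of the pointer dataset.
    H5::DataSpace space = pdset.getSpace();
    std::vector<hsize_t> dims(space.getSimpleExtentNdims());
    space.getSimpleExtentDims(dims.data());

    hsize_t heap_length = hdset.getSpace().getSimpleExtentNpoints();
    ritsuko::hdf5::vls::validate_nd_array<uint64_t, uint32_t>(pdset, dims, heap_length, 10000);
}
```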
void ritsuko::hdf5::vls::validate_pointer_datatype(const H5::CompType& type, size_t offset_precision, size_t length_precision) [inline]
Validate that a compound HDF5 datatype meets the expectations for storage in a Pointer instance, specifically:

- It should contain a member named offset, of an integer datatype that is unsigned and has no more than offset_precision bits.
- It should contain a member named length, of an integer datatype that is unsigned and has no more than length_precision bits.

The constraints on the precision of each integer type ensure that the pointer dataset can be represented in memory by the associated type. For example, setting offset_precision = 64 allows readers to safely assume that a uint64_t can be used for Pointer::offset.

On success, the contents of the HDF5 dataset associated with type can be safely read into an array of appropriately parameterized Pointer instances. Otherwise, an error is thrown.
Parameters:
- type: Compound datatype, typically generated from a H5::DataSet instance.
- offset_precision: Maximum number of bits in the integer type used for the start position, see Pointer::offset.
- length_precision: Maximum number of bits in the integer type used for the string size, see Pointer::length.
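For example, a reader might check a dataset's on-disk datatype before committing to a particular Pointer parameterization. The dataset name and precisions below are assumptions.

```cpp
#include "ritsuko/hdf5/vls/vls.hpp" // assumed include path
#include "H5Cpp.h"

// Check that the "pointers" dataset in 'handle' can be safely read into
// Pointer<uint64_t, uint32_t> instances; throws an error on failure.
void check_pointer_type(const H5::Group& handle) {
    H5::DataSet pdset = handle.openDataSet("pointers");
    H5::CompType disk_type(pdset); // extracts the dataset's compound datatype
    ritsuko::hdf5::vls::validate_pointer_datatype(disk_type, 64, 32);
}
```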