comservatory
Strict validation of CSV files in C++
The comma-separated value (CSV) format uses a tabular layout where each record starts on a new line and fields are separated by commas. Beyond this, though, there is little standardization of how the fields are interpreted. The comservatory repository defines an opinionated version of a "CSV standard" and provides a header-only C++ library to validate and load a CSV file. It is primarily intended for use in data management applications that write CSV files and want to provide some structural guarantees for downstream users.
Each record in the file should have the same number and type of fields. The following records are consistent:
Whereas these are not:
Field types are inferred from the entries in the subsequent records, as described in the rest of this document. This is usually unambiguous unless a field contains only missing (NA) values, in which case its type is indeterminate.
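The inference rule can be sketched in a few lines of C++. This is only an illustration of the rule as described here, not the library's implementation; the EntryType labels and the merge_type() and infer_field_type() helpers are hypothetical names introduced for this sketch. Missing entries are modelled as Unknown, so a field that contains only NA values keeps an indeterminate type.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical type labels for this sketch; they are not taken from the library.
enum class EntryType { Unknown, String, Number, Boolean, Complex };

// Combine the type seen so far for a field with the type of a new entry.
// Missing (NA) entries are passed in as Unknown and never change the result.
inline EntryType merge_type(EntryType current, EntryType observed) {
    if (observed == EntryType::Unknown) {
        return current;
    }
    if (current == EntryType::Unknown || current == observed) {
        return observed;
    }
    throw std::runtime_error("inconsistent types within a field");
}

// Infer the type of one field from the types of its entries across all records.
inline EntryType infer_field_type(const std::vector<EntryType>& entries) {
    EntryType result = EntryType::Unknown;
    for (EntryType e : entries) {
        result = merge_type(result, e);
    }
    return result;
}
```

A field whose entries are NA, Number, Number is inferred as Number, while a field mixing String and Number entries is rejected.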
Each record should start on a new line but may span multiple lines if a String field contains an entry with a newline. For example, a record containing [1, 2, "a\nb", 4] would be formatted as:
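In C++ source, that record can be written as the string literal below, where the embedded newline sits inside the quoted String entry. The count_records() helper is a hypothetical sketch (not the library's parser) showing why such a record still counts as one: newlines are only treated as record terminators when they fall outside double quotes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count records in CSV text, treating newlines inside double-quoted strings
// as part of the field rather than as record terminators.
inline std::size_t count_records(const std::string& text) {
    std::size_t records = 0;
    bool in_quotes = false;
    for (char c : text) {
        if (c == '"') {
            // An escaped quote ("") toggles twice, leaving the state unchanged.
            in_quotes = !in_quotes;
        } else if (c == '\n' && !in_quotes) {
            ++records;
        }
    }
    return records;
}

// The record [1, 2, "a\nb", 4] occupies two physical lines in the file.
const std::string example_record = "1,2,\"a\nb\",4\n";
```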
The first line(s) of the file should contain a header that defines the names of the columns. Each name should follow the format described for Strings and should be unique across all columns. Again, the header line is allowed to span multiple lines if its entries contain newlines. For example, a header of ["aasdas", "qwert,asdas", "voo\ndasdsd"] would look like:
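The test string in the usage note below shows this header as a C++ literal. As a rough sketch of the header rules (quoted names, embedded commas and newlines, uniqueness), the hypothetical parse_header() function reads such a header back into its names. It is not the library's parser and it deliberately ignores the zero-column case.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Parse a header made of double-quoted names separated by commas and ended by
// a newline. Names may contain commas, newlines and doubled ("") quotes.
inline std::vector<std::string> parse_header(const std::string& text) {
    std::vector<std::string> names;
    std::set<std::string> seen;
    std::size_t i = 0;
    while (true) {
        if (i >= text.size() || text[i] != '"') {
            throw std::runtime_error("header names must start with a double quote");
        }
        ++i; // skip the opening quote
        std::string name;
        while (true) {
            if (i >= text.size()) {
                throw std::runtime_error("unterminated header name");
            }
            if (text[i] == '"') {
                if (i + 1 < text.size() && text[i + 1] == '"') {
                    name += '"'; // a doubled quote is an escaped quote
                    i += 2;
                    continue;
                }
                ++i; // skip the closing quote
                break;
            }
            name += text[i++];
        }
        if (!seen.insert(name).second) {
            throw std::runtime_error("duplicated header name: " + name);
        }
        names.push_back(name);
        if (i < text.size() && text[i] == ',') {
            ++i; // another name follows
        } else if (i < text.size() && text[i] == '\n') {
            return names; // end of the header
        } else {
            throw std::runtime_error("expected a comma or newline after a name");
        }
    }
}
```

Calling parse_header("\"aasdas\",\"qwert,asdas\",\"voo\ndasdsd\"\n") yields the three names from the example, with the comma and the newline preserved inside the second and third names.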
The last line of the file should be terminated with a newline.
Zero-column datasets are represented by an empty line in the header and one empty line per record.
There is no support for comment characters or lines.
The current version of the comservatory specification is 1.0. (This should not be confused with the release versions of the comservatory C++ library.)
All strings should be enclosed in double quotes. The only exception is that of missing strings, which are denoted as NA without quotes. For example, the fourth field of the first record is missing below:
If the string itself contains a double quote, that quote should be escaped by another double quote. For example, to represent the string x y "z", we would use:
Strings may contain any number of other characters, including commas and newlines.
Encoding is assumed to be UTF-8, so Unicode characters are allowed.
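The quoting rules for String entries can be illustrated with a small writer-side helper. This is a minimal sketch under the rules stated above; format_string_entry() is a hypothetical name, and a null pointer is used here to stand for a missing string.

```cpp
#include <cassert>
#include <string>

// Format one String entry. A null pointer means "missing" and becomes an
// unquoted NA; any other value is wrapped in double quotes, with each
// embedded double quote escaped by doubling it.
inline std::string format_string_entry(const std::string* value) {
    if (value == nullptr) {
        return "NA";
    }
    std::string out = "\"";
    for (char c : *value) {
        if (c == '"') {
            out += '"'; // escape by doubling
        }
        out += c;
    }
    out += '"';
    return out;
}
```

With this helper, the string x y "z" is written as "x y ""z""", and the literal two-character string NA is written with quotes so that it cannot be confused with a missing value.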
Numeric entries can be written as simple integers, consisting of only 0-9 characters. The numeric characters may be preceded by a single hyphen or plus sign to represent a negative or positive number, respectively.
Any number of leading zeros is permitted, though this is somewhat unusual.
Numeric entries may also contain a single intervening decimal point. There must be numeric characters on both sides of the point. Entries may be preceded by a single hyphen or plus sign for negative or positive numbers, respectively. Leading zeros are still permitted.
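The integer and decimal rules translate into a short check. The is_plain_number() function below is a hypothetical sketch of those rules only: it does not cover scientific notation or special values, and it is not the library's own validator.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Check whether a token is an integer or decimal Number entry: an optional
// sign, one or more digits, then optionally a point followed by one or more
// digits. Leading zeros are accepted.
inline bool is_plain_number(const std::string& token) {
    std::size_t i = 0;
    const std::size_t n = token.size();
    if (i < n && (token[i] == '-' || token[i] == '+')) {
        ++i;
    }
    std::size_t start = i;
    while (i < n && std::isdigit(static_cast<unsigned char>(token[i]))) {
        ++i;
    }
    if (i == start) {
        return false; // no digits before the point
    }
    if (i == n) {
        return true; // simple integer
    }
    if (token[i] != '.') {
        return false;
    }
    ++i;
    start = i;
    while (i < n && std::isdigit(static_cast<unsigned char>(token[i]))) {
        ++i;
    }
    return i > start && i == n; // digits after the point and nothing else
}
```

Under these rules, 123, -007 and +1.50 are accepted, while 1. and .5 are rejected because digits are required on both sides of the point.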
Alternatively, numbers can be stored in scientific format. This follows the format XeY where:
- X lies in [1, 10).
- X is formatted as an integer or with a decimal point, as described above.
- Y is formatted as an integer, as described above.
It is also permitted to use a capital E:
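A sketch of a check for the scientific format follows. The is_scientific_number() name is hypothetical, and two points are assumptions of this sketch rather than statements from the specification: a sign on X is accepted and the [1, 10) range is applied to its magnitude, and leading zeros in X are skipped before checking that a single non-zero digit precedes the optional decimal point.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

inline bool is_digit_char(char c) {
    return std::isdigit(static_cast<unsigned char>(c)) != 0;
}

// Check the XeY (or XEY) format. X is an optionally signed integer or decimal
// whose magnitude lies in [1, 10); Y is an optionally signed integer.
inline bool is_scientific_number(const std::string& token) {
    const std::size_t n = token.size();
    std::size_t i = 0;
    if (i < n && (token[i] == '-' || token[i] == '+')) {
        ++i;
    }
    while (i < n && token[i] == '0') {
        ++i; // assumption: leading zeros in X are skipped
    }
    if (i >= n || token[i] < '1' || token[i] > '9') {
        return false; // the integer part of X must be a single digit 1-9
    }
    ++i;
    if (i < n && token[i] == '.') {
        ++i;
        const std::size_t frac_start = i;
        while (i < n && is_digit_char(token[i])) {
            ++i;
        }
        if (i == frac_start) {
            return false; // digits are required after the point
        }
    }
    if (i >= n || (token[i] != 'e' && token[i] != 'E')) {
        return false;
    }
    ++i;
    if (i < n && (token[i] == '-' || token[i] == '+')) {
        ++i;
    }
    const std::size_t exp_start = i;
    while (i < n && is_digit_char(token[i])) {
        ++i;
    }
    return i > exp_start && i == n;
}
```

For example, 1e5, 1.5E-3 and -9.99e+10 pass this check, whereas 10e5 and 0.5e3 fail because X falls outside [1, 10).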
This standard does not make a distinction between the different types of notation for Number entries. One record may store a number as an integer while another record uses decimal notation for the same field.
Missing values are denoted by NA entries.
Not-a-number values are represented by nan, -nan or any capitalization thereof, e.g., NaN.
Infinite values are represented by inf, -inf or any capitalization thereof, e.g., Inf.
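The special values can be recognized with a case-insensitive comparison. In the sketch below, the SpecialNumber labels and classify_special_number() are hypothetical names. Only the spellings listed above are accepted, so forms such as +inf or infinity are treated as unrecognized; that strictness is an assumption of this sketch.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical labels for the special Number values.
enum class SpecialNumber { None, NaN, PositiveInfinity, NegativeInfinity };

// Classify a token as nan/-nan, inf or -inf in any capitalization.
// Anything else, including NA, is reported as None.
inline SpecialNumber classify_special_number(const std::string& token) {
    std::string lower;
    for (char c : token) {
        lower += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    if (lower == "nan" || lower == "-nan") {
        return SpecialNumber::NaN;
    }
    if (lower == "inf") {
        return SpecialNumber::PositiveInfinity;
    }
    if (lower == "-inf") {
        return SpecialNumber::NegativeInfinity;
    }
    return SpecialNumber::None;
}
```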
Booleans should be stored as any capitalization of TRUE or FALSE:
Missing values are denoted by NA entries.
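A Boolean entry can therefore take one of three valid states. The sketch below uses the hypothetical BooleanEntry labels and parse_boolean_entry() function to show this, and it assumes that NA must be written in upper case, which the text implies but does not state outright.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical result labels for a single Boolean entry.
enum class BooleanEntry { True, False, Missing, Invalid };

// Interpret a token as a Boolean entry. TRUE and FALSE are matched in any
// capitalization; NA is matched exactly (an assumption of this sketch).
inline BooleanEntry parse_boolean_entry(const std::string& token) {
    if (token == "NA") {
        return BooleanEntry::Missing;
    }
    std::string lower;
    for (char c : token) {
        lower += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    if (lower == "true") {
        return BooleanEntry::True;
    }
    if (lower == "false") {
        return BooleanEntry::False;
    }
    return BooleanEntry::Invalid;
}
```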
Complex numbers are represented by the format A+Bi where A and B are formatted as Real number fields. Both A and B must be present, even if the complex number contains only a real or imaginary part.
Missing values are denoted by NA entries.
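To illustrate the A+Bi layout, the sketch below splits a token into its A and B parts. The split_complex_entry() name is hypothetical, and the function only locates the separator; each part would still need to be validated as a number. Because the text describes only a + separator, this sketch assumes that a negative imaginary part carries its own sign inside B (as in 1+-2i). A + that directly follows e or E is treated as the sign of an exponent rather than as the separator.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Split "A+Bi" into its A and B parts. Returns false if the token lacks the
// trailing 'i', the separating '+', or either of the two parts.
inline bool split_complex_entry(const std::string& token,
                                std::string& real_part,
                                std::string& imag_part) {
    const std::size_t n = token.size();
    if (n < 4 || token[n - 1] != 'i') {
        return false; // the shortest valid form looks like 1+2i
    }
    // Start at 1 so that a leading sign on A is never taken as the separator.
    for (std::size_t pos = 1; pos + 1 < n; ++pos) {
        if (token[pos] != '+') {
            continue;
        }
        const char prev = token[pos - 1];
        if (prev == 'e' || prev == 'E') {
            continue; // sign of an exponent, not the separator
        }
        real_part = token.substr(0, pos);
        imag_part = token.substr(pos + 1, n - pos - 2); // drop the trailing 'i'
        return !imag_part.empty();
    }
    return false;
}
```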
A reference implementation of the validator is provided as a header-only C++ library in include/comservatory. This is useful for portable deployment in different frameworks like R, Python, etc. Given a path to a CSV file, we can load its contents using the read_file() function:
If we are only interested in a subset of fields, we can ask read_file() to only return that subset. Note that all fields are still validated, but only the contents of the requested fields are returned in memory; all other fields have placeholder entries.
If only validation is required, we can avoid storing contents in memory by setting validate_only = true. This will parse the file and throw an error upon encountering an invalid format.
See the reference documentation for more details.
FetchContent
If you're using CMake, you just need to add something like this to your CMakeLists.txt:
Then you can link to comservatory to make the headers available during compilation:
find_package()
You can install the library by cloning a suitable version of this repository and running the following commands:
Then you can use find_package() as usual:
If you're not using CMake, the simple approach is to just copy the files in the include/ subdirectory, either directly or with Git submodules, and include their path during compilation with, e.g., GCC's -I. You will also need to make the byteme directory available in the same way, and to link to the Zlib library.
Gzipped CSVs are automatically supported by read_file() once comservatory is compiled with Zlib support.
Other inputs are supported via the read() function for byteme::Reader classes. For example, we can parse a CSV file from an in-memory Zlib-compressed buffer:
Fields
If the header names and/or field types are known in advance, we can specify them in a Contents object to be passed to read_file(). This allows developers to strictly control the contents of each field while it is filled.
If the data in the CSV does not match the supplied information, an error is immediately raised. This is helpful for validation purposes, as opposed to reading the entire file into memory and then checking the contents.
Field types
Developers may define their own Field subclasses to customize the in-memory representation of the data. For example, the default FilledField uses a std::vector to store the data values. If a std::deque is preferred instead, we could do so by modifying some code from include/comservatory/Field.hpp:
This requires an accompanying FieldCreator subclass to direct ReadCsv to use our newly defined DequeFilledField subclasses. We'll recycle some code from include/comservatory/Creator.hpp to demonstrate:
And then we can direct read_file() to use this new FieldCreator:
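The overall pattern (an abstract field, a deque-backed subclass, and a creator that hands out instances) can be sketched without the library. Everything below is hypothetical: the Sketch* names and their methods are invented for illustration and are not the interfaces declared in include/comservatory/Field.hpp or include/comservatory/Creator.hpp, which should be consulted for the real signatures.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <limits>
#include <memory>

// Hypothetical minimal interface for a numeric field. The library's real base
// classes may declare different methods.
struct SketchNumberField {
    virtual ~SketchNumberField() {}
    virtual std::size_t size() const = 0;
    virtual void push_back(double value) = 0;
    virtual void add_missing() = 0;
};

// A field that stores its values in a std::deque instead of a std::vector.
struct SketchDequeNumberField : public SketchNumberField {
    std::deque<double> values;
    std::deque<std::size_t> missing; // indices of NA entries

    std::size_t size() const override { return values.size(); }

    void push_back(double value) override { values.push_back(value); }

    void add_missing() override {
        missing.push_back(values.size());
        // Keep the indices aligned by storing a placeholder for the NA entry.
        values.push_back(std::numeric_limits<double>::quiet_NaN());
    }
};

// Hypothetical creator: a parser would call this whenever it needs a new
// numeric field, so swapping the creator swaps the storage type.
struct SketchDequeFieldCreator {
    std::unique_ptr<SketchNumberField> create_number() const {
        return std::unique_ptr<SketchNumberField>(new SketchDequeNumberField);
    }
};
```

The point of the indirection is that the parsing code only ever sees the abstract field, so the choice of container stays local to the subclass and its creator.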
If you see any bugs, report them in the Issues. Pull requests are also welcome.