dolomite_matrix package

Submodules

dolomite_matrix.DelayedMask module

class dolomite_matrix.DelayedMask.DelayedMask(seed, placeholder, dtype=None)[source]

Bases: DelayedOp

Delayed operation that masks every value equal to the missing value placeholder, so that the extracted contents are a NumPy masked array.

__init__(seed, placeholder, dtype=None)[source]
Parameters:
  • seed – Any object that satisfies the seed contract, see DelayedArray for details.

  • placeholder – Placeholder value for defining masked values, of the same type as seed.dtype (or coercible into that type). All values equal to the placeholder are considered to be missing.

  • dtype (Optional[dtype]) – Desired type of the masked output, defaults to seed.dtype.

property dtype: dtype

Returns: NumPy type for the contents after masking.

property placeholder

Returns: The placeholder value.

property seed

Returns: The seed object.

property shape: Tuple[int, ...]

Returns: Tuple of integers specifying the extent of each dimension of this object. This is the same as the shape of the seed object.
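
The effect of the placeholder and dtype arguments can be illustrated with plain NumPy. This is only a sketch of the masking idea, not the class's actual implementation: the helper name mask_placeholder is hypothetical, and the special handling of a NaN placeholder is an assumption.

```python
import numpy as np

def mask_placeholder(values, placeholder, dtype=None):
    """Hypothetical helper: mask entries equal to `placeholder`."""
    values = np.asarray(values)
    if dtype is None:
        dtype = values.dtype  # mirrors the documented default of seed.dtype
    # Assumption: NaN never compares equal to itself, so test it separately.
    if isinstance(placeholder, float) and np.isnan(placeholder):
        mask = np.isnan(values)
    else:
        mask = values == placeholder
    return np.ma.MaskedArray(values.astype(dtype), mask=mask)

raw = np.array([[1, -1, 3], [-1, 5, 6]], dtype=np.int32)
masked = mask_placeholder(raw, placeholder=-1, dtype=np.float64)
print(masked)
```

Masked entries are skipped by reductions such as masked.sum(), which is the main practical difference from leaving the placeholder values in place.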

dolomite_matrix.DelayedMask.chunk_grid_DelayedMask(x)[source]

See chunk_grid().

dolomite_matrix.DelayedMask.create_dask_array_DelayedMask(x)[source]

See create_dask_array().

dolomite_matrix.DelayedMask.extract_dense_array_DelayedMask(x, subset)[source]

See extract_dense_array().

dolomite_matrix.DelayedMask.extract_sparse_array_DelayedMask(x, subset)[source]

See extract_sparse_array().

dolomite_matrix.DelayedMask.is_masked_DelayedMask(x)[source]

See is_masked().

dolomite_matrix.DelayedMask.is_sparse_DelayedMask(x)[source]

See is_sparse().

dolomite_matrix.ReloadedArray module

class dolomite_matrix.ReloadedArray.ReloadedArray(seed, path)[source]

Bases: DelayedArray

An array that was reloaded from disk by the read_object() function, and remembers the path from which it was loaded. This class allows methods to refer to the existing on-disk representation by inspecting the path. For example, save_object() can just copy/link to the existing files instead of repeating the saving process.

__init__(seed, path)[source]

To construct a ReloadedArray from an existing ReloadedArraySeed, use wrap() instead.

Parameters:
  • seed – The contents of the reloaded array.

  • path (str) – Path to the directory containing the on-disk representation.

property path: str

Returns: Path to the directory containing the on-disk representation.

class dolomite_matrix.ReloadedArray.ReloadedArraySeed(seed, path)[source]

Bases: WrapperArraySeed

Seed for the ReloadedArray class. This is a subclass of WrapperArraySeed.

__init__(seed, path)[source]
Parameters:
  • seed – The contents of the reloaded array.

  • path (str) – Path to the directory containing the on-disk representation.

property path: str

Returns: Path to the directory containing the on-disk representation.

dolomite_matrix.ReloadedArray.save_object_ReloadedArray(x, path, reloaded_array_reuse_mode='link', **kwargs)[source]

Method for saving ReloadedArray objects to disk, see save_object() for details.

Parameters:
  • x (ReloadedArray) – Object to be saved.

  • path (str) – Path to a directory to save x.

  • reloaded_array_reuse_mode (str) – How the files in x.path should be re-used when populating path. This can be "link", to create a hard link to each file; "symlink", to create a symbolic link to each file; "copy", to create a copy of each file; or "none", to perform a fresh save of x without relying on x.path.

  • kwargs – Further arguments, ignored.

Returns:

x is saved to path.
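
The four values of reloaded_array_reuse_mode can be sketched with the standard library alone. The helper below is hypothetical and is not the package's code: the name reuse_files, the file name payload.txt, and the convention of returning False for "none" so that the caller performs a fresh save are all assumptions made for illustration.

```python
import os
import shutil
import tempfile

def reuse_files(src, dest, mode="link"):
    """Hypothetical helper: populate `dest` from the files in `src`."""
    if mode == "none":
        return False  # caller should perform a fresh save instead
    if mode not in ("link", "symlink", "copy"):
        raise ValueError("unknown reuse mode: " + repr(mode))
    for root, _dirs, files in os.walk(src):
        rel = os.path.relpath(root, src)
        target_dir = dest if rel == "." else os.path.join(dest, rel)
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            source = os.path.join(root, name)
            target = os.path.join(target_dir, name)
            if mode == "link":
                os.link(source, target)  # hard link to the same data
            elif mode == "symlink":
                os.symlink(os.path.abspath(source), target)
            else:
                shutil.copyfile(source, target)  # independent copy
    return True

# Small demonstration in a temporary directory.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "original")
os.makedirs(src)
with open(os.path.join(src, "payload.txt"), "w") as handle:
    handle.write("example")

copied = os.path.join(workdir, "copied")
reuse_files(src, copied, mode="copy")
```

Hard links and symbolic links avoid duplicating data on disk, while a copy stays valid even if the original directory is later removed.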

dolomite_matrix.ReloadedArray.wrap_ReloadedArraySeed(x)[source]

See wrap().

Return type:

ReloadedArray

dolomite_matrix.WrapperArraySeed module

class dolomite_matrix.WrapperArraySeed.WrapperArraySeed(seed)[source]

Bases: object

Wrapper for a DelayedArray seed, which forwards all of the required operations to the seed object. This is expected to be used as a base for concrete subclasses that attach more provenance-tracking information - see ReloadedArray for an example.

__init__(seed)[source]
Parameters:
  • seed – The underlying seed instance to be wrapped.

property dtype: dtype

Returns: The type of the seed.

property seed

Returns: The underlying seed instance.

property shape: Tuple[int, ...]

Returns: The shape of the seed.
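
The forwarding pattern described for this class can be sketched in plain Python. The classes below (ToySeed, ForwardingSeed, PathTrackingSeed) are hypothetical stand-ins written for illustration. They do not reproduce the package's classes, and the toy seed exposes only shape and dtype rather than the full seed contract.

```python
class ToySeed:
    """Hypothetical seed exposing only a shape and a dtype."""
    def __init__(self, shape, dtype):
        self.shape = shape
        self.dtype = dtype

class ForwardingSeed:
    """Wrapper that forwards shape and dtype to the wrapped seed."""
    def __init__(self, seed):
        self._seed = seed

    @property
    def seed(self):
        return self._seed

    @property
    def shape(self):
        return self._seed.shape

    @property
    def dtype(self):
        return self._seed.dtype

class PathTrackingSeed(ForwardingSeed):
    """Subclass that attaches provenance (here, a path) to the wrapper."""
    def __init__(self, seed, path):
        super().__init__(seed)
        self._path = path

    @property
    def path(self):
        return self._path

wrapped = PathTrackingSeed(ToySeed((10, 5), "float64"), "/tmp/my_array")
print(wrapped.shape, wrapped.dtype, wrapped.path)
```

Because the wrapper answers every query by delegating to the seed, code that only relies on the seed's interface can use the wrapper unchanged, and the extra path attribute is available to code that knows to look for it.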

dolomite_matrix.WrapperArraySeed.chunk_grid_WrapperArraySeed(x)[source]

See chunk_grid() for details.

Return type:

Tuple[int, ...]

dolomite_matrix.WrapperArraySeed.create_dask_array_WrapperArraySeed(x)[source]

See create_dask_array() for details.

dolomite_matrix.WrapperArraySeed.extract_dense_array_WrapperArraySeed(x, subset)[source]

See extract_dense_array() for details.

Return type:

ndarray

dolomite_matrix.WrapperArraySeed.extract_sparse_array_WrapperArraySeed(x, subset)[source]

See extract_sparse_array() for details.

Return type:

SparseNdarray

dolomite_matrix.WrapperArraySeed.is_masked_WrapperArraySeed(x)[source]

See is_masked() for details.

Return type:

bool

dolomite_matrix.WrapperArraySeed.is_sparse_WrapperArraySeed(x)[source]

See is_sparse() for details.

Return type:

bool

dolomite_matrix.choose_chunk_dimensions module

dolomite_matrix.choose_chunk_dimensions.choose_chunk_dimensions(shape, size, min_extent=100, buffer_size=10000000.0)[source]

Choose chunk dimensions to use for a dense HDF5 dataset. For each dimension, we consider a slice of the array that consists of the full extent of all other dimensions. We want this slice to occupy less than buffer_size in memory, and we resize the slice along the current dimension to achieve this. The chunk size is then chosen as the size of the slice along the current dimension. This ensures that efficient iteration along each dimension will not use any more than buffer_size bytes.

Parameters:
  • shape (Tuple[int, ...]) – Shape of the array.

  • size (int) – Size of each array element in bytes.

  • min_extent (int) – Minimum extent of each chunk dimension, to avoid problems with excessively small chunk sizes when the data is large.

  • buffer_size (int) – Size of the (conceptual) memory buffer to use for storing blocks of data during iteration through the array, in bytes.

Return type:

Tuple[int, ...]

Returns:

Tuple containing the chunk dimensions.
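
The procedure described above can be sketched as follows. This is one reading of the description rather than the function's source: the name sketch_chunk_dimensions is hypothetical, and the order in which min_extent and the array extent are applied as bounds is an assumption.

```python
from math import prod

def sketch_chunk_dimensions(shape, size, min_extent=100, buffer_size=1e7):
    """Hypothetical re-implementation of the described chunk choice."""
    # Number of elements that fit into the conceptual buffer.
    max_elements = int(buffer_size // size)
    chunks = []
    for d, extent in enumerate(shape):
        # Elements in a slice spanning the full extent of all other dimensions.
        other = prod(e for i, e in enumerate(shape) if i != d)
        # How far the slice can extend along dimension d within the buffer.
        proposed = max_elements // max(other, 1)
        # Assumption: avoid tiny chunks first, then cap at the array extent.
        proposed = max(proposed, min_extent)
        proposed = min(proposed, extent)
        chunks.append(proposed)
    return tuple(chunks)

print(sketch_chunk_dimensions((1000, 50000), size=8))  # (100, 1250)
print(sketch_chunk_dimensions((10, 20), size=8))       # (10, 20)
```

In the first call, a slice along the first dimension could hold only 25 rows within the buffer, so min_extent raises the chunk extent to 100. In the second call the whole array fits in the buffer, so each chunk dimension is capped at the array extent.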

dolomite_matrix.read_compressed_sparse_matrix module

dolomite_matrix.read_compressed_sparse_matrix.read_compressed_sparse_matrix(path, metadata, **kwargs)[source]

Read a compressed sparse matrix from its on-disk representation. In general, this function should not be called directly but instead be dispatched via read_object().

Parameters:
  • path (str) – Path to the directory containing the object.

  • metadata (Dict[str, Any]) – Metadata for this object.

  • kwargs – Further arguments, ignored.

Return type:

ReloadedArray

Returns:

A ReloadedArray containing an HDF5-backed compressed sparse matrix as a seed.

dolomite_matrix.read_dense_array module

dolomite_matrix.read_dense_array.read_dense_array(path, metadata, **kwargs)[source]

Read a dense array from its on-disk representation. In general, this function should not be called directly but instead be dispatched via read_object().

Parameters:
  • path (str) – Path to the directory containing the object.

  • metadata (Dict[str, Any]) – Metadata for this object.

  • kwargs – Further arguments, ignored.

Return type:

ReloadedArray

Returns:

A ReloadedArray containing an HDF5-backed dense array as a seed.

dolomite_matrix.save_compressed_sparse_matrix module

dolomite_matrix.save_compressed_sparse_matrix.save_compresssed_sparse_matrix_from_Sparse2darray(x, path, compressed_sparse_matrix_chunk_size=10000, compressed_sparse_matrix_buffer_size=100000000.0, **kwargs)[source]

Method for saving a SparseNdarray to disk, see save_object() for details.

Parameters:
  • x (SparseNdarray) – Object to be saved.

  • path (str) – Path to a directory to save x.

  • compressed_sparse_matrix_chunk_size (int) – Chunk size for the data and indices. Larger values improve compression at the potential cost of reducing random access efficiency.

  • compressed_sparse_matrix_buffer_size (int) – Size of the buffer in bytes, for blockwise processing and writing to file. Larger values improve speed at the cost of memory.

  • kwargs – Further arguments, ignored.

Returns:

x is saved to path.

dolomite_matrix.save_compressed_sparse_matrix.save_compresssed_sparse_matrix_from_scipy_coo_matrix(x, path, compressed_sparse_matrix_chunk_size=10000, compressed_sparse_matrix_buffer_size=100000000.0, **kwargs)[source]

Method for saving coo_matrix objects to disk, see save_object() for details.

Parameters:
  • x (coo_matrix) – Matrix to be saved.

  • path (str) – Path to a directory to save x.

  • compressed_sparse_matrix_chunk_size (int) – Chunk size for the data and indices. Larger values improve compression at the potential cost of reducing random access efficiency.

  • compressed_sparse_matrix_buffer_size (int) – Size of the buffer in bytes, for blockwise processing and writing to file. Larger values improve speed at the cost of memory.

  • kwargs – Further arguments, ignored.

Returns:

x is saved to path.
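
A coordinate-format matrix has to be rearranged before it can be stored in compressed sparse form. The sketch below shows the usual compressed sparse column layout (data, indices, indptr) in plain Python. It is an illustration of the general format only: the helper name coo_to_csc is hypothetical, and the choice of column-major compression is an assumption, not a statement about what this function writes to disk.

```python
def coo_to_csc(rows, cols, vals, ncols):
    """Hypothetical helper: convert COO triplets to CSC components."""
    # Sort the non-zero entries by column, then by row within each column.
    order = sorted(range(len(vals)), key=lambda i: (cols[i], rows[i]))
    data = [vals[i] for i in order]
    indices = [rows[i] for i in order]
    # indptr[j] .. indptr[j + 1] delimits the entries of column j.
    indptr = [0] * (ncols + 1)
    for c in cols:
        indptr[c + 1] += 1
    for j in range(ncols):
        indptr[j + 1] += indptr[j]
    return data, indices, indptr

data, indices, indptr = coo_to_csc(
    rows=[0, 2, 1, 0], cols=[1, 0, 1, 2], vals=[10, 20, 30, 40], ncols=3
)
print(data, indices, indptr)
```

The data and indices vectors grow with the number of non-zero entries, which is why a chunk size parameter for those two datasets is relevant for large matrices.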

dolomite_matrix.save_compressed_sparse_matrix.save_compresssed_sparse_matrix_from_scipy_csc_matrix(x, path, compressed_sparse_matrix_chunk_size=10000, compressed_sparse_matrix_buffer_size=100000000.0, **kwargs)[source]

Method for saving csc_matrix objects to disk, see save_object() for details.

Parameters:
  • x (csc_matrix) – Matrix to be saved.

  • path (str) – Path to a directory to save x.

  • compressed_sparse_matrix_chunk_size (int) – Chunk size for the data and indices. Larger values improve compression at the potential cost of reducing random access efficiency.

  • compressed_sparse_matrix_buffer_size (int) – Size of the buffer in bytes, for blockwise processing and writing to file. Larger values improve speed at the cost of memory.

  • kwargs – Further arguments, ignored.

Returns:

x is saved to path.

dolomite_matrix.save_compressed_sparse_matrix.save_compresssed_sparse_matrix_from_scipy_csr_matrix(x, path, compressed_sparse_matrix_chunk_size=10000, compressed_sparse_matrix_buffer_size=100000000.0, **kwargs)[source]

Method for saving csr_matrix objects to disk, see save_object() for details.

Parameters:
  • x (csr_matrix) – Matrix to be saved.

  • path (str) – Path to a directory to save x.

  • compressed_sparse_matrix_chunk_size (int) – Chunk size for the data and indices. Larger values improve compression at the potential cost of reducing random access efficiency.

  • compressed_sparse_matrix_buffer_size (int) – Size of the buffer in bytes, for blockwise processing and writing to file. Larger values improve speed at the cost of memory.

  • kwargs – Further arguments, ignored.

Returns:

x is saved to path.

dolomite_matrix.save_delayed_array module

dolomite_matrix.save_delayed_array.save_delayed_array(x, path, delayed_array_preserve_operations=False, **kwargs)[source]

Method to save DelayedArray objects to disk, see save_object() for details.

If the array is pristine (i.e., it carries no delayed operations), we attempt to use the save_object method of the seed. Otherwise, if delayed_array_preserve_operations = False, we save the DelayedArray as a dense array or a compressed sparse matrix.

Parameters:
  • x (DelayedArray) – Object to be saved.

  • path (str) – Path to a directory to save x.

  • delayed_array_preserve_operations (bool) – Whether to preserve delayed operations via the chihaya specification. Currently not supported.

  • kwargs – Further arguments, passed to the save_object methods for dense arrays and compressed sparse matrices.

Returns:

x is saved to path.

dolomite_matrix.save_dense_array module

dolomite_matrix.save_dense_array.save_dense_array_from_ndarray(x, path, dense_array_chunk_dimensions=None, dense_array_chunk_args={}, dense_array_buffer_size=100000000.0, **kwargs)[source]

Method for saving ndarray objects to disk, see save_object() for details.

Parameters:
  • x (ndarray) – Object to be saved.

  • path (str) – Path to a directory to save x.

  • dense_array_chunk_dimensions (Optional[Tuple[int, ...]]) – Chunk dimensions for the HDF5 dataset. Larger values improve compression at the potential cost of reducing random access efficiency. If not provided, we choose some chunk sizes with choose_chunk_dimensions().

  • dense_array_chunk_args (Dict) – Arguments to pass to choose_chunk_dimensions if dense_array_chunk_dimensions is not provided.

  • dense_array_buffer_size (int) – Size of the buffer in bytes, for blockwise processing and writing to file. Larger values improve speed at the cost of memory.

  • kwargs – Further arguments, ignored.

Returns:

x is saved to path.
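
The role of a byte-sized buffer in blockwise processing can be sketched as follows. The blocking strategy actually used by this function is not described here, so the example assumes a 2-dimensional array processed in blocks of consecutive rows. The name iterate_row_blocks is hypothetical.

```python
def iterate_row_blocks(shape, itemsize, buffer_size=1e8):
    """Hypothetical helper: yield (start, end) row ranges that fit the buffer."""
    nrows, ncols = shape
    bytes_per_row = max(1, ncols * itemsize)
    # Always process at least one row, even if a single row exceeds the buffer.
    rows_per_block = max(1, int(buffer_size // bytes_per_row))
    start = 0
    while start < nrows:
        end = min(nrows, start + rows_per_block)
        yield (start, end)
        start = end

blocks = list(iterate_row_blocks((1000, 500), itemsize=8, buffer_size=1e6))
print(blocks)  # [(0, 250), (250, 500), (500, 750), (750, 1000)]
```

A larger buffer means fewer, bigger blocks and therefore fewer write calls, at the cost of holding more data in memory at once.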

Module contents