vignettes/userguide.Rmd
The gobbler package implements an R client for the service of the same name. This allows users to store and read data on a managed registry in a shared filesystem, e.g., on high-performance computing (HPC) clusters. It also provides mechanisms for project maintainers to manage upload authorizations and third-party contributions. Readers are referred to the Gobbler documentation for a description of the concepts; this guide focuses strictly on the usage of the gobbler package.
Most of this vignette assumes that a Gobbler service is running on
the same filesystem as the users of the gobbler package.
This is typically done with service managers like systemctl
with fixed locations for the staging directory and registry. For
demonstration purposes, we’ll set up a test instance of the service with
some temporary paths for the staging directory and registry.
library(gobbler)
info <- startGobbler()
STAGING <- info$staging
REGISTRY <- info$registry
URL <- info$url
As we’re the administrators of this instance, we can just create a new project ourselves:
createProject("test", staging=STAGING, url=URL)
We’ll also upload a new asset to this project in the registry, just for demonstration purposes.
src <- allocateUploadDirectory(STAGING)
write(file=file.path(src, "foo"), "BAR")
dir.create(file.path(src, "whee"))
write(file=file.path(src, "whee", "blah"), "stuff")
write(file=file.path(src, "whee2"), "more-stuff")
uploadDirectory("test", "simple", "v1", src, staging=STAGING, url=URL)
gobbler provides several convenience methods for examining the Gobbler registry:
listAssets("test", registry=REGISTRY)
## [1] "simple"
listVersions("test", "simple", registry=REGISTRY)
## [1] "v1"
listFiles("test", "simple", "v1", registry=REGISTRY)
## [1] "..manifest" "..summary" "foo" "whee/blah" "whee2"
versionPath("test", "simple", "v1", registry=REGISTRY)
## [1] "/tmp/RtmpWQTurf/file4053228a90/test/simple/v1"
We can fetch the summaries and manifests for each version of a project’s assets.
fetchManifest("test", "simple", "v1", registry=REGISTRY)
## $foo
## $foo$size
## [1] 4
##
## $foo$md5sum
## [1] "f98bf6f12e995a053b7647b10d937912"
##
##
## $`whee/blah`
## $`whee/blah`$size
## [1] 6
##
## $`whee/blah`$md5sum
## [1] "9eb84090956c484e32cb6c08455a667b"
##
##
## $whee2
## $whee2$size
## [1] 11
##
## $whee2$md5sum
## [1] "aeed28071296d9424ea4b7eee861386c"
fetchSummary("test", "simple", "v1", registry=REGISTRY)
## $upload_user_id
## [1] "root"
##
## $upload_start
## [1] "2024-10-28 00:28:30 UTC"
##
## $upload_finish
## [1] "2024-10-28 00:28:30 UTC"
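Since the manifest is returned as a named list with one entry per file, it can be flattened into a data frame for easier inspection. The following is a minimal sketch that uses a hand-built list mirroring the output above rather than a live registry; the names manifest_table and total_size are illustrative only.

```r
# Mock manifest copied from the fetchManifest() output shown above.
manifest <- list(
    foo = list(size = 4L, md5sum = "f98bf6f12e995a053b7647b10d937912"),
    `whee/blah` = list(size = 6L, md5sum = "9eb84090956c484e32cb6c08455a667b"),
    whee2 = list(size = 11L, md5sum = "aeed28071296d9424ea4b7eee861386c")
)

# One row per file, with the relative path taken from the list names.
manifest_table <- data.frame(
    path = names(manifest),
    size = unname(vapply(manifest, function(x) as.integer(x$size), 0L)),
    md5sum = unname(vapply(manifest, function(x) x$md5sum, "")),
    stringsAsFactors = FALSE
)

# Total number of bytes across all files in this version.
total_size <- sum(manifest_table$size)
```

With a running service, the hand-built list could be swapped for the result of fetchManifest().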
We can get the latest version of an asset:
fetchLatest("test", "simple", registry=REGISTRY)
## [1] "v1"
We can also obtain a path to the contents of a versioned asset, a subdirectory of that asset, or an individual file in the registry:
versionPath("test", "simple", "v1", registry=REGISTRY)
## [1] "/tmp/RtmpWQTurf/file4053228a90/test/simple/v1"
fetchDirectory("test/simple/v1/whee", registry=REGISTRY)
## [1] "/tmp/RtmpWQTurf/file4053228a90/test/simple/v1/whee"
fetchFile("test/simple/v1/whee/blah", registry=REGISTRY)
## [1] "/tmp/RtmpWQTurf/file4053228a90/test/simple/v1/whee/blah"
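The printed paths above suggest that files sit under the registry in a project/asset/version/path layout. This is inferred from the output rather than stated as a guarantee, so the sketch below is only an illustration of how such a relative path can be assembled and taken apart in base R; registry_root is a hypothetical placeholder.

```r
# Hypothetical registry location; in practice this comes from the service setup.
registry_root <- "/path/to/registry"

# Relative path in the project/asset/version/path form used by fetchFile().
relative <- paste("test", "simple", "v1", "whee/blah", sep = "/")

# Split the relative path back into its components.
parts <- strsplit(relative, "/", fixed = TRUE)[[1]]
components <- list(
    project = parts[1],
    asset = parts[2],
    version = parts[3],
    path = paste(parts[-(1:3)], collapse = "/")
)

# Assumed on-disk location, following the layout seen in the output above.
full_path <- file.path(registry_root, relative)
```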
The functions in this section of the vignette can also be executed remotely (i.e., outside of the shared filesystem) if the url= argument is supplied. This provides read-only access to the registry for a wider range of applications, e.g., on the cloud or in the browser.
# Forcing remote access to stop it from using the filesystem.
fetchLatest('test', 'simple', registry=REGISTRY, url=URL, forceRemote=TRUE)
## [1] "v1"
To demonstrate, let’s say we have some files that we wish to upload to the registry. We allocate a subdirectory in the staging directory and we put our files into it.
# Not strictly necessary to use allocateUploadDirectory here, but it avoids an
# extra link/copy step in uploadDirectory, so we might as well use it.
tmp <- allocateUploadDirectory(STAGING)
write(file=file.path(tmp, "foo"), letters)
write(file=file.path(tmp, "bar"), LETTERS)
write(file=file.path(tmp, "whee"), 1:10)
Then we run uploadDirectory() with the specified parameters.
uploadDirectory(
project="test",
asset="new_asset",
version="new_version",
directory=tmp,
staging=STAGING,
url=URL
)
# Check that it was actually uploaded to the registry:
fetchManifest("test", "new_asset", "new_version", registry=REGISTRY)
## $bar
## $bar$size
## [1] 52
##
## $bar$md5sum
## [1] "0eb827652a5c272e1c82002f1c972018"
##
##
## $foo
## $foo$size
## [1] 52
##
## $foo$md5sum
## [1] "b7fdd99fac291c4bbf958d9aee731951"
##
##
## $whee
## $whee$size
## [1] 21
##
## $whee$md5sum
## [1] "c34bcea9f2263deb3379103c9f10c130"
More advanced developers can improve efficiency by explicitly deduplicating files in their upload directories. This is achieved by creating symbolic links to existing files in the Gobbler registry. The Gobbler will automatically recognize symbolic links in the upload directory that target files in the registry, and avoid creating extra copies of those files. The Gobbler will also automatically attempt to deduplicate files that are the same across consecutive versions of the same asset, based on the file size and MD5 checksums.
This capability is particularly useful when creating new versions of existing assets. Only the modified files need to be uploaded, while the rest of the files can be linked to their counterparts in the previous version. In fact, this pattern is so common that it can be expedited via cloneVersion():
dest <- allocateUploadDirectory(STAGING)
cloneVersion("test", "simple", "v1", destination=dest, registry=REGISTRY)
# Do some modifications in 'dest' to create a new version, e.g., add a file.
# However, users should treat symlink targets as read-only - so if you want to
# modify a file, instead delete the symlink and replace it with a new file.
write(file=file.path(dest, "BFFs"), c("Aaron", "Jayaram"))
Then we can just pass this directory back to uploadDirectory():
init <- uploadDirectory(
project="test",
asset="links",
version="whee",
directory=dest,
staging=STAGING,
url=URL
)
# Automatically converts the cloned files into links.
mann <- fetchManifest("test", "links", "whee", registry=REGISTRY)
mann[["foo"]]$link
## $project
## [1] "test"
##
## $asset
## [1] "simple"
##
## $version
## [1] "v1"
##
## $path
## [1] "foo"
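The symlink-based deduplication described in this section can be illustrated with base R alone. The sketch below uses two temporary directories as stand-ins for a registry version and an upload directory, so it does not touch a real Gobbler instance; it assumes a Unix-like system where symbolic links are available. It also shows the "delete the link, then write a new file" pattern for changing a linked file without modifying the link target.

```r
# Stand-ins for a registry version directory and an upload directory.
registry_dir <- tempfile("registry-")
upload_dir <- tempfile("upload-")
dir.create(registry_dir)
dir.create(upload_dir)

# A file that already exists in the (mock) registry.
existing <- file.path(registry_dir, "foo")
write(file = existing, "BAR")

# Link to it from the upload directory instead of copying it.
linked <- file.path(upload_dir, "foo")
link_created <- file.symlink(existing, linked)

# Reading through the link gives the same content as the target.
same_md5 <- unname(tools::md5sum(linked) == tools::md5sum(existing))

# To change the file, remove the link and write a fresh file in its place.
# This leaves the original target untouched.
unlink(linked)
write(file = linked, "NEW")
is_still_link <- nzchar(Sys.readlink(linked))
original_content <- readLines(existing)
```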
Upload authorization is determined by each project’s permissions, which are controlled by project owners. Both uploaders and owners are identified based on their user IDs (UIDs):
fetchPermissions("test", REGISTRY)
## $owners
## $owners[[1]]
## [1] "root"
##
##
## $uploaders
## list()
Owners can add more uploaders (or owners) via the setPermissions() function. Uploaders can be scoped to individual assets or versions, and an expiry date may be attached to each authorization:
setPermissions(
"test",
uploaders=list(
list(
id="jkanche",
until=Sys.time() + 24 * 60 * 60,
asset="jays-happy-fun-time",
version="1"
)
),
staging=STAGING,
url=URL,
registry=REGISTRY
)
fetchPermissions("test", REGISTRY)
## $owners
## $owners[[1]]
## [1] "root"
##
##
## $uploaders
## $uploaders[[1]]
## $uploaders[[1]]$id
## [1] "jkanche"
##
## $uploaders[[1]]$asset
## [1] "jays-happy-fun-time"
##
## $uploaders[[1]]$version
## [1] "1"
##
## $uploaders[[1]]$until
## [1] "2024-10-29 00:28:31 UTC"
Uploads can be defined as “probational” whereby they must be approved by the project owners before they are considered complete. Alternatively, an owner may reject an upload, which deletes all the uploaded files from the backend. This provides a mechanism for storing files that may or may not be useful without committing to long-term immutability. To demonstrate, let’s perform a probational upload:
tmp <- allocateUploadDirectory(STAGING)
write(file=file.path(tmp, "stuff"), 1:10)
init <- uploadDirectory(
project="test",
asset="probational",
version="thingy",
directory=tmp,
staging=STAGING,
url=URL,
probation=TRUE
)
# Summary has the on_probation=TRUE flag.
fetchSummary("test", "probational", "thingy", REGISTRY)
## $upload_user_id
## [1] "root"
##
## $upload_start
## [1] "2024-10-28 00:28:31 UTC"
##
## $upload_finish
## [1] "2024-10-28 00:28:31 UTC"
##
## $on_probation
## [1] TRUE
We can then approve (or reject) the probational status. Approval clears the on_probation flag, while rejection deletes the version from the registry.
approveProbation("test", "probational", "thingy", staging=STAGING, url=URL)
# Flag is gone!
fetchSummary("test", "probational", "thingy", REGISTRY)
## $upload_user_id
## [1] "root"
##
## $upload_start
## [1] "2024-10-28 00:28:31 UTC"
##
## $upload_finish
## [1] "2024-10-28 00:28:31 UTC"
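Because the on_probation entry is removed entirely after approval rather than being set to FALSE, code that inspects a summary should tolerate its absence. A minimal sketch, using mock lists modelled on the output above (with the timestamps stored as plain strings for simplicity) and a hypothetical helper named is_probational():

```r
# Mock summaries mirroring the fetchSummary() output before and after approval.
summary_before <- list(
    upload_user_id = "root",
    upload_start = "2024-10-28 00:28:31 UTC",
    upload_finish = "2024-10-28 00:28:31 UTC",
    on_probation = TRUE
)
summary_after <- summary_before
summary_after$on_probation <- NULL

# isTRUE() returns FALSE for NULL, so a missing flag means "not on probation".
is_probational <- function(summary) isTRUE(summary$on_probation)

before <- is_probational(summary_before)
after <- is_probational(summary_after)
```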
Unless specified otherwise, all uploaders in the permissions are considered to be “untrusted”, and any uploads from such untrusted users are considered probational. This allows project maintainers to manage third-party contributions that may need several rounds of revision before approval. An uploader can be trusted by setting trusted=TRUE in setPermissions().
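Uploader entries are plain lists, so they can be assembled programmatically before being passed to setPermissions(). The sketch below uses a hypothetical helper, make_uploader(), that keeps only the fields that were actually supplied. It assumes that trusted is a field of each uploader entry alongside id, asset, version and until, which the text above does not spell out; "another-user" is a made-up user ID.

```r
# Hypothetical helper: build one uploader entry, dropping unset fields.
make_uploader <- function(id, asset = NULL, version = NULL, until = NULL, trusted = NULL) {
    entry <- list(id = id, asset = asset, version = version, until = until, trusted = trusted)
    Filter(Negate(is.null), entry)
}

uploaders <- list(
    # Scoped to one asset and version, expiring in 24 hours.
    make_uploader(
        "jkanche",
        asset = "jays-happy-fun-time",
        version = "1",
        until = Sys.time() + 24 * 60 * 60
    ),
    # Assumed form of a trusted uploader with no other restrictions.
    make_uploader("another-user", trusted = TRUE)
)
```

The resulting list has the same shape as the uploaders= argument used earlier, so it could be supplied to setPermissions() on a running instance.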
Gypsum-like storage quotas are not yet enforced by the Gobbler. Nonetheless, the Gobbler keeps track of the current disk usage of each project:
fetchUsage("test", REGISTRY)
## [1] 181
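The figure above can be reproduced from the uploads made in this vignette. The sketch below rests on an assumption inferred from the numbers rather than from any documented rule: that usage counts the bytes of newly uploaded files and excludes linked files as well as the ..manifest and ..summary metadata. The helper size_of() is illustrative, and the byte counts assume Unix-style line endings.

```r
# Illustrative helper: number of bytes produced by write() for a given input.
size_of <- function(x) {
    path <- tempfile()
    on.exit(unlink(path))
    write(file = path, x)
    file.size(path)
}

# Bytes of newly written (non-linked) files in each uploaded version.
new_bytes <- c(
    "simple/v1" = size_of("BAR") + size_of("stuff") + size_of("more-stuff"),
    "new_asset/new_version" = size_of(letters) + size_of(LETTERS) + size_of(1:10),
    "links/whee" = size_of(c("Aaron", "Jayaram")),  # cloned files are links
    "probational/thingy" = size_of(1:10)
)

total_usage <- sum(new_bytes)
```

On a Unix-like system, the per-version contributions are 21, 125, 14 and 21 bytes, which sum to the 181 reported by fetchUsage().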
Administrators are responsible for creating new projects with the relevant permissions. Authorized users can then upload new versions/assets via uploadDirectory() to this project.
createProject("another-project",
staging=STAGING,
url=URL,
owners="LTLA",
uploaders=list(list(id="jkanche"))
)
Administrators of the Gobbler service can manually refresh the latest version for an asset and the disk usage for a project. This is required on very rare occasions where there are simultaneous uploads to the same project.
refreshLatest("test", "simple", staging=STAGING, url=URL)
refreshUsage("test", staging=STAGING, url=URL)
Administrators may also delete projects, assets or versions, though this should be done sparingly as it violates the Gobbler’s expectations of immutability.
removeProject("test", staging=STAGING, url=URL)
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gobbler_0.3.10 BiocStyle_2.33.1
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.3 knitr_1.48 rlang_1.1.4
## [4] xfun_0.48 textshaping_0.4.0 jsonlite_1.8.9
## [7] glue_1.8.0 htmltools_0.5.8.1 ragg_1.3.3
## [10] sass_0.4.9 rappdirs_0.3.3 rmarkdown_2.28
## [13] evaluate_1.0.1 jquerylib_0.1.4 fastmap_1.2.0
## [16] yaml_2.3.10 lifecycle_1.0.4 httr2_1.0.5
## [19] bookdown_0.41 BiocManager_1.30.25 compiler_4.4.1
## [22] fs_1.6.4 htmlwidgets_1.6.4 systemfonts_1.1.0
## [25] digest_0.6.37 R6_2.5.1 curl_5.2.3
## [28] magrittr_2.0.3 bslib_0.8.0 tools_4.4.1
## [31] pkgdown_2.1.1 cachem_1.1.0 desc_1.4.3