Introduction

The gobbler package implements an R client for the service of the same name. This allows users to easily store and read data on a managed registry in a shared filesystem, e.g., on high-performance computing (HPC) clusters. It also provides mechanisms for project maintainers to manage upload authorizations and third-party contributions. Readers are referred to the Gobbler documentation for a description of the concepts; this guide focuses strictly on the usage of the gobbler package.

Demonstration setup

Most of this vignette assumes that a Gobbler service is running on the same filesystem as the users of the gobbler package. The service is typically run under a service manager like systemd (controlled via systemctl), with fixed locations for the staging directory and registry. For demonstration purposes, we’ll set up a test instance of the service with some temporary paths for the staging directory and registry.

library(gobbler)
info <- startGobbler()
STAGING <- info$staging
REGISTRY <- info$registry
URL <- info$url

As we’re the administrators of this instance, we can just create a new project ourselves:

createProject("test", staging=STAGING, url=URL)

We’ll also upload a new asset to this project, just for demonstration purposes.

src <- allocateUploadDirectory(STAGING)
write(file=file.path(src, "foo"), "BAR")
dir.create(file.path(src, "whee"))
write(file=file.path(src, "whee", "blah"), "stuff")
write(file=file.path(src, "whee2"), "more-stuff")
uploadDirectory("test", "simple", "v1", src, staging=STAGING, url=URL)

Reading files

gobbler provides several convenience methods for examining the Gobbler registry:

listAssets("test", registry=REGISTRY)
## [1] "simple"
listVersions("test", "simple", registry=REGISTRY)
## [1] "v1"
listFiles("test", "simple", "v1", registry=REGISTRY)
## [1] "..manifest" "..summary"  "foo"        "whee/blah"  "whee2"
versionPath("test", "simple", "v1", registry=REGISTRY)
## [1] "/tmp/Rtmp7r2wjo/file42d3094e9b1/test/simple/v1"
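
The path returned by versionPath() suggests that each version sits at registry/project/asset/version on the filesystem. As a rough illustration, assuming that layout holds, a similar listing can be approximated with base R on a mock directory tree. The helper list_version_files() and the temporary mock_registry are hypothetical and not part of gobbler.

```r
# Build a mock registry in a temporary directory (a stand-in, not a real
# Gobbler registry; the layout is inferred from the versionPath() output).
mock_registry <- tempfile("registry")
version_dir <- file.path(mock_registry, "test", "simple", "v1")
dir.create(file.path(version_dir, "whee"), recursive=TRUE)
write(file=file.path(version_dir, "foo"), "BAR")
write(file=file.path(version_dir, "whee", "blah"), "stuff")
write(file=file.path(version_dir, "whee2"), "more-stuff")

# Hypothetical helper: list all files under a version directory, including
# dot-prefixed files, as paths relative to that directory.
list_version_files <- function(registry, project, asset, version) {
    vdir <- file.path(registry, project, asset, version)
    list.files(vdir, recursive=TRUE, all.files=TRUE)
}

found <- list_version_files(mock_registry, "test", "simple", "v1")
print(found)
```

In a real registry the listing would also contain the ..manifest and ..summary files shown above, which the mock tree omits.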

We can fetch the summaries and manifests for each version of a project’s assets.

fetchManifest("test", "simple", "v1", registry=REGISTRY)
## $foo
## $foo$size
## [1] 4
## 
## $foo$md5sum
## [1] "f98bf6f12e995a053b7647b10d937912"
## 
## 
## $`whee/blah`
## $`whee/blah`$size
## [1] 6
## 
## $`whee/blah`$md5sum
## [1] "9eb84090956c484e32cb6c08455a667b"
## 
## 
## $whee2
## $whee2$size
## [1] 11
## 
## $whee2$md5sum
## [1] "aeed28071296d9424ea4b7eee861386c"
fetchSummary("test", "simple", "v1", registry=REGISTRY)
## $upload_user_id
## [1] "root"
## 
## $upload_start
## [1] "2024-09-16 04:43:07 UTC"
## 
## $upload_finish
## [1] "2024-09-16 04:43:07 UTC"
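
Each manifest entry records a size in bytes and an MD5 checksum, so a local copy of a file can be checked against it with base R. The following is a minimal sketch on a temporary file; matches_manifest() is a hypothetical helper, and in practice the entry would come from fetchManifest() rather than being built from the file itself.

```r
# Recreate the 'foo' file from the demonstration in a temporary location.
path <- tempfile("foo")
write(file=path, "BAR")

# Hypothetical helper: compare a file against a manifest-style entry,
# i.e., a list with 'size' (bytes) and 'md5sum' (hex string) fields.
matches_manifest <- function(path, entry) {
    identical(as.numeric(file.size(path)), as.numeric(entry$size)) &&
        identical(unname(tools::md5sum(path)), entry$md5sum)
}

# Build an entry from the file itself, then confirm that it matches.
entry <- list(size=file.size(path), md5sum=unname(tools::md5sum(path)))
ok <- matches_manifest(path, entry)

# An entry with a different size should fail the check.
bad <- entry
bad$size <- bad$size + 1
not_ok <- matches_manifest(path, bad)
```

The size of 4 bytes for this file agrees with the manifest above, as write() appends a newline to the three characters of "BAR".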

We can get the latest version of an asset:

fetchLatest("test", "simple", registry=REGISTRY)
## [1] "v1"

We can also obtain a path to the contents of a versioned asset, a subdirectory of that asset, or an individual file in the registry:

versionPath("test", "simple", "v1", registry=REGISTRY)
## [1] "/tmp/Rtmp7r2wjo/file42d3094e9b1/test/simple/v1"
fetchDirectory("test/simple/v1/whee", registry=REGISTRY)
## [1] "/tmp/Rtmp7r2wjo/file42d3094e9b1/test/simple/v1/whee"
fetchFile("test/simple/v1/whee/blah", registry=REGISTRY)
## [1] "/tmp/Rtmp7r2wjo/file42d3094e9b1/test/simple/v1/whee/blah"

The functions in this section of the vignette can also be executed remotely (i.e., outside of the shared filesystem) if the url= argument is supplied. This provides read-only access to the registry for a wider range of applications, e.g., on the cloud or in the browser.

# Forcing remote access to stop it from using the filesystem.
fetchLatest('test', 'simple', registry=REGISTRY, url=URL, forceRemote=TRUE)
## [1] "v1"

Uploading files

Basic usage

To demonstrate, let’s say we have some files that we wish to upload to the registry. We allocate a subdirectory in the staging directory and we put our files into it.

# Not strictly necessary to use allocateUploadDirectory here, but it avoids an
# extra link/copy step in uploadDirectory, so we might as well use it. 
tmp <- allocateUploadDirectory(STAGING)

write(file=file.path(tmp, "foo"), letters)
write(file=file.path(tmp, "bar"), LETTERS)
write(file=file.path(tmp, "whee"), 1:10)

Then we run uploadDirectory() with the specified parameters.

uploadDirectory(
    project="test",
    asset="new_asset",
    version="new_version",
    directory=tmp,
    staging=STAGING,
    url=URL
)

# Check that it was actually uploaded to the registry:
fetchManifest("test", "new_asset", "new_version", registry=REGISTRY)
## $bar
## $bar$size
## [1] 52
## 
## $bar$md5sum
## [1] "0eb827652a5c272e1c82002f1c972018"
## 
## 
## $foo
## $foo$size
## [1] 52
## 
## $foo$md5sum
## [1] "b7fdd99fac291c4bbf958d9aee731951"
## 
## 
## $whee
## $whee$size
## [1] 21
## 
## $whee$md5sum
## [1] "c34bcea9f2263deb3379103c9f10c130"

More advanced developers can improve efficiency by explicitly deduplicating files in their upload directories. This is achieved by creating symbolic links to existing files in the Gobbler registry. The Gobbler will automatically recognize symbolic links in the upload directory that target files in the registry, and avoid creating extra copies of those files. The Gobbler will also automatically attempt to deduplicate files that are the same across consecutive versions of the same asset, based on the file size and MD5 checksums.
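
As a minimal sketch of the manual linking step, the code below creates a symbolic link in an upload directory that points to an existing file. It assumes a Unix-like filesystem where file.symlink() is supported, and it uses temporary directories as stand-ins: in a real session, the target path would come from fetchFile() or versionPath(), and the upload directory from allocateUploadDirectory().

```r
# Mock registry and upload directory in temporary locations (stand-ins only).
mock_registry <- tempfile("registry")
existing_dir <- file.path(mock_registry, "test", "simple", "v1")
dir.create(existing_dir, recursive=TRUE)
existing <- file.path(existing_dir, "foo")
write(file=existing, "BAR")

upload_dir <- tempfile("upload")
dir.create(upload_dir)

# Link to the existing registry file instead of copying it.
linked <- file.path(upload_dir, "foo")
created <- file.symlink(existing, linked)

# The link resolves to the registry file, and reads give the same content.
target <- Sys.readlink(linked)
content <- readLines(linked)
```

The link occupies almost no space in the upload directory, which is what allows the registry to avoid storing a second copy.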

This capability is particularly useful when creating new versions of existing assets. Only the modified files need to be uploaded, while the rest of the files can be linked to their counterparts in the previous version. In fact, this pattern is so common that it can be expedited via cloneVersion():

dest <- allocateUploadDirectory(STAGING)
cloneVersion("test", "simple", "v1", destination=dest, registry=REGISTRY)

# Do some modifications in 'dest' to create a new version, e.g., add a file.
# However, users should treat symlink targets as read-only - so if you want to
# modify a file, instead delete the symlink and replace it with a new file.
write(file=file.path(dest, "BFFs"), c("Aaron", "Jayaram"))
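
The advice about treating symlink targets as read-only can be sketched as follows. This is an illustration on temporary files, assuming a Unix-like filesystem; the paths are stand-ins for a registry file and a cloned directory, not real Gobbler locations.

```r
# Mock "registry" file and a clone directory containing a symlink to it.
target <- tempfile("registry-foo")
write(file=target, "BAR")
clone_dir <- tempfile("clone")
dir.create(clone_dir)
link <- file.path(clone_dir, "foo")
file.symlink(target, link)

# To change 'foo' in the new version, remove the link first ...
unlink(link)
# ... and then write a regular file in its place.
write(file=link, "BAZ")

# The replacement is a regular file, and the original target is untouched.
is_still_link <- nzchar(Sys.readlink(link))
target_content <- readLines(target)
new_content <- readLines(link)
```

Writing directly to the link path without removing it first would instead follow the link and modify the target, which is what the comment above warns against.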

Then we can just pass this directory back to uploadDirectory():

init <- uploadDirectory(
    project="test",
    asset="links",
    version="whee",
    directory=dest,
    staging=STAGING,
    url=URL
)

# Automatically converts the cloned files into links.
mann <- fetchManifest("test", "links", "whee", registry=REGISTRY)
mann[["foo"]]$link
## $project
## [1] "test"
## 
## $asset
## [1] "simple"
## 
## $version
## [1] "v1"
## 
## $path
## [1] "foo"

Changing permissions

Upload authorization is determined by each project’s permissions, which are controlled by project owners. Both uploaders and owners are identified based on their user IDs (UIDs):

fetchPermissions("test", REGISTRY)
## $owners
## $owners[[1]]
## [1] "root"
## 
## 
## $uploaders
## list()

Owners can add more uploaders (or owners) via the setPermissions() function. Uploaders can be scoped to individual assets or versions, and an expiry date may be attached to each authorization:

setPermissions(
    "test", 
    uploaders=list(
        list(
            id="jkanche", 
            until=Sys.time() + 24 * 60 * 60,
            asset="jays-happy-fun-time",
            version="1"
        )
    ),
    staging=STAGING,
    url=URL,
    registry=REGISTRY    
)

fetchPermissions("test", REGISTRY)
## $owners
## $owners[[1]]
## [1] "root"
## 
## 
## $uploaders
## $uploaders[[1]]
## $uploaders[[1]]$id
## [1] "jkanche"
## 
## $uploaders[[1]]$asset
## [1] "jays-happy-fun-time"
## 
## $uploaders[[1]]$version
## [1] "1"
## 
## $uploaders[[1]]$until
## [1] "2024-09-17 04:43:08 UTC"
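
To make the scoping and expiry semantics concrete, here is a small base R sketch of how such an entry might be interpreted. The covers() helper is hypothetical and reflects only a reading of the description above; it is not the Gobbler’s actual authorization logic.

```r
# An uploader entry like the one passed to setPermissions() above.
uploader <- list(
    id="jkanche",
    until=as.POSIXct("2024-09-17 04:43:08", tz="UTC"),
    asset="jays-happy-fun-time",
    version="1"
)

# Hypothetical helper: would this entry cover a given upload request?
# A missing 'asset', 'version' or 'until' field is treated as "no restriction".
covers <- function(entry, user, asset, version, now=Sys.time()) {
    identical(entry$id, user) &&
        (is.null(entry$asset) || identical(entry$asset, asset)) &&
        (is.null(entry$version) || identical(entry$version, version)) &&
        (is.null(entry$until) || now <= entry$until)
}

before <- as.POSIXct("2024-09-16 12:00:00", tz="UTC")
later <- as.POSIXct("2024-09-18 12:00:00", tz="UTC")

# In scope and not yet expired.
in_scope <- covers(uploader, "jkanche", "jays-happy-fun-time", "1", now=before)
# Same user, but a different asset.
wrong_asset <- covers(uploader, "jkanche", "other-asset", "1", now=before)
# Correct scope, but after the expiry time.
expired <- covers(uploader, "jkanche", "jays-happy-fun-time", "1", now=later)
```

Under this reading, only the first request would be authorized; the other two fall outside the asset scope or the expiry window.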

Probational uploads

Uploads can be defined as “probational” whereby they must be approved by the project owners before they are considered complete. Alternatively, an owner may reject an upload, which deletes all the uploaded files from the backend. This provides a mechanism for storing files that may or may not be useful without committing to long-term immutability. To demonstrate, let’s perform a probational upload:

tmp <- allocateUploadDirectory(STAGING)
write(file=file.path(tmp, "stuff"), 1:10)

init <- uploadDirectory(
    project="test",
    asset="probational",
    version="thingy",
    directory=tmp,
    staging=STAGING,
    url=URL,
    probation=TRUE
)

# Summary has the on_probation=TRUE flag.
fetchSummary("test", "probational", "thingy", REGISTRY)
## $upload_user_id
## [1] "root"
## 
## $upload_start
## [1] "2024-09-16 04:43:08 UTC"
## 
## $upload_finish
## [1] "2024-09-16 04:43:08 UTC"
## 
## $on_probation
## [1] TRUE

We can then approve (or reject) the probational status. This either clears the on_probation= flag or it deletes the version from the registry.

approveProbation("test", "probational", "thingy", staging=STAGING, url=URL)

# Flag is gone!
fetchSummary("test", "probational", "thingy", REGISTRY)
## $upload_user_id
## [1] "root"
## 
## $upload_start
## [1] "2024-09-16 04:43:08 UTC"
## 
## $upload_finish
## [1] "2024-09-16 04:43:08 UTC"

Unless specified otherwise, all uploaders in the permissions are considered to be “untrusted”, and any uploads from such untrusted users are considered probational. This allows project maintainers to manage third-party contributions that may need several rounds of revision before approval. An uploader can be trusted by setting trusted=TRUE in setPermissions().
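
The rule above can be summarized in a few lines of base R. This is only a sketch of the described behavior: the uploader entries mirror the lists passed to the uploaders= argument of setPermissions(), while is_probational() is a hypothetical helper and not a gobbler function.

```r
# Uploader entries as they might be passed to setPermissions(uploaders=...).
untrusted <- list(id="jkanche")
trusted <- list(id="jkanche", trusted=TRUE)

# Hypothetical helper mirroring the rule in the text: an upload is probational
# if it was requested as such, or if the uploader is not explicitly trusted.
is_probational <- function(entry, requested=FALSE) {
    requested || !isTRUE(entry$trusted)
}

from_untrusted <- is_probational(untrusted)              # forced probational
from_trusted <- is_probational(trusted)                  # regular upload
trusted_by_request <- is_probational(trusted, TRUE)      # probational by choice
```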

Inspecting the quota

Gypsum-like storage quotas are not yet enforced by the Gobbler. Nonetheless, the Gobbler keeps track of the current disk usage of the project:

fetchUsage("test", REGISTRY)
## [1] 181
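
In this demonstration, the reported usage happens to equal the total size of the files that were written directly in the earlier uploads, with the files linked from simple/v1 into links/whee not counted again. The sketch below rewrites the same contents to a scratch directory and sums their sizes; it is an observation about this example on a Unix-like system, not a statement of how the Gobbler computes usage in general.

```r
# Contents of every non-link file written during the demonstration,
# keyed by a label of the form <asset>_<file>.
contents <- list(
    simple_foo = "BAR",
    simple_whee_blah = "stuff",
    simple_whee2 = "more-stuff",
    new_asset_foo = letters,
    new_asset_bar = LETTERS,
    new_asset_whee = 1:10,
    links_BFFs = c("Aaron", "Jayaram"),
    probational_stuff = 1:10
)

# Write each one to a scratch directory with the same write() calls.
scratch <- tempfile("usage")
dir.create(scratch)
for (name in names(contents)) {
    write(file=file.path(scratch, name), contents[[name]])
}

# Sum the sizes in bytes.
sizes <- file.size(file.path(scratch, names(contents)))
names(sizes) <- names(contents)
total <- sum(sizes)
print(total)
## [1] 181
```

The per-file sizes also agree with the manifests shown earlier, e.g., 52 bytes for the files holding letters and LETTERS and 21 bytes for the files holding 1:10.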

Administration

Administrators are responsible for creating new projects with the relevant permissions. Authorized users can then upload new assets or versions to this project via uploadDirectory().

createProject("another-project", 
    staging=STAGING, 
    url=URL,
    owners="LTLA", 
    uploaders=list(list(id="jkanche"))
)

Administrators of the Gobbler service can manually refresh the latest version for an asset and the disk usage for a project. This is required on very rare occasions where there are simultaneous uploads to the same project.

refreshLatest("test", "simple", staging=STAGING, url=URL)
refreshUsage("test", staging=STAGING, url=URL)

Administrators may also delete projects, assets or versions, though this should be done sparingly as it violates the Gobbler’s expectations of immutability.

removeProject("test", staging=STAGING, url=URL)

Session information

## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] gobbler_0.3.7    BiocStyle_2.33.1
## 
## loaded via a namespace (and not attached):
##  [1] cli_3.6.3           knitr_1.48          rlang_1.1.4        
##  [4] xfun_0.47           textshaping_0.4.0   jsonlite_1.8.8     
##  [7] glue_1.7.0          htmltools_0.5.8.1   ragg_1.3.3         
## [10] sass_0.4.9          rappdirs_0.3.3      rmarkdown_2.28     
## [13] evaluate_0.24.0     jquerylib_0.1.4     fastmap_1.2.0      
## [16] yaml_2.3.10         lifecycle_1.0.4     httr2_1.0.4        
## [19] bookdown_0.40       BiocManager_1.30.25 compiler_4.4.1     
## [22] fs_1.6.4            htmlwidgets_1.6.4   systemfonts_1.1.0  
## [25] digest_0.6.37       R6_2.5.1            curl_5.2.2         
## [28] magrittr_2.0.3      bslib_0.8.0         tools_4.4.1        
## [31] pkgdown_2.1.0       cachem_1.1.0        desc_1.4.3