userguide.Rmd
zircon implements a simple R client for interacting with an ArtifactDB REST API. It provides functions to download/upload files and metadata given the API’s URL. Developers can also inject authentication and caching mechanisms to build application-specific interfaces on top of zircon.
In this vignette, we’ll be demonstrating the functionality of zircon using our test ArtifactDB URL and a sample identifier.
library(zircon)
example.url
## [1] "https://gypsum-test.aaron-lun.workers.dev"
example.id
## [1] "test-public:blah.txt@base"
The “ArtifactDB identifier” for a file takes the form of <PROJECT>:<PATH>@<VERSION>.
PROJECT defines the name of the project.
PATH defines a path to a file within the project, possibly in a subdirectory.
VERSION defines the version of the project.
We can process IDs using the unpackID and packID utilities.
unpacked <- unpackID(example.id)
unpacked
## $project
## [1] "test-public"
##
## $path
## [1] "blah.txt"
##
## $version
## [1] "base"
do.call(packID, unpacked)
## [1] "test-public:blah.txt@base"
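To make the identifier format concrete, the following base R sketch splits and reassembles an ID with a regular expression. It is not how zircon implements unpackID() or packID(); the helper name split_id is hypothetical, and the pattern assumes that project names contain no colon and that versions contain no "@".

```r
# Hypothetical helper, not part of zircon: split an ID of the form
# <PROJECT>:<PATH>@<VERSION> using a single regular expression.
split_id <- function(id) {
    pattern <- "^([^:]+):(.+)@([^@]+)$"
    if (!grepl(pattern, id)) {
        stop("'", id, "' is not of the form <PROJECT>:<PATH>@<VERSION>")
    }
    list(
        project = sub(pattern, "\\1", id),
        path = sub(pattern, "\\2", id),
        version = sub(pattern, "\\3", id)
    )
}

parts <- split_id("test-public:blah.txt@base")
str(parts)

# Reassemble the identifier from its components.
rebuilt <- sprintf("%s:%s@%s", parts$project, parts$path, parts$version)
identical(rebuilt, "test-public:blah.txt@base")
```

In practice, unpackID() and packID() should be preferred, as they follow whatever rules the package actually enforces.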
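The getFile() function does as it says, downloading the file from the API and returning a path to the local copy. Note that this uses download.file(), so setting options(timeout=...) may be required for larger files.
path <- getFile(example.id, url=example.url)
readLines(path)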
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
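One way to apply the timeout advice without permanently changing the session is to raise the option only for the duration of a single call. The sketch below uses base R only; the helper name with_timeout and the 600-second value are illustrative assumptions.

```r
# Hypothetical helper: evaluate 'expr' with a download timeout of at least
# 'seconds', then restore the previous setting even if an error occurs.
with_timeout <- function(seconds, expr) {
    old <- options(timeout = max(seconds, getOption("timeout")))
    on.exit(options(old), add = TRUE)
    force(expr) # 'expr' is evaluated lazily, i.e., after the option is set.
}

# Offline demonstration: report the timeout that is active inside the call.
inside <- with_timeout(600, getOption("timeout"))
inside

# With zircon, a large download could be wrapped in the same way, e.g.:
# path <- with_timeout(600, getFile(example.id, example.url))
```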
Similarly, getFileMetadata() retrieves the file’s metadata.
meta <- getFileMetadata(example.id, url=example.url)
str(meta)
## List of 5
## $ $schema : chr "generic_file/v1.json"
## $ generic_file:List of 1
## ..$ format: chr "text"
## $ md5sum : chr "0eb827652a5c272e1c82002f1c972018"
## $ path : chr "blah.txt"
## $ _extra :List of 10
## ..$ $schema : chr "generic_file/v1.json"
## ..$ id : chr "test-public:blah.txt@base"
## ..$ project_id : chr "test-public"
## ..$ version : chr "base"
## ..$ metapath : chr "blah.txt"
## ..$ meta_indexed : chr "2022-10-12T19:23:40.530Z"
## ..$ meta_uploaded: chr "2022-10-12T19:23:17.912Z"
## ..$ uploaded : chr "2022-10-12T19:23:17.912Z"
## ..$ uploader_name: chr "ArtifactDB-bot"
## ..$ permissions :List of 5
## .. ..$ scope : chr "project"
## .. ..$ read_access : chr "public"
## .. ..$ write_access: chr "owners"
## .. ..$ owners : chr "ArtifactDB-bot"
## .. ..$ viewers : list()
Similarly, we can retrieve the metadata for all files in a given project:
proj <- getProjectMetadata(example.project, url=example.url)
length(proj) # number of files
## [1] 6
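The project metadata can be condensed into a table for a quick overview. The sketch below assumes that each element of the returned list has the same structure as the output of getFileMetadata() shown earlier. To stay runnable offline, it operates on a small mock list (proj_mock) whose values are placeholders, not real API output.

```r
# Mock stand-in for the output of getProjectMetadata(); the entries are
# illustrative placeholders that mimic the structure of 'meta' above.
proj_mock <- list(
    list(`$schema` = "generic_file/v1.json", path = "blah.txt",
         `_extra` = list(version = "base")),
    list(`$schema` = "generic_file/v1.json", path = "foo/bar.txt",
         `_extra` = list(version = "base"))
)

# Collect one row per file, with the path, schema and project version.
file_summary <- data.frame(
    path = vapply(proj_mock, function(x) x$path, character(1)),
    schema = vapply(proj_mock, function(x) x[["$schema"]], character(1)),
    version = vapply(proj_mock, function(x) x[["_extra"]]$version, character(1)),
    stringsAsFactors = FALSE
)
file_summary
```

Replacing proj_mock with the real proj object should give one row per file in the project, provided the structural assumption holds.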
ArtifactDB APIs recognize the "latest" version alias, which will automatically redirect to the latest version of the project.
id2 <- packID(unpacked$project, unpacked$path, "latest")
meta2 <- getFileMetadata(id2, url=example.url)
str(meta2) # May or may not be the same as 'meta', depending on available versions.
## List of 5
## $ $schema : chr "generic_file/v1.json"
## $ generic_file:List of 1
## ..$ format: chr "text"
## $ md5sum : chr "0eb827652a5c272e1c82002f1c972018"
## $ path : chr "blah.txt"
## $ _extra :List of 10
## ..$ $schema : chr "generic_file/v1.json"
## ..$ id : chr "test-public:blah.txt@modified"
## ..$ project_id : chr "test-public"
## ..$ version : chr "modified"
## ..$ metapath : chr "blah.txt"
## ..$ meta_indexed : chr "2022-10-12T20:07:25.980Z"
## ..$ meta_uploaded: chr "2022-10-12T20:07:10.616Z"
## ..$ uploaded : chr "2022-10-12T20:07:10.616Z"
## ..$ uploader_name: chr "ArtifactDB-bot"
## ..$ permissions :List of 5
## .. ..$ scope : chr "project"
## .. ..$ read_access : chr "public"
## .. ..$ write_access: chr "owners"
## .. ..$ owners : chr "ArtifactDB-bot"
## .. ..$ viewers : list()
While convenient for users, the alias hinders reproducibility because it will point to a different version as soon as a new version of the project is created. Developers may wish to resolve the "latest" alias to ensure that the actual version is recorded in their applications.
resolveLatestID(id2, example.url)
## [1] "test-public:blah.txt@modified"
resolveLatestVersion(unpacked$project, example.url)
## [1] "modified"
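An application might apply this resolution once, at the point where an identifier is stored. The sketch below shows that pattern with a hypothetical helper, pin_id, that accepts the resolver as an argument. A stand-in resolver is used so that the example runs offline; with zircon, the resolver would instead call resolveLatestID() as shown above.

```r
# Hypothetical helper: swap the "latest" alias for a concrete version so that
# a stored identifier keeps pointing at the same files later on.
pin_id <- function(id, resolver) {
    if (grepl("@latest$", id)) {
        resolver(id)
    } else {
        id
    }
}

# Stand-in resolver for an offline demonstration. With zircon, one would pass
# function(id) resolveLatestID(id, example.url) instead.
fake_resolver <- function(id) sub("@latest$", "@modified", id)

pinned <- pin_id("test-public:blah.txt@latest", fake_resolver)
unchanged <- pin_id("test-public:blah.txt@base", fake_resolver)
pinned
unchanged
```

Identifiers that already name a concrete version are passed through untouched, so the helper avoids an API request in that case.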
To see the permissions on a project:
getPermissions(unpacked$project, example.url)
## $scope
## [1] "project"
##
## $read_access
## [1] "public"
##
## $write_access
## [1] "owners"
##
## $owners
## [1] "ArtifactDB-bot"
##
## $viewers
## character(0)
Project owners may change the permissions with setPermissions(), e.g.
# Set to private:
setPermissions(unpacked$project, example.url, public=FALSE)
# Add viewers:
setPermissions(unpacked$project, example.url, viewers="LTLA", action="append")
# Remove viewers:
setPermissions(unpacked$project, example.url, viewers="LTLA", action="remove")
The test API uses GitHub user names and tokens for authentication, though other deployments may use different mechanisms.
With appropriate authorization, users can upload their own projects
to an ArtifactDB API. This assumes that file artifacts have been
generated with the appropriate JSON metadata. Check out ArtifactDB/BiocObjectSchemas
for examples of file/metadata uploads that are recognized by the test
API; of course, an API maintainer may choose to use other schemas, so
users should be aware of the target API.
The upload process assumes that all to-be-uploaded files are available within a “staging” directory. This directory corresponds to a single version of the project; the PATH in the ArtifactDB identifier for each file is defined relative to the root of this directory. Given the directory path, the simplest upload code will look like:
# Not evaluated as this vignette builder is not appropriately authorized.
uploadProject(dir, example.url, project="test-zircon-upload", version="foo")
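To illustrate how paths in the staging directory map to identifiers, the following sketch builds a throwaway staging directory and derives the ID that each file would receive. It uses base R only and skips the JSON metadata that a real upload needs. The file names are made up, and the project and version simply reuse the values from the call above.

```r
# Create a temporary staging directory with one file at the root and one
# file in a subdirectory. All file names here are illustrative.
staging <- tempfile("zircon-staging-")
dir.create(file.path(staging, "results"), recursive = TRUE)
writeLines(LETTERS, file.path(staging, "blah.txt"))
writeLines("x,y", file.path(staging, "results", "table.csv"))

# Paths relative to the staging root become the PATH component of each ID.
rel.paths <- list.files(staging, recursive = TRUE)
ids <- sprintf("%s:%s@%s", "test-zircon-upload", rel.paths, "foo")
ids
```

Files in subdirectories keep their relative path, so results/table.csv ends up with results/table.csv as its PATH.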
For finer control of the upload process, developers may prefer the following sequence of commands:
# Not evaluated.
f <- list.files(dir, recursive=TRUE)
start.url <- createUploadStartUrl(example.url, "test-zircon-upload", "bar")
info <- initializeUpload(dir, f, start.url)
parsed <- httr::content(info)
uploadFiles(dir, example.url, parsed)
completeUpload(example.url, parsed)
Some of the more interesting upload options include:
expires: specify an expiry date for uploaded project versions, after which they will be automatically deleted from storage. This is useful for testing with dummy projects.
dedup.md5: hint to the ArtifactDB backend that some files might be redundant with their counterparts in a previous version of the project. If they have the same path and MD5 checksum, the backend will create a link in the new version rather than requiring a fresh upload.
dedup.link: instruct the ArtifactDB backend that a file is the same as another file that is already in storage (possibly from a different project). The backend will then create a link in the current project.
permissions: set the permissions for the project, for new projects with no previous versions (unless overwrite.permissions=TRUE).
Some of the getter functions can perform caching by providing an appropriate function to cache=. This function accepts a key string identifying the resource and a save function that downloads the resource to a given path. It should check if the key already exists in the cache: if it does, the path to the cached file can be returned directly; otherwise, the resource should be downloaded from the API by running save() on the destination path, which is then returned.
tmp.cache <- file.path(tempdir(), "zircon-cache")
dir.create(tmp.cache)
cache.fun <- function(key, save) {
    path <- file.path(tmp.cache, URLencode(key, reserved=TRUE, repeated=TRUE))
    if (!file.exists(path)) {
        save(path)
    } else {
        cat("cache hit!\n")
    }
    path
}
getFile(example.id, example.url, cache = cache.fun)
## [1] "/tmp/RtmpPj1Rpj/zircon-cache/https%3A%2F%2Fgypsum-test.aaron-lun.workers.dev%2Ffiles%2Ftest-public%253Ablah.txt%2540base"
# cache hit (for the file and its metadata), same file path reported:
getFile(example.id, example.url, cache = cache.fun)
## cache hit!
## cache hit!
## [1] "/tmp/RtmpPj1Rpj/zircon-cache/https%3A%2F%2Fgypsum-test.aaron-lun.workers.dev%2Ffiles%2Ftest-public%253Ablah.txt%2540base"
Note that this mechanism will automatically resolve any latest aliases in the supplied identifiers. This avoids problems with stale caches after updating the project versions.
Users are strongly advised to use some kind of caching to reduce bandwidth and data egress costs. Check out the biocCache() function for one possible caching mechanism based on BiocFileCache.
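The contract of a cache= function (take a key and a save function, return a path) can be checked offline before pointing it at a real API. The sketch below builds a cache function with a hypothetical factory, make_cache, and calls it twice with a stand-in save function that counts how often a "download" happens. The key is a made-up URL.

```r
# Hypothetical factory: returns a cache function that stores files in 'dir'.
make_cache <- function(dir) {
    dir.create(dir, recursive = TRUE, showWarnings = FALSE)
    function(key, save) {
        path <- file.path(dir, URLencode(key, reserved = TRUE, repeated = TRUE))
        if (!file.exists(path)) {
            save(path) # only invoked on a cache miss
        }
        path
    }
}

# Stand-in for the download step; it writes a file and counts its calls.
downloads <- 0L
fake_save <- function(path) {
    downloads <<- downloads + 1L
    writeLines(LETTERS, path)
}

demo.cache <- make_cache(tempfile("zircon-cache-demo-"))
first <- demo.cache("https://example.com/files/demo", fake_save)
second <- demo.cache("https://example.com/files/demo", fake_save)

identical(first, second) # same path on both calls
downloads                # number of times fake_save() was run
```

A function that passes this check can then be supplied as cache= in the same way as cache.fun above.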
Users can globally specify the authentication procedure through the identityAvailable() and identityHeaders() functions. To demonstrate, we’ll be using GitHub personal access tokens to authenticate users based on GitHub’s REST API. (Note that this is already implemented using the useGitHubIdentities() function, but we present a simplified version below for pedagogical purposes.)
.fetch_token_from_cache <- function(exists.only) {
    dir <- tools::R_user_dir("my_app_name")
    # Storing token in plaintext, which is probably fine. If you want it to be
    # more secure, that's up to you.
    token.path <- file.path(dir, "token.txt")
    if (file.exists(token.path)) {
        if (exists.only) {
            return(TRUE)
        } else {
            # A more advanced approach might also double-check that the cached
            # token is still valid before returning it.
            return(readLines(token.path))
        }
    }
    if (exists.only) {
        return(FALSE)
    }
    msg <- "Generate a GitHub personal access token at https://github.com/settings/tokens"
    token <- readline(paste0(msg, "\nTOKEN: ")) # Use something like getPass for masking.
    # Create the user directory on first use, otherwise writeLines() will fail.
    dir.create(dir, recursive=TRUE, showWarnings=FALSE)
    writeLines(token, con=token.path)
    token
}
We first set the identityAvailable() function, which indicates whether any identification information is available. This is used to check whether an initial API request should be attempted without any authentication, e.g., for public resources, which avoids burdening the user with unnecessary authentication work.
# With exists.only=TRUE, the helper already returns TRUE or FALSE.
identityAvailable(function() .fetch_token_from_cache(TRUE))
We then set the identityHeaders() function, which generates the HTTP headers containing identity information. In this case, the test API uses the standard Authorization header to retrieve the token.
identityHeaders(function() list(Authorization=paste("Bearer", .fetch_token_from_cache(FALSE))))
These headers will now be used in various zircon functions to authenticate the user where necessary.
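For non-interactive use (e.g., continuous integration), the token might instead come from an environment variable. The sketch below shows that variant with two hypothetical helpers, token_from_env and make_bearer_headers. The variable name MY_APP_GITHUB_TOKEN is an assumption chosen for illustration, and the zircon calls are left as comments so that the block runs without the package.

```r
# Hypothetical helper: read a token from an environment variable, returning
# NULL if the variable is unset or empty.
token_from_env <- function(var = "MY_APP_GITHUB_TOKEN") {
    value <- Sys.getenv(var, unset = NA_character_)
    if (is.na(value) || !nzchar(value)) NULL else value
}

# Hypothetical helper: build the Authorization header for a bearer token.
make_bearer_headers <- function(token) {
    if (!is.character(token) || length(token) != 1L || !nzchar(token)) {
        stop("'token' should be a non-empty string")
    }
    list(Authorization = paste("Bearer", token))
}

# Offline demonstration with a dummy token.
Sys.setenv(MY_APP_GITHUB_TOKEN = "dummy-token")
headers <- make_bearer_headers(token_from_env())
Sys.unsetenv("MY_APP_GITHUB_TOKEN")
missing.token <- token_from_env()

# These helpers could then be registered in the same way as above:
# identityAvailable(function() !is.null(token_from_env()))
# identityHeaders(function() make_bearer_headers(token_from_env()))
```

Here the is.null() check is appropriate because token_from_env() returns NULL, rather than FALSE, when no token is found.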
While it is possible to use zircon directly, it is often better to wrap zircon in another package that handles the configuration for a particular ArtifactDB instance. At the bare minimum, the url= can be set to some sensible default for a given instance; developers can also provide appropriate defaults for caching and authentication. This streamlines the end user experience and reduces the potential for errors (e.g., if the user specifies the wrong URL) or inefficiencies (e.g., if the user forgets to set cache=).
Check out the calcite package for an example of a wrapper around zircon.
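A minimal sketch of such a wrapper is shown below. The names default_url, default_cache and fetchFile are hypothetical and do not come from zircon or calcite. The URL is the test instance used throughout this vignette, and the cache lives in a temporary directory purely for illustration; a real package would likely use a persistent location.

```r
# Hypothetical defaults for one specific ArtifactDB instance.
default_url <- "https://gypsum-test.aaron-lun.workers.dev"

default_cache <- function(key, save) {
    dir <- file.path(tempdir(), "my-wrapper-cache")
    dir.create(dir, recursive = TRUE, showWarnings = FALSE)
    path <- file.path(dir, URLencode(key, reserved = TRUE, repeated = TRUE))
    if (!file.exists(path)) {
        save(path)
    }
    path
}

# Users only supply the identifier; the URL and cache are filled in for them.
# The body is only evaluated when the function is called, so zircon needs to
# be installed at call time but not at definition time.
fetchFile <- function(id, url = default_url, cache = default_cache) {
    zircon::getFile(id, url, cache = cache)
}
```

With zircon installed and network access available, fetchFile("test-public:blah.txt@base") would then behave like the cached getFile() call shown earlier, without the user having to specify url= or cache=.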
## R Under development (unstable) (2023-01-26 r83699)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] zircon_0.98.4 BiocStyle_2.27.1
##
## loaded via a namespace (and not attached):
## [1] vctrs_0.5.2 httr_1.4.4 cli_3.6.0
## [4] knitr_1.42 rlang_1.0.6 xfun_0.36
## [7] stringi_1.7.12 purrr_1.0.1 textshaping_0.3.6
## [10] jsonlite_1.8.4 glue_1.6.2 rprojroot_2.0.3
## [13] htmltools_0.5.4 ragg_1.2.5 sass_0.4.5
## [16] rmarkdown_2.20 evaluate_0.20 jquerylib_0.1.4
## [19] fastmap_1.1.0 lifecycle_1.0.3 yaml_2.3.7
## [22] memoise_2.0.1 bookdown_0.32 BiocManager_1.30.19
## [25] stringr_1.5.0 compiler_4.3.0 fs_1.6.0
## [28] systemfonts_1.0.4 digest_0.6.31 R6_2.5.1
## [31] curl_5.0.0 magrittr_2.0.3 bslib_0.4.2
## [34] tools_4.3.0 pkgdown_2.0.7 cachem_1.0.6
## [37] desc_1.4.2