Introduction

zircon implements a simple R client for interacting with an ArtifactDB REST API. It provides functions to download/upload files and metadata given the API’s URL. Developers can also inject authentication and caching mechanisms to build application-specific interfaces on top of zircon.

In this vignette, we’ll be demonstrating the functionality of zircon using our test ArtifactDB URL and a sample identifier.

library(zircon)
example.url
## [1] "https://gypsum-test.aaron-lun.workers.dev"
example.id
## [1] "test-public:blah.txt@base"

Identifier structure

The “ArtifactDB identifier” for a file takes the form of <PROJECT>:<PATH>@<VERSION>.

  • PROJECT defines the name of the project.
  • PATH defines a path to a file within the project, possibly in a subdirectory.
  • VERSION defines the version of the project.

We can process IDs using the unpackID and packID utilities.

unpacked <- unpackID(example.id)
unpacked
## $project
## [1] "test-public"
## 
## $path
## [1] "blah.txt"
## 
## $version
## [1] "base"
do.call(packID, unpacked)
## [1] "test-public:blah.txt@base"
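
To make the identifier layout concrete, the following sketch splits an identifier with a regular expression in base R. The parse_id() helper is hypothetical and is not how zircon implements unpackID(); it also assumes that the project name contains no colon and that the version contains no @ character.

```r
# Hypothetical helper for illustration; not part of zircon.
parse_id <- function(id) {
    # Project: everything before the first colon.
    # Version: everything after the last '@'.
    # Path: whatever lies in between, possibly with subdirectories.
    m <- regmatches(id, regexec("^([^:]+):(.+)@([^@]+)$", id))[[1]]
    if (length(m) != 4L) {
        stop("identifier should look like <PROJECT>:<PATH>@<VERSION>")
    }
    list(project = m[2], path = m[3], version = m[4])
}

parsed <- parse_id("test-public:foo/bar.txt@base")
parsed$path
```

In practice, unpackID() and packID() should be preferred, as they reflect whatever rules the package actually enforces.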

Getting files and their metadata

The getFile() function downloads a file from the API and returns the path to the local copy. It relies on download.file(), so larger files may require a longer timeout via options(timeout=...).

path <- getFile(example.id, url=example.url)
readLines(path)
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
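
One possible way to raise the timeout for a single download without changing it for the rest of the session is sketched below. The with_timeout() helper is a hypothetical name introduced here for illustration; it uses only base R and restores the previous setting when it exits.

```r
# Hypothetical helper; not part of zircon.
with_timeout <- function(seconds, expr) {
    old <- options(timeout = seconds)   # returns the previous value
    on.exit(options(old), add = TRUE)   # restore it on exit, even on error
    force(expr)                         # 'expr' is evaluated lazily, i.e., here
}

# Stand-in expression showing the timeout seen inside the call:
with_timeout(600, getOption("timeout"))

# With zircon loaded, a download could be wrapped in the same way:
# path <- with_timeout(600, getFile(example.id, url=example.url))
```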

Similarly, getFileMetadata() retrieves the file’s metadata.

meta <- getFileMetadata(example.id, url=example.url)
str(meta)
## List of 5
##  $ $schema     : chr "generic_file/v1.json"
##  $ generic_file:List of 1
##   ..$ format: chr "text"
##  $ md5sum      : chr "0eb827652a5c272e1c82002f1c972018"
##  $ path        : chr "blah.txt"
##  $ _extra      :List of 10
##   ..$ $schema      : chr "generic_file/v1.json"
##   ..$ id           : chr "test-public:blah.txt@base"
##   ..$ project_id   : chr "test-public"
##   ..$ version      : chr "base"
##   ..$ metapath     : chr "blah.txt"
##   ..$ meta_indexed : chr "2022-10-12T19:23:40.530Z"
##   ..$ meta_uploaded: chr "2022-10-12T19:23:17.912Z"
##   ..$ uploaded     : chr "2022-10-12T19:23:17.912Z"
##   ..$ uploader_name: chr "ArtifactDB-bot"
##   ..$ permissions  :List of 5
##   .. ..$ scope       : chr "project"
##   .. ..$ read_access : chr "public"
##   .. ..$ write_access: chr "owners"
##   .. ..$ owners      : chr "ArtifactDB-bot"
##   .. ..$ viewers     : list()

We can also retrieve the metadata for all files in a given project:

proj <- getProjectMetadata(example.project, url=example.url)
length(proj) # number of files
## [1] 6

Resolving the latest version

ArtifactDB APIs recognize the "latest" version alias, which will automatically redirect to the latest version of the project.

id2 <- packID(unpacked$project, unpacked$path, "latest")
meta2 <- getFileMetadata(id2, url=example.url)
str(meta2) # May or may not be the same as 'meta', depending on available versions.
## List of 5
##  $ $schema     : chr "generic_file/v1.json"
##  $ generic_file:List of 1
##   ..$ format: chr "text"
##  $ md5sum      : chr "0eb827652a5c272e1c82002f1c972018"
##  $ path        : chr "blah.txt"
##  $ _extra      :List of 10
##   ..$ $schema      : chr "generic_file/v1.json"
##   ..$ id           : chr "test-public:blah.txt@modified"
##   ..$ project_id   : chr "test-public"
##   ..$ version      : chr "modified"
##   ..$ metapath     : chr "blah.txt"
##   ..$ meta_indexed : chr "2022-10-12T20:07:25.980Z"
##   ..$ meta_uploaded: chr "2022-10-12T20:07:10.616Z"
##   ..$ uploaded     : chr "2022-10-12T20:07:10.616Z"
##   ..$ uploader_name: chr "ArtifactDB-bot"
##   ..$ permissions  :List of 5
##   .. ..$ scope       : chr "project"
##   .. ..$ read_access : chr "public"
##   .. ..$ write_access: chr "owners"
##   .. ..$ owners      : chr "ArtifactDB-bot"
##   .. ..$ viewers     : list()

While convenient for users, this is not amenable to reproducibility as the redirection can change upon creation of a new version of the project. Developers may wish to resolve the "latest" alias to ensure that the actual version is recorded in their applications.

resolveLatestID(id2, example.url)
## [1] "test-public:blah.txt@modified"
resolveLatestVersion(unpacked$project, example.url)
## [1] "modified"
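
One way to apply this in an application is to pin identifiers before storing them. The sketch below is illustrative only: pin_id() and fake_resolver() are hypothetical names, and the stand-in resolver takes the place of a real call such as function(id) resolveLatestID(id, example.url) so that the example runs without network access.

```r
# Hypothetical helper: only contact the API when the alias is actually used.
pin_id <- function(id, resolver) {
    if (grepl("@latest$", id)) {
        resolver(id)
    } else {
        id
    }
}

# Stand-in for a real resolver; it pretends that "modified" is the latest version.
fake_resolver <- function(id) sub("@latest$", "@modified", id)

pin_id("test-public:blah.txt@latest", fake_resolver)
pin_id("test-public:blah.txt@base", fake_resolver)
```

Identifiers that already name an explicit version pass through unchanged, so the helper can be applied to every identifier before it is recorded.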

Checking permissions

To see the permissions on a project:

getPermissions(unpacked$project, example.url)
## $scope
## [1] "project"
## 
## $read_access
## [1] "public"
## 
## $write_access
## [1] "owners"
## 
## $owners
## [1] "ArtifactDB-bot"
## 
## $viewers
## character(0)

Project owners may change the permissions with setPermissions(), e.g.

# Set to private:
setPermissions(unpacked$project, example.url, public=FALSE)

# Add viewers:
setPermissions(unpacked$project, example.url, viewers="LTLA", action="append")

# Remove viewers:
setPermissions(unpacked$project, example.url, viewers="LTLA", action="remove")

The test API uses GitHub user names and tokens for authentication, though other deployments may use different mechanisms.

Uploading new projects

With appropriate authorization, users can upload their own projects to an ArtifactDB API. This assumes that file artifacts have been generated with the appropriate JSON metadata. Check out ArtifactDB/BiocObjectSchemas for examples of file/metadata uploads that are recognized by the test API; of course, an API maintainer may choose to use other schemas, so users should be aware of the target API.

The upload process assumes that all to-be-uploaded files are available within a “staging” directory. This directory corresponds to a single version of the project; the PATH in the ArtifactDB identifier for each file is defined relative to the root of this directory. Given the directory path, the simplest upload code will look like:

# Not evaluated as this vignette builder is not appropriately authorized.
uploadProject(dir, example.url, project="test-zircon-upload", version="foo")
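
To illustrate how paths inside the staging directory map to identifiers, here is a small base R sketch. The directory contents and file names are made up for this example, and the JSON metadata files that a real upload would need are deliberately omitted.

```r
# Build a throwaway staging directory with one file in a subdirectory.
staging <- tempfile("staging-")
dir.create(file.path(staging, "results"), recursive = TRUE)
writeLines(LETTERS, file.path(staging, "blah.txt"))
writeLines("x,y", file.path(staging, "results", "table.csv"))

# Paths relative to the staging root become the PATH component.
rel <- list.files(staging, recursive = TRUE)

# Identifiers that these files would receive for the given project and version.
ids <- paste0("test-zircon-upload", ":", rel, "@", "foo")
ids
```

With zircon loaded, packID() could be used in place of paste0() to assemble the identifiers.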

For finer control of the upload process, developers may prefer the following sequence of commands:

# Not evaluated.
f <- list.files(dir, recursive=TRUE)
start.url <- createUploadStartUrl(example.url, "test-zircon-upload", "bar")
info <- initializeUpload(dir, f, start.url)
parsed <- httr::content(info)
uploadFiles(dir, example.url, parsed)
completeUpload(example.url, parsed)

Some of the more interesting upload options include:

  • expires: specify an expiry date for uploaded project versions, after which they will be automatically deleted from storage. This is useful for testing with dummy projects.
  • dedup.md5: hint to the ArtifactDB backend that some files might be redundant with their counterparts in a previous version of the project. If they have the same path and MD5 checksum, the backend will create a link in the new version rather than requiring a fresh upload.
  • dedup.link: instruct the ArtifactDB backend that a file is the same as another file that is already in storage (possibly from a different project). The backend will then create a link in the current project.
  • permissions: set the permissions for the project, for new projects with no previous versions (unless overwrite.permissions=TRUE).
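
To get a feel for which files might benefit from MD5-based deduplication, the "same path and same checksum" comparison can be mimicked locally. The sketch below uses two throwaway directories as stand-ins for consecutive versions; it only approximates the idea on the client side, as the actual check is performed by the backend.

```r
# Two throwaway directories standing in for consecutive project versions.
old.dir <- tempfile("v1-")
new.dir <- tempfile("v2-")
dir.create(old.dir)
dir.create(new.dir)
writeLines(LETTERS, file.path(old.dir, "blah.txt"))
writeLines(LETTERS, file.path(new.dir, "blah.txt"))   # unchanged between versions
writeLines("a", file.path(old.dir, "notes.txt"))
writeLines("b", file.path(new.dir, "notes.txt"))      # modified in the new version

# Only files present at the same relative path in both versions are candidates.
shared <- intersect(list.files(old.dir, recursive = TRUE),
                    list.files(new.dir, recursive = TRUE))
old.md5 <- unname(tools::md5sum(file.path(old.dir, shared)))
new.md5 <- unname(tools::md5sum(file.path(new.dir, shared)))

# Files with identical checksums could be linked instead of uploaded again.
unchanged <- shared[old.md5 == new.md5]
unchanged
```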

Caching API responses

Some of the getter functions can perform caching if an appropriate function is supplied to cache=. This function accepts a key string identifying the resource and a save function that downloads the resource to a given path. It should check whether the key already exists in the cache: if it does, the path to the cached file can be returned directly; otherwise, the resource should be downloaded from the API by running save().

tmp.cache <- file.path(tempdir(), "zircon-cache")
dir.create(tmp.cache)
cache.fun <- function(key, save) {
    path <- file.path(tmp.cache, URLencode(key, reserved=TRUE, repeated=TRUE))
    if (!file.exists(path)) {
        save(path)
    } else {
        cat("cache hit!\n")
    }
    path
}

getFile(example.id, example.url, cache = cache.fun)
## [1] "/tmp/RtmpPj1Rpj/zircon-cache/https%3A%2F%2Fgypsum-test.aaron-lun.workers.dev%2Ffiles%2Ftest-public%253Ablah.txt%2540base"
# cache hit (for the file and its metadata), same file path reported:
getFile(example.id, example.url, cache = cache.fun) 
## cache hit!
## cache hit!
## [1] "/tmp/RtmpPj1Rpj/zircon-cache/https%3A%2F%2Fgypsum-test.aaron-lun.workers.dev%2Ffiles%2Ftest-public%253Ablah.txt%2540base"

Note that this mechanism will automatically resolve any "latest" aliases in the supplied identifiers. This avoids problems with stale caches after updating the project versions.

Users are strongly advised to use some kind of caching to reduce bandwidth and data egress costs. Check out the biocCache() function for one possible caching mechanism based on BiocFileCache.
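
A caching function can be checked in isolation before it is passed to cache=. The sketch below follows the contract described above (a key string and a save function that writes to a path) but replaces the download with a hypothetical fake_save() that just writes a dummy payload. The key and all names here are made up for illustration.

```r
cache.dir <- tempfile("demo-cache-")
dir.create(cache.dir)

# Same contract as a function passed to cache=: return a path, downloading if needed.
demo.cache <- function(key, save) {
    path <- file.path(cache.dir, URLencode(key, reserved = TRUE, repeated = TRUE))
    if (!file.exists(path)) {
        save(path)
    }
    path
}

# Stand-in for the download step; counts how often it is invoked.
calls <- 0L
fake_save <- function(path) {
    calls <<- calls + 1L
    writeLines("payload", path)
}

key <- "https://example.com/files/project:file.txt@v1"
p1 <- demo.cache(key, fake_save)
p2 <- demo.cache(key, fake_save)   # served from the cache, fake_save() is not rerun
identical(p1, p2)
calls
```

Encoding the key with URLencode() keeps the whole cache in a flat directory, since characters such as slashes and colons never reach the file system.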

Customizing the authentication

Users can globally specify the authentication procedure through the identityAvailable() and identityHeaders() functions. To demonstrate, we’ll be using GitHub personal access tokens to authenticate users based on GitHub’s REST API. (Note that this is already implemented using the useGitHubIdentities() function, but we present a simplified version below for pedagogical purposes.)

.fetch_token_from_cache <- function(exists.only) {
    dir <- tools::R_user_dir("my_app_name")

    # Storing token in plaintext, which is probably fine. If you want it to be
    # more secure, that's up to you.
    token.path <- file.path(dir, "token.txt")

    if (file.exists(token.path)) {
        if (exists.only) {
            return(TRUE)
        } else {
            # A more advanced approach might also double-check that the cached
            # token is still valid before returning it.
            return(readLines(token.path))
        }
    }

    if (exists.only) {
        return(FALSE)
    }

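    msg <- "Generate a GitHub personal access token at https://github.com/settings/tokens"
    token <- readline(paste0(msg, "\nTOKEN: ")) # Use something like getPass for masking.
    # Create the user directory on first use, otherwise the write below fails.
    dir.create(dir, recursive=TRUE, showWarnings=FALSE)
    writeLines(token, con=token.path)
    token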
}

We first set the identityAvailable() function, which indicates whether any identification information is available. This is used to check whether an initial API request should be attempted without any authentication, e.g., for public resources, which avoids burdening the user with unnecessary authentication work.

# .fetch_token_from_cache(TRUE) already returns TRUE or FALSE, so use it directly.
identityAvailable(function() .fetch_token_from_cache(TRUE))

We then set the identityHeaders() function, which generates the HTTP headers containing identity information. In this case, the test API uses the standard Authorization header to retrieve the token.

identityHeaders(function() list(Authorization=paste("Bearer", .fetch_token_from_cache(FALSE))))

These headers will now be used in various zircon functions to authenticate the user where necessary.
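
The interplay between the two functions can be sketched as follows. This is an assumption about the intended behaviour based on the description above, not zircon's actual implementation; request_with_identity() and all stand-in functions are hypothetical, and the 401/403 status codes are assumed to signal a rejected request.

```r
# Hypothetical sketch: 'do_request' takes a list of headers and returns a status.
request_with_identity <- function(do_request, available, headers) {
    if (!available()) {
        # No stored identity: try anonymously first, which suffices for public resources.
        res <- do_request(list())
        if (!res$status %in% c(401L, 403L)) {
            return(res)
        }
    }
    # Either an identity is already stored or the anonymous attempt was rejected.
    do_request(headers())
}

# Stand-ins for the API and for the identity functions.
public <- function(h) list(status = 200L)
private <- function(h) list(status = if (length(h) > 0) 200L else 401L)
asked <- 0L
headers <- function() {
    asked <<- asked + 1L   # counts how often the user would be prompted
    list(Authorization = "Bearer dummy-token")
}
no_identity <- function() FALSE

r1 <- request_with_identity(public, no_identity, headers)    # no prompt needed
r2 <- request_with_identity(private, no_identity, headers)   # falls back to headers()
asked
```

Under these assumptions, a user without a stored token is only asked to authenticate when the anonymous request is rejected.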

Recommendations for application developers

While it is possible to use zircon directly, it is often better to wrap zircon in another package that handles the configuration for a particular ArtifactDB instance. At the bare minimum, the url= argument can be set to some sensible default for a given instance; developers can also provide appropriate defaults for caching and authentication. This streamlines the end user experience and reduces the potential for errors (e.g., if the user specifies the wrong URL) or inefficiencies (e.g., if the user forgets to set cache=). Check out the calcite package for an example of a wrapper around zircon.

Session information

## R Under development (unstable) (2023-01-26 r83699)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] zircon_0.98.4    BiocStyle_2.27.1
## 
## loaded via a namespace (and not attached):
##  [1] vctrs_0.5.2         httr_1.4.4          cli_3.6.0          
##  [4] knitr_1.42          rlang_1.0.6         xfun_0.36          
##  [7] stringi_1.7.12      purrr_1.0.1         textshaping_0.3.6  
## [10] jsonlite_1.8.4      glue_1.6.2          rprojroot_2.0.3    
## [13] htmltools_0.5.4     ragg_1.2.5          sass_0.4.5         
## [16] rmarkdown_2.20      evaluate_0.20       jquerylib_0.1.4    
## [19] fastmap_1.1.0       lifecycle_1.0.3     yaml_2.3.7         
## [22] memoise_2.0.1       bookdown_0.32       BiocManager_1.30.19
## [25] stringr_1.5.0       compiler_4.3.0      fs_1.6.0           
## [28] systemfonts_1.0.4   digest_0.6.31       R6_2.5.1           
## [31] curl_5.0.0          magrittr_2.0.3      bslib_0.4.2        
## [34] tools_4.3.0         pkgdown_2.0.7       cachem_1.0.6       
## [37] desc_1.4.2