Utilities for artifact upload

Utilities for uploading files to an ArtifactDB instance, used internally by uploadProject.

initializeUpload(
  dir,
  files,
  start.url,
  auto.dedup.md5 = FALSE,
  dedup.md5 = NULL,
  md5.field = "md5sum",
  dedup.link = NULL,
  expires = NULL,
  user.agent = NULL,
  override.key = NULL
)

createUploadStartURL(url, project, version)

uploadFiles(
  dir,
  url,
  initial,
  user.agent = NULL,
  attempts = 3,
  verbose = FALSE
)

completeUpload(
  url,
  initial,
  index.wait = 600,
  must.index = FALSE,
  permissions = list(),
  overwrite.permissions = FALSE,
  user.agent = NULL
)

abortUpload(url, initial, user.agent = NULL)

Arguments

dir: String containing the path to a project directory on the file system, containing files to be uploaded.
files: Character vector of paths to files to be uploaded within dir. These should be relative to dir.
start.url: String containing the URL to the upload endpoint.
auto.dedup.md5: Logical scalar indicating whether to check files for duplicates across versions based on their MD5 checksums, see Details.
dedup.md5: Named character vector containing the MD5 sums of files that are potentially duplicated in the ArtifactDB backend. Each name in the vector should be a relative path to a file in dir, while each value of the vector should be its MD5 sum. Names should not overlap with entries in files.
md5.field: String specifying the metadata field containing the MD5 checksums. If NULL, it is assumed that no such metadata field exists.
dedup.link: Named character vector specifying which files are duplicates of resources in existing ArtifactDB IDs. Each name in the vector should be a relative path to a file in dir, while each value of the vector should be the existing ArtifactDB ID. Names should not overlap with entries in files.
expires: Integer scalar specifying the number of days before the files expire and are removed from the ArtifactDB instance. By default, the files have no expiry date.
user.agent: String containing a user agent string. If NULL, a default user agent is used.
override.key: String containing an override key that allows uploads regardless of the (authenticated) user.
url: String containing the URL to the REST API.
project: String containing the project name.
version: String containing the version.
initial: The response object returned by initializeUpload, or a list generated by calling content on the response.
attempts: Integer scalar specifying the number of upload attempts on each file, in case of connection drops or timeouts. Each attempt is followed by a 1 minute wait.
verbose: Logical scalar indicating whether to report upload progress.
index.wait: Numeric scalar specifying the number of seconds to wait when checking for correct indexing.
must.index: Logical scalar indicating whether to throw an error if the indexing is not successful. If FALSE, a warning is raised if the indexing times out (but an error will still be raised if the indexing explicitly fails).
permissions: A list containing permission information, see getPermissions. This usually contains the owners and viewers character vectors.
overwrite.permissions: Logical scalar indicating whether existing permissions should be overwritten.

Value

initializeUpload will return the response object from hitting the upload endpoint. This will have already been checked for failure. If parsed as JSON, this produces a list with either a presigned URL or a link URL for each file, depending on whether deduplication was requested and successful.

uploadFiles will upload all files to their URLs and return NULL.

completeUpload will return the response object from hitting the completion endpoint. This will have already been checked for failure.

abortUpload will return the response object from hitting the abort endpoint. This will not have been checked for failure, not least because this function is typically called in response to other errors during the upload; we leave it to the caller to decide whether to add another error to the trace.
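
The error-handling pattern described for the abort step can be sketched as follows. This is a minimal illustration rather than the package's own implementation: uploadWithAbort is a hypothetical helper, and the four steps are passed in as functions so that the control flow can be run without a live ArtifactDB instance.

```r
# Hypothetical wrapper, not part of the package. Each argument is a function
# standing in for one step of the upload sequence.
uploadWithAbort <- function(init, upload, complete, abort) {
    initial <- init()
    tryCatch({
        upload(initial)
        complete(initial)
    }, error=function(e) {
        # Best-effort cleanup. A failure to abort is downgraded to a warning
        # so that the original error is the one reported to the caller.
        tryCatch(abort(initial), error=function(e2) {
            warning("abort request failed: ", conditionMessage(e2))
        })
        stop(e)
    })
}

# Exercising the control flow with stand-in functions that simulate a failure.
aborted <- FALSE
res <- try(uploadWithAbort(
    init=function() "mock-initial",
    upload=function(x) stop("connection dropped"),
    complete=function(x) "done",
    abort=function(x) aborted <<- TRUE
), silent=TRUE)
```

In real use, the stand-ins would presumably wrap initializeUpload, uploadFiles, completeUpload and abortUpload with the appropriate dir, url and start.url values.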

createUploadStartURL will return a string containing the “standard” upload URL, to be used as start.url.

Details

Use of these utilities will almost always require appropriate authentication/authorization with the target API. Developers should ensure that identityHeaders and friends are set accordingly.

Setting expires in initializeUpload is useful for testing the upload procedure without creating a permanent copy of the files. Project versions created with expires will be automatically removed after the expiry interval, freeing up space for real data.

Metadata documents versus file artifacts

All files with names ending in .json are assumed to be metadata documents in JSON format. File artifacts should not have names ending with .json; any JSON-formatted artifacts should be renamed to avoid confusion.

For each metadata document, the JSON object should contain a path property that specifies the file artifact corresponding to this metadata. In most cases, the file artifact should have a path equal to that of the metadata document after stripping the .json extension. The only exception is when the path points to the metadata document itself, for certain “metadata-only” artifacts that have no contents.

Any metadata document that specifies a separate file artifact should also contain the MD5 checksum of that artifact in the field named by md5.field. This is used for integrity checks during upload and for potential deduplication (see below).
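
The naming convention above can be checked before upload with a few lines of base R. This is only a sketch on a hypothetical file listing; it inspects file names and does not read the path property inside each JSON document.

```r
# Hypothetical listing of relative paths in a project directory.
files <- c("blah.txt", "blah.txt.json", "foo/bar.txt", "foo/bar.txt.json",
    "notes.json")

# Anything ending in ".json" is treated as a metadata document.
is.meta <- endsWith(files, ".json")
metadata <- files[is.meta]
artifacts <- files[!is.meta]

# Expected artifact path for each metadata document, by stripping ".json".
expected <- sub("\\.json$", "", metadata)

# Metadata documents without a matching artifact are candidates for
# "metadata-only" entries, where 'path' would point to the document itself.
metadata.only <- metadata[!(expected %in% artifacts)]
metadata.only
```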

Linking to duplicate files

ArtifactDB instances are capable of creating links to represent identical files, similar to symbolic links on a typical file system. This avoids the need to explicitly store a duplicate copy of a file in another project or version. It is particularly useful when dealing with new project versions where only a subset of files have changed.

The first deduplication mechanism is implicit and based on the MD5 checksums of the uploaded (non-metadata) files. When auto.dedup.md5=TRUE or dedup.md5 is supplied, the ArtifactDB backend will check the most recent previous version of the project (if any exists) for files with the same path and checksum. If such files are found, the backend will automatically create a link to that file in the previous version, avoiding the need to upload a new copy in uploadFiles. auto.dedup.md5 will extract/compute the MD5 checksums from files, while dedup.md5 allows users to specify the MD5 checksum per file manually if this is so desired.
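
For the manual route, a named vector suitable for dedup.md5 could be assembled as below. This is a sketch that assumes the checksums are computed with tools::md5sum from base R; the directory and the choice of "unchanged" files are made up for illustration.

```r
# Create a throwaway project directory with two file artifacts.
dir <- tempfile()
dir.create(file.path(dir, "foo"), recursive=TRUE)
writeLines("hello", file.path(dir, "blah.txt"))
writeLines("world", file.path(dir, "foo", "bar.txt"))

# Relative paths of the artifacts assumed to be unchanged since the
# previous version of the project.
unchanged <- c("blah.txt", "foo/bar.txt")

# tools::md5sum() names its output by the paths it was given, so the names
# are reset to the relative paths expected by 'dedup.md5'.
dedup.md5 <- tools::md5sum(file.path(dir, unchanged))
names(dedup.md5) <- unchanged
dedup.md5
```

The resulting vector would be passed as dedup.md5 to initializeUpload, with the same paths removed from files.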

The second mechanism is more explicit, relying on the uploader to supply the ArtifactDB identifier of the existing file to be linked to. This involves a bit more work on the part of the uploader but is more powerful than the MD5-based method as it can be used to link files across different projects. Linked files can either be specified directly via dedup.link or they can be detected from files as placeholder symlinks generated by createPlaceholderLink.

Developers can check whether links were successfully created by inspecting the content of the initializeUpload output. If the to-be-linked file appears in the links field, the link was created; otherwise, the file must be uploaded via the presigned_urls. This is useful for double-checking that the MD5 sums were correctly recognized by the API.
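
Such a check might look like the following. The parsed response is mocked here, and the per-entry filename and url fields are assumptions made for this sketch; only the links and presigned_urls field names come from the description above. listedFiles is a hypothetical helper.

```r
# Mock of a parsed initializeUpload response. The structure of each entry
# ('filename' and 'url') is an assumption for illustration.
parsed <- list(
    presigned_urls=list(
        list(filename="blah.txt.json", url="https://example.com/upload/1"),
        list(filename="whee.txt", url="https://example.com/upload/2")
    ),
    links=list(
        list(filename="blah.txt", url="https://example.com/link/1")
    )
)

# Hypothetical helper to pull out the file names listed in one field.
listedFiles <- function(entries) {
    vapply(entries, function(x) x$filename, "")
}

# Files for which deduplication was requested.
requested <- c("blah.txt", "whee.txt")

# TRUE if the backend created a link, FALSE if the file must be uploaded.
linked <- requested %in% listedFiles(parsed$links)
names(linked) <- requested
linked
```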

Deduplication can only be performed for file artifacts, not for metadata documents ending with .json. Any attempt to deduplicate metadata will cause an error.
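
The constraints on deduplication targets (no metadata documents, no overlap with files) can be verified locally before contacting the API. checkDedupTargets below is a hypothetical pre-flight helper written for this sketch, not a function provided by the package.

```r
# Hypothetical pre-flight check; 'dedup' is a named vector such as
# 'dedup.md5' or 'dedup.link', and 'files' is the vector of paths to upload.
checkDedupTargets <- function(files, dedup) {
    if (is.null(dedup)) {
        return(invisible(TRUE))
    }
    targets <- names(dedup)
    if (is.null(targets) || anyNA(targets) || any(targets == "")) {
        stop("deduplication targets must be named by relative path")
    }
    if (any(endsWith(targets, ".json"))) {
        stop("metadata documents cannot be deduplicated")
    }
    overlap <- intersect(targets, files)
    if (length(overlap)) {
        stop("paths present in both 'files' and the deduplication targets: ",
            paste(overlap, collapse=", "))
    }
    invisible(TRUE)
}

# The metadata for 'blah.txt' stays in 'files'; only the artifact is linked.
files <- c("blah.txt.json", "whee.txt", "whee.txt.json")
checkDedupTargets(files, c(blah.txt="placeholder-checksum"))
```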

Author

Aaron Lun

Examples

# Creating a mock project.
src <- system.file("scripts", "mock.R", package="zircon")
source(src)
tmp <- tempfile()
createMockProject(tmp)

f <- list.files(tmp, recursive=TRUE)
f
#> [1] "blah.txt"         "blah.txt.json"    "foo/bar.txt"      "foo/bar.txt.json"
#> [5] "whee.txt"         "whee.txt.json"   
start.url <- createUploadStartURL(example.url, "test-zircon-upload", "base2")
start.url
#> [1] "https://gypsum-test.aaron-lun.workers.dev/projects/test-zircon-upload/version/base2/upload"

# Basic upload sequence, for testing:
if (FALSE) {
init <- initializeUpload(tmp, f, start.url, expires=1)
parsed <- httr::content(init)
uploadFiles(tmp, example.url, parsed)
completeUpload(example.url, parsed)
}

# Demonstrating how to create links:
actual.files <- f[!endsWith(f, ".json")]
md5 <- digest::digest(file=file.path(tmp, actual.files[1]), algo="md5")
names(md5) <- actual.files[1]

explicit <- packID(example.project, actual.files[2], example.version)
names(explicit) <- actual.files[2]

if (FALSE) {
start.url <- createUploadStartURL(example.url, "test-zircon-upload", "relinked")
init <- initializeUpload(tmp, start.url=start.url, expires=1,
    files=setdiff(f, c(names(md5), names(explicit))),
    dedup.md5=md5, md5.field="md5sum", dedup.link=explicit)
parsed <- httr::content(init)
uploadFiles(tmp, example.url, parsed)
completeUpload(example.url, parsed)
}