upload-utils.Rd
Utilities for uploading files to an ArtifactDB instance, used internally by uploadProject.
initializeUpload(
dir,
files,
start.url,
auto.dedup.md5 = FALSE,
dedup.md5 = NULL,
md5.field = "md5sum",
dedup.link = NULL,
expires = NULL,
user.agent = NULL,
override.key = NULL
)
createUploadStartURL(url, project, version)
uploadFiles(
dir,
url,
initial,
user.agent = NULL,
attempts = 3,
verbose = FALSE
)
completeUpload(
url,
initial,
index.wait = 600,
must.index = FALSE,
permissions = list(),
overwrite.permissions = FALSE,
user.agent = NULL
)
abortUpload(url, initial, user.agent = NULL)
dir
String containing the path to a project directory on the file system, containing files to be uploaded.
files
Character vector of paths to files to be uploaded within dir. These should be relative to dir.
start.url
String containing the URL to the upload endpoint.
auto.dedup.md5
Logical scalar indicating whether to check files for duplicates across versions based on their MD5 checksums, see Details.
dedup.md5
Named character vector containing the MD5 sums of files that are potentially duplicated in the ArtifactDB backend.
Each name in the vector should be a relative path to a file in dir, while each value of the vector should be its MD5 sum.
Names should not overlap with entries in files.
md5.field
String specifying the metadata field containing the MD5 checksums.
If NULL, it is assumed that no such metadata field exists.
dedup.link
Named character vector specifying which files are duplicates of resources in existing ArtifactDB IDs.
Each name in the vector should be a relative path to a file in dir, while each value of the vector should be the existing ArtifactDB ID.
Names should not overlap with entries in files.
expires
Integer scalar specifying the number of days before the files expire and are removed from the ArtifactDB instance. By default, the files have no expiry date.
user.agent
String containing a user agent string.
If NULL, a default user agent is used.
override.key
String containing an override key that allows uploads regardless of the (authenticated) user.
url
String containing the URL to the REST API.
project
String containing the project name.
version
String containing the version.
initial
The response object returned by initializeUpload, or a list generated by calling content on the response.
attempts
Integer scalar specifying the number of upload attempts on each file, in case of connection drops or timeouts. Each attempt is followed by a 1 minute wait.
verbose
Logical scalar indicating whether to report upload progress.
index.wait
Numeric scalar specifying the number of seconds to wait when checking for correct indexing.
must.index
Logical scalar indicating whether to throw an error if the indexing is not successful.
If FALSE, a warning is raised if the indexing times out (but an error will still be raised if the indexing explicitly fails).
permissions
A list containing permission information, see getPermissions.
This usually contains the owners and viewers character vectors.
overwrite.permissions
Logical scalar indicating whether existing permissions should be overwritten.
initializeUpload will return the response object from hitting the upload endpoint.
This will have already been checked for failure.
If parsed as JSON, this produces a list with either a presigned URL or a link URL for each file, depending on whether deduplication was requested and successful.
uploadFiles will upload all files to their URLs and return NULL.
completeUpload will return the response object from hitting the completion endpoint.
This will have already been checked for failure.
abortUpload will return the response object from hitting the abort endpoint.
This will not have been checked for failure, not least because this function is typically called in response to other errors during the upload;
we leave it to the caller to decide whether to add another error to the trace.
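The caller-side pattern described for abortUpload can be sketched in base R as follows. This is only an illustration: the helper withAbortOnError and its upload and abort arguments are hypothetical stand-ins for calls to uploadFiles and abortUpload, and are not part of this package.

```r
# Hypothetical helper: run 'upload', and if it fails, call 'abort' before
# re-raising the original error.
withAbortOnError <- function(upload, abort) {
    tryCatch(upload(), error = function(e) {
        # The abort request is best-effort; a failure here is downgraded to a
        # warning so that it does not mask the original upload error.
        tryCatch(abort(), error = function(e2) {
            warning("abort request failed: ", conditionMessage(e2))
        })
        stop(e)
    })
}

# Mock stand-ins, so the sketch runs without a live ArtifactDB instance.
aborted <- FALSE
res <- try(withAbortOnError(
    upload = function() stop("connection dropped"),
    abort = function() { aborted <<- TRUE; invisible(NULL) }
), silent = TRUE)
```

Here the original upload error is still propagated to the caller, while the outcome of the abort request is reported separately.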
createUploadStartURL will create the “standard” upload URL to be used in start.url.
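As a rough illustration of the “standard” URL layout, the sketch below reproduces the pattern seen in the example output of this page. The function sketchUploadStartURL is hypothetical and inferred only from that output; the real createUploadStartURL may encode or validate its inputs differently, and the base URL is assumed to have no trailing slash.

```r
# Hypothetical re-implementation, inferred only from the example output on
# this page; not the package's actual code.
sketchUploadStartURL <- function(url, project, version) {
    paste0(url, "/projects/", project, "/version/", version, "/upload")
}

# Assumed base URL, taken from the example output shown in the examples.
out <- sketchUploadStartURL(
    "https://gypsum-test.aaron-lun.workers.dev",
    "test-zircon-upload",
    "base2"
)
```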
Use of these utilities will almost always require appropriate authentication/authorization with the target API.
Developers should ensure that identityHeaders and friends are set accordingly.
Setting expires in initializeUpload is useful for testing the upload procedure without creating a permanent copy of the files.
Project versions created with expires will be automatically removed after the expiry interval, freeing up space for real data.
All files with names ending in .json are assumed to be metadata documents in JSON format.
File artifacts should not have names ending with .json; any JSON-formatted artifacts should be renamed to avoid confusion.
For each metadata document, the JSON object should contain a path property that specifies the file artifact corresponding to this metadata.
In most cases, the file artifact should have a path equal to that of the metadata document after stripping the .json extension.
The only exception is when the path points to the metadata document itself, for certain “metadata-only” artifacts that have no contents.
For any metadata document that specifies a separate file artifact, it should also contain the MD5 checksum in the md5.field.
This is used for integrity checks during upload and for potential deduplication (see below).
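The naming convention above can be checked before upload with a few lines of base R. This is a minimal sketch that looks only at file names; it does not read the path property inside each JSON document, so metadata-only artifacts would be reported as missing an artifact and need to be handled separately.

```r
# File listing as produced by list.files() on the mock project in the examples.
f <- c("blah.txt", "blah.txt.json", "foo/bar.txt", "foo/bar.txt.json",
       "whee.txt", "whee.txt.json")

# Split into metadata documents and file artifacts by the .json suffix.
is.meta <- endsWith(f, ".json")
meta <- f[is.meta]
artifacts <- f[!is.meta]

# Each metadata document is expected to describe the artifact whose path is
# obtained by stripping the .json extension.
expected <- sub("\\.json$", "", meta)
missing.artifacts <- setdiff(expected, artifacts)
missing.metadata <- setdiff(artifacts, expected)
```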
ArtifactDB instances are capable of creating links to represent identical files, similar to symbolic links on a typical file system. This avoids the need to explicitly store a duplicate copy of a file in another project or version. It is particularly useful when dealing with new project versions where only a subset of files have changed.
The first deduplication mechanism is implicit and based on the MD5 checksums of the uploaded (non-metadata) files.
When auto.dedup.md5=TRUE or dedup.md5 is supplied, the ArtifactDB backend will check the most recent previous version of the project (if any exists) for files with the same path and checksum.
If such files are found, the backend will automatically create a link to that file in the previous version, avoiding the need to upload a new copy in uploadFiles.
auto.dedup.md5 will extract/compute the MD5 checksums from files, while dedup.md5 allows users to specify the MD5 checksum per file manually if this is so desired.
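A named vector of the form expected by dedup.md5 could be assembled as in the sketch below, which uses tools::md5sum from base R rather than the digest package used in the examples. The directory and file contents are throwaway placeholders created only for illustration.

```r
# Create a small placeholder project directory.
dir <- tempfile()
dir.create(file.path(dir, "foo"), recursive = TRUE)
writeLines("first artifact", file.path(dir, "blah.txt"))
writeLines("second artifact", file.path(dir, "foo", "bar.txt"))

# Paths of the candidate duplicates, relative to 'dir'.
candidates <- c("blah.txt", "foo/bar.txt")

# md5sum() names its output by the full path, so the names are replaced
# with the relative paths.
md5 <- tools::md5sum(file.path(dir, candidates))
names(md5) <- candidates
```

The files named in this vector would then be left out of files when calling initializeUpload, since the two sets of names should not overlap.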
The second mechanism is more explicit, relying on the uploader to supply the ArtifactDB identifier of the existing file to be linked to.
This involves a bit more work on the part of the uploader but is more powerful than the MD5-based method as it can be used to link files across different projects.
Linked files can either be specified directly via dedup.link or they can be detected from files as placeholder symlinks generated by createPlaceholderLink.
Developers can check whether links were successfully created by inspecting the content of the initializeUpload output.
If the to-be-linked file appears in the links field, the link was created; otherwise, the file must be uploaded via the presigned_urls.
This is useful for double-checking that the MD5 sums were correctly recognized by the API.
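The check described above might look like the following sketch. The parsed list here is a hand-written mock: only the links and presigned_urls field names come from this page, while the per-entry filename and url elements and the URLs themselves are assumptions made for illustration and should be verified against a real response.

```r
# Mock of a parsed initializeUpload response (assumed shape, placeholder URLs).
parsed <- list(
    presigned_urls = list(
        list(filename = "foo/bar.txt", url = "https://example.com/presigned/foo/bar.txt")
    ),
    links = list(
        list(filename = "blah.txt", url = "https://example.com/link/blah.txt")
    )
)

# Files for which deduplication was requested.
requested <- c("blah.txt", "foo/bar.txt")

# Compare the request against what the backend reports as linked.
linked <- vapply(parsed$links, function(x) x$filename, "")
to.upload <- vapply(parsed$presigned_urls, function(x) x$filename, "")
not.linked <- setdiff(requested, linked)
```

Any entries in not.linked were not recognized as duplicates and would be uploaded in full.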
Deduplication can only be performed for file artifacts, not for metadata documents ending with .json.
Any attempt to deduplicate metadata will cause an error.
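These constraints can be checked locally before contacting the API. The helper checkDedupNames below is hypothetical and not part of this package; it only encodes the two rules stated on this page, namely that deduplicated names must not be metadata documents and must not overlap with files.

```r
# Hypothetical pre-flight check for a dedup.md5 or dedup.link vector.
checkDedupNames <- function(files, dedup) {
    nms <- names(dedup)
    if (is.null(nms)) {
        stop("deduplication vector should be named by relative path")
    }
    if (any(endsWith(nms, ".json"))) {
        stop("metadata documents cannot be deduplicated")
    }
    if (length(intersect(nms, files)) > 0L) {
        stop("deduplicated files should not also appear in 'files'")
    }
    invisible(TRUE)
}

files <- c("blah.txt.json", "whee.txt", "whee.txt.json")
ok <- checkDedupNames(files, c(blah.txt = "placeholder-checksum"))
bad <- try(checkDedupNames(files, c(blah.txt.json = "placeholder-checksum")), silent = TRUE)
```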
uploadProject, for a more user-friendly wrapper around these utilities.
# Creating a mock project.
src <- system.file("scripts", "mock.R", package="zircon")
source(src)
tmp <- tempfile()
createMockProject(tmp)
f <- list.files(tmp, recursive=TRUE)
f
#> [1] "blah.txt" "blah.txt.json" "foo/bar.txt" "foo/bar.txt.json"
#> [5] "whee.txt" "whee.txt.json"
start.url <- createUploadStartURL(example.url, "test-zircon-upload", "base2")
start.url
#> [1] "https://gypsum-test.aaron-lun.workers.dev/projects/test-zircon-upload/version/base2/upload"
# Basic upload sequence, for testing:
if (FALSE) {
    init <- initializeUpload(tmp, f, start.url, expires=1)
    parsed <- httr::content(init)
    uploadFiles(tmp, example.url, parsed)
    completeUpload(example.url, parsed)
}
# Demonstrating how to create links:
actual.files <- f[!endsWith(f, ".json")]
md5 <- digest::digest(file=file.path(tmp, actual.files[1]))
names(md5) <- actual.files[1]
explicit <- packID(example.project, actual.files[2], example.version)
names(explicit) <- actual.files[2]
if (FALSE) {
    start.url <- createUploadStartURL(example.url, "test-zircon-upload", "relinked")
    # Files handled via deduplication are excluded from 'files'.
    init <- initializeUpload(tmp, start.url=start.url, expires=1,
        files=setdiff(f, c(names(md5), names(explicit))),
        dedup.md5=md5, md5.field="md5sum", dedup.link=explicit)
    parsed <- httr::content(init)
    uploadFiles(tmp, example.url, parsed)
    completeUpload(example.url, parsed)
}