Utilities for deduplicating objects inside saveObject to save time and/or storage space.
createDedupSession()
checkObjectInDedupSession(x, session)
addObjectToDedupSession(x, session, path)
x: Some object, typically S4.
session: Session object created by createDedupSession.
path: String containing the absolute path to the directory in which x is to be saved.
This will be used for deduplication if another object is identified as a duplicate of x.
If a relative path is provided, it will be converted to an absolute path.
createDedupSession will return a deduplication session that can be modified in-place.
If x is a duplicate of an object in session, checkObjectInDedupSession will return a string containing the absolute path to a directory representing that object.
Otherwise, it will return NULL.
addObjectToDedupSession will add x to session with the supplied path.
It returns NULL invisibly.
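As a rough illustration of the return values and the in-place behaviour described above, the sketch below registers an object from inside a helper function and queries the session before and after. The register helper and the temporary directory are hypothetical names introduced only for this example, and the sketch assumes that alabaster.base and S4Vectors are installed.

```r
library(alabaster.base)
library(S4Vectors)

session <- createDedupSession()
df <- DataFrame(A=1:10, B=1:10)

# Hypothetical location standing in for the directory where 'df' was saved.
saved.dir <- tempfile()
dir.create(saved.dir)

# Hypothetical helper: it modifies the session it is given and returns
# nothing useful, so the caller never reassigns 'session'.
register <- function(obj, sess, path) {
    addObjectToDedupSession(obj, sess, path)
}

before <- checkObjectInDedupSession(df, session) # nothing registered yet
out <- register(df, session, saved.dir)          # captures the invisible return value
after <- checkObjectInDedupSession(df, session)  # 'df' is now known to the session
```

Because the session is modified in place, the change made inside register is visible to the caller without any reassignment.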
These utilities allow extension developers to support deduplication of objects in a top-level call to saveObject.
For a given saveObject method, we can:
1. Accept a session object in an optional <PREFIX>.dedup.session= argument.
   We may also accept a <PREFIX>.dedup.action= argument to specify how any deduplication should be performed.
   The <PREFIX> should be specific to the method, to avoid conflicts between multiple deduplication sessions.
2. If a session argument is provided, call checkObjectInDedupSession(x, session) to see if x is a duplicate of an existing object in the session.
   If a path is returned, call cloneDirectory to clone that directory to the new path, and return.
3. Otherwise, save this object to disk, possibly passing the session argument as <PREFIX>.dedup.session= in further calls to saveObject for child objects.
   Then call addObjectToDedupSession to add the current object to the session.
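The forwarding of a session to child objects can be sketched as follows. This is a minimal example under stated assumptions, not how any existing saveObject method is written. saveChild, saveParent, the demo prefix and the child.rds file name are all hypothetical, and saveRDS stands in for real saving code. It assumes that alabaster.base and S4Vectors are installed.

```r
library(alabaster.base)
library(S4Vectors)

# Hypothetical child saver that follows the steps listed above.
saveChild <- function(x, path, demo.dedup.session=NULL, demo.dedup.action="link") {
    if (!is.null(demo.dedup.session)) {
        original <- checkObjectInDedupSession(x, demo.dedup.session)
        if (!is.null(original)) {
            # Duplicate found: clone the earlier directory instead of saving again.
            cloneDirectory(original, path, demo.dedup.action)
            return(invisible(NULL))
        }
    }
    dir.create(path)
    saveRDS(x, file.path(path, "child.rds")) # placeholder for real saving code
    if (!is.null(demo.dedup.session)) {
        addObjectToDedupSession(x, demo.dedup.session, path)
    }
    invisible(NULL)
}

# Hypothetical parent saver: it forwards the same session to every child,
# so identical children are only written once.
saveParent <- function(children, path, demo.dedup.session=NULL, demo.dedup.action="link") {
    dir.create(path)
    for (i in seq_along(children)) {
        saveChild(
            children[[i]],
            file.path(path, names(children)[i]),
            demo.dedup.session=demo.dedup.session,
            demo.dedup.action=demo.dedup.action
        )
    }
    invisible(NULL)
}

df <- DataFrame(A=1:10, B=1:10)
tmp <- tempfile()
session <- createDedupSession()
saveParent(list(first=df, second=df), tmp, demo.dedup.session=session)
files <- list.files(tmp, recursive=TRUE)
```

Only the first child goes through saveRDS here; the second is produced by cloneDirectory from the first child's directory.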
A user can enable deduplication by passing the output of createDedupSession to <PREFIX>.dedup.session= in the top-level call to saveObject.
This is typically done when saving SummarizedExperiment objects with multiple assays, where one assay consists of delayed operations on another assay.
test <- function(x, path, test.dedup.session=NULL, test.dedup.action="link") {
    if (!is.null(test.dedup.session)) {
        original <- checkObjectInDedupSession(x, test.dedup.session)
        if (!is.null(original)) {
            cloneDirectory(original, path, test.dedup.action)
            return(invisible(NULL))
        }
    }
    dir.create(path)
    saveRDS(x, file.path(path, "whee.rds")) # replace this with actual saving code.
    if (!is.null(test.dedup.session)) {
        addObjectToDedupSession(x, test.dedup.session, path)
    }
}
library(S4Vectors)
y <- DataFrame(A=1:10, B=1:10)
tmp <- tempfile()
dir.create(tmp)
# Saving the first instance of the object, which is now stored in the session.
session <- createDedupSession()
checkObjectInDedupSession(y, session) # no duplicates yet.
#> NULL
test(y, file.path(tmp, "first"), test.dedup.session=session)
# Saving it again will trigger the deduplication.
checkObjectInDedupSession(y, session)
#> [1] "/tmp/Rtmp3KXhdh/file2e26d59c909/first"
test(y, file.path(tmp, "duplicate"), test.dedup.session=session)
list.files(tmp, recursive=TRUE)
#> [1] "duplicate/whee.rds" "first/whee.rds"