Utilities for deduplicating objects inside saveObject to save time and/or storage space.

createDedupSession()

checkObjectInDedupSession(x, session)

addObjectToDedupSession(x, session, path)

Arguments

x

Some object, typically S4.

session

Session object created by createDedupSession.

path

String containing the path to the directory in which x is to be saved, preferably absolute. If a relative path is provided, it will be converted to an absolute path. This path is what checkObjectInDedupSession returns when another object is later identified as a duplicate of x.

Value

createDedupSession will return a deduplication session that can be modified in-place.

If x is a duplicate of an object in session, checkObjectInDedupSession will return a string containing the absolute path to a directory representing that object. Otherwise, it will return NULL.

addObjectToDedupSession will add x to session with the supplied path. It returns NULL invisibly.
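
The in-place behaviour described above can be pictured with an R environment, which has reference semantics: a function can update it without returning a modified copy. The following is only a toy sketch under that assumption, not the package's implementation. The names new_toy_session, toy_check and toy_add are hypothetical, duplicates are detected with identical(), and paths are stored as supplied rather than converted to absolute paths.

```r
# Hypothetical stand-in for a deduplication session (illustration only).
new_toy_session <- function() {
    session <- new.env(parent = emptyenv())
    session$objects <- list()       # objects registered so far
    session$paths <- character(0)   # directory recorded for each object
    session
}

# Return the recorded path if 'x' matches a registered object, else NULL.
toy_check <- function(x, session) {
    for (i in seq_along(session$objects)) {
        if (identical(x, session$objects[[i]])) {
            return(session$paths[i])
        }
    }
    NULL
}

# Register 'x' and its path; the environment is updated in place.
toy_add <- function(x, session, path) {
    n <- length(session$objects) + 1L
    session$objects[[n]] <- x
    session$paths[n] <- path
    invisible(NULL)
}

s <- new_toy_session()
toy_check(1:3, s)             # NULL, nothing registered yet
toy_add(1:3, s, "/some/dir")  # 's' changes even though nothing is returned
toy_check(1:3, s)             # "/some/dir"
```

Because the session is an environment, the caller's s reflects the addition without any reassignment, which is why the same session object can be shared across nested calls.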

Details

These utilities allow extension developers to support deduplication of objects in a top-level call to saveObject. For a given saveObject method, we can:

  1. Accept a session object in an optional <PREFIX>.dedup.session= argument. We may also accept a <PREFIX>.dedup.action= argument to specify how any deduplication should be performed, e.g., the action passed to cloneDirectory. The <PREFIX> should be chosen so that these argument names do not conflict between multiple deduplication sessions.

  2. If a session argument is provided, we call checkObjectInDedupSession(x, session) to see if x is a duplicate of an existing object in the session. If a path is returned, we call cloneDirectory to clone that directory to the new location, and then return.

  3. Otherwise, we save this object to disk, possibly passing the session argument as <PREFIX>.dedup.session= in further calls to saveObject for child objects. We call addObjectToDedupSession to add the current object to the session.

A user can enable deduplication by passing the output of createDedupSession to <PREFIX>.dedup.session= in the top-level call to saveObject. This is typically most useful when saving SummarizedExperiment objects with multiple assays, where one assay consists of delayed operations on another assay.
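
The third step in the list above, where the session is handed down to child objects, could look roughly like the sketch below. It assumes that these utilities and cloneDirectory are provided by the alabaster.base package and that S4Vectors is installed. The functions save_child and save_parent, the demo prefix, and the contents.rds file name are hypothetical and exist only for illustration; a real extension would implement saveObject methods instead.

```r
library(alabaster.base)  # assumed source of the dedup utilities

# Hypothetical saver for one child object, following steps 1 to 3.
save_child <- function(x, path, demo.dedup.session=NULL, demo.dedup.action="link") {
    if (!is.null(demo.dedup.session)) {
        original <- checkObjectInDedupSession(x, demo.dedup.session)
        if (!is.null(original)) {
            # A duplicate was saved earlier, so clone its directory.
            cloneDirectory(original, path, demo.dedup.action)
            return(invisible(NULL))
        }
    }
    dir.create(path)
    saveRDS(x, file.path(path, "contents.rds"))  # placeholder for real saving code
    if (!is.null(demo.dedup.session)) {
        addObjectToDedupSession(x, demo.dedup.session, path)
    }
    invisible(NULL)
}

# Hypothetical saver for a parent that is a plain list of children.
# The same session is passed to every child so duplicates can be detected.
save_parent <- function(x, path, demo.dedup.session=NULL, demo.dedup.action="link") {
    dir.create(path)
    for (i in seq_along(x)) {
        save_child(x[[i]], file.path(path, as.character(i)),
            demo.dedup.session=demo.dedup.session,
            demo.dedup.action=demo.dedup.action)
    }
    invisible(NULL)
}

shared <- S4Vectors::DataFrame(A=1:10)
tmp <- tempfile()
save_parent(list(shared, shared), tmp, demo.dedup.session=createDedupSession())
list.files(tmp, recursive=TRUE)
```

Only the first child is written by the saving code; the second is expected to be produced by cloning the first child's directory.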

Author

Aaron Lun

Examples

test <- function(x, path, test.dedup.session=NULL, test.dedup.action="link") {
   if (!is.null(test.dedup.session)) {
       original <- checkObjectInDedupSession(x, test.dedup.session)
       if (!is.null(original)) {
           cloneDirectory(original, path, test.dedup.action)
           return(invisible(NULL))
       }
   }
   dir.create(path)
   saveRDS(x, file.path(path, "whee.rds")) # replace this with actual saving code.
   if (!is.null(test.dedup.session)) {
       addObjectToDedupSession(x, test.dedup.session, path)
   }
}

library(alabaster.base) # assumed to provide the dedup utilities and cloneDirectory.
library(S4Vectors)
y <- DataFrame(A=1:10, B=1:10)
tmp <- tempfile()
dir.create(tmp)

# Saving the first instance of the object, which is now stored in the session.
session <- createDedupSession()
checkObjectInDedupSession(y, session) # no duplicates yet.
#> NULL
test(y, file.path(tmp, "first"), test.dedup.session=session)

# Saving it again will trigger the deduplication.
checkObjectInDedupSession(y, session)
#> [1] "/tmp/Rtmp6XG2G9/file221708cffc5/first"
test(y, file.path(tmp, "duplicate"), test.dedup.session=session)

list.files(tmp, recursive=TRUE)
#> [1] "duplicate/whee.rds" "first/whee.rds"