
Most efficient way to detect changes of a remote CMIS repository?

Tags:

cmis

A remote CMIS repository contains many folders/files.

I am writing software that keeps a local copy of these folders/files in sync.

  1. At first run I just download everything recursively.
  2. At later runs, I check what has changed, and download any changes.

What is the most efficient way to check the remote changes?
(addition/removal of files/folders)
Most efficient = Least bandwidth usage.

I can only use the CMIS protocol, and I cannot run any custom software on the remote server.

My ideas so far:

  • Idea 1: Re-download everything every time.
  • Idea 2: Check the root folder's modification date, hoping modification dates are recursive.
  • Idea 3: Use CMIS search to find all files that are more recent than the last time I synchronized. Problem: that won't tell me which files have been removed.
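For reference, idea 3 could look roughly like the sketch below, which only builds the CMIS query string. The function name and the selected columns are illustrative assumptions; the query relies on the standard cmis:lastModificationDate property and the TIMESTAMP literal of the CMIS query language, and it still cannot report removed files.

```python
from datetime import datetime, timezone


def modified_since_query(last_sync: datetime) -> str:
    """Build a CMIS query for documents modified after last_sync.

    last_sync should be timezone-aware; it is converted to UTC.
    (Hypothetical helper, not part of any CMIS client library.)
    """
    stamp = last_sync.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z")
    return (
        "SELECT cmis:objectId, cmis:name, cmis:lastModificationDate "
        "FROM cmis:document "
        f"WHERE cmis:lastModificationDate > TIMESTAMP '{stamp}'"
    )


if __name__ == "__main__":
    print(modified_since_query(datetime(2022, 12, 27, 18, 12, tzinfo=timezone.utc)))
```

A similar query against cmis:folder would be needed to catch new or renamed folders.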

Any other ideas?
I don't know the CMIS protocol well, so there might be something more convenient.

Nicolas Raoul asked Dec 27 '22 18:12


2 Answers

A better version of idea 3 is easy to build on the change log described in the CMIS specification you posted:

2.1.11 Change Log

CMIS provides a “change log” mechanism to allow applications to easily discover the set of changes that have occurred to objects stored in the repository since a previous point in time. This change log can then be used by applications such as search services that maintain an external index of the repository to efficiently determine how to synchronize their index to the current state of the repository (rather than having to query for all objects currently in the repository).

Entries recorded in the change log are referred to below as “change events”.

Note that change events in the change log MUST be returned in ascending order from the time when the change event occurred.

Using the tools of your choice, do an initial pull of the entire repository and save the time at which the pull was performed. Subsequent checks of the repository (at an interval of your choosing) follow this procedure:

  • Pull down the CMIS change log from the repository.
  • Parse all change events created after the previous pull.
  • Act on each event according to its ChangeType enum value: for example, if the value for an objectId is "deleted", delete that object locally.
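As a rough illustration of the last step, the sketch below collapses an ordered list of change events into a single action per object, so an object that was updated several times is downloaded only once. The ChangeEvent class and plan_actions function are hypothetical names for this example; the change type strings follow the values listed in the CMIS specification (created, updated, deleted, security).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    """Minimal stand-in for one change log entry (hypothetical type)."""
    object_id: str
    change_type: str  # "created", "updated", "deleted" or "security"


def plan_actions(events):
    """Return {object_id: action}, keeping only the latest relevant event.

    Events are assumed to arrive in ascending time order, as the
    specification requires, so a later event overrides an earlier one.
    """
    actions = {}
    for event in events:
        if event.change_type == "deleted":
            actions[event.object_id] = "delete"
        elif event.change_type in ("created", "updated"):
            actions[event.object_id] = "download"
        # "security" events only change permissions: nothing to transfer.
    return actions


if __name__ == "__main__":
    log = [
        ChangeEvent("doc-1", "created"),
        ChangeEvent("doc-1", "updated"),
        ChangeEvent("doc-2", "deleted"),
    ]
    print(plan_actions(log))
```

Since the goal is minimal bandwidth, planning first and downloading afterwards avoids fetching an object that a later event deletes.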
Ellipson answered Jun 03 '23 20:06


Using the repository's change log is the right way to go, but realize that not every repository supports this. For example, for Alfresco you must configure the audit sub-system and you must set audit.cmischangelog.enabled=true in alfresco-global.properties.

To find out if your repo supports changes, you can look at the results of the repository's getCapabilities response. If you see 'Changes' set to 'None', then your repository doesn't support change logs.
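A minimal sketch of that check, assuming the capabilities have already been fetched into a plain dictionary with a 'Changes' key (the exact shape depends on your client library, and supports_change_log is a hypothetical helper):

```python
def supports_change_log(capabilities: dict) -> bool:
    """Return True if the 'Changes' capability is anything other than none.

    Handles a missing key, a Python None, and the string "none" in any
    letter case, since clients differ in how they report the value.
    """
    value = capabilities.get("Changes")
    return value is not None and str(value).strip().lower() != "none"


if __name__ == "__main__":
    print(supports_change_log({"Changes": "objectidsonly"}))
    print(supports_change_log({"Changes": None}))
```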

Assuming it does, you need to ask the repository for its latest change log token. You can get that from getRepositoryInfo. Save that before you call getContentChanges. Then, on the next call, pass in the token. You'll get the changes made since the token was issued.

So, your code needs to:

  1. Check getCapabilities for something other than Changes = None
  2. Save the getRepositoryInfo's latestChangeLogToken
  3. The first time you ask, call getContentChanges with no arguments
  4. The next time you ask, call getContentChanges with the last saved token
  5. You can then process the result set. Each change log entry tells you its type (created, updated, deleted, permissions, etc., see spec for exact values) and provides the cmis:objectId of the changed object.
  6. Repeat from step 2.
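The loop above might be sketched as follows. This is not the cmis-sync script, only an illustration under stated assumptions: the three callables (get_latest_token, get_content_changes, handle_event) are hypothetical wrappers you would write around your CMIS client, and the token is kept in a small JSON file.

```python
import json
from pathlib import Path


def load_token(path: Path):
    """Return the saved change log token, or None on the first run."""
    if path.exists():
        return json.loads(path.read_text())["token"]
    return None


def save_token(path: Path, token) -> None:
    path.write_text(json.dumps({"token": token}))


def sync_once(get_latest_token, get_content_changes, handle_event, token_path: Path) -> int:
    """Run one sync pass and return the number of events processed."""
    previous = load_token(token_path)
    latest = get_latest_token()  # step 2: read the token before fetching changes
    if previous is not None and previous == latest:
        return 0  # token unchanged: skip the change log request entirely
    # Steps 3 and 4: no token on the first run, the saved token afterwards.
    events = list(get_content_changes(previous))
    for event in events:
        handle_event(event)  # step 5: act on each change log entry
    save_token(token_path, latest)
    return len(events)
```

Because the token is read before the changes are fetched, a change made in between can show up again on the next pass, so handle_event should be safe to run twice on the same event.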

I have a "cmis-sync" script that does one-way synchronization using this approach implemented in Python. I've tested it against Alfresco as the source and the OpenCMIS InMemory repository as the target. If there is interest I can make it available.

Jeff Potts answered Jun 03 '23 19:06