
Data deduplication framework?

I want to integrate data deduplication into software that I am writing to back up VMware images. I haven't been able to find anything suitable for what I think I need. There seem to be a LOT of complete solutions that include one form of deduplication or another: storage or backup solutions that use public or private clouds, specialized file systems, storage networks or devices, and so on. However, I need to develop my own solution and integrate dedupe into that. My software will be written in C#, and I would like to be able to call an API to tell it what to dedupe.

The type of deduplication I am talking about is not deduping one image against another image--typically the approach for producing incremental or differential backups of two "versions" of something, or what is called "client backup deduplication" in the Wikipedia entry on data deduplication. I already have a solution for that and want to take things a step further.

I envisage an approach that would let me dedupe chunks of data at a global level (i.e. some form of global deduplication). To be global, I imagine there would be a central lookup table of some sort (e.g. an index of hashes) that tells the deduper that a copy of the data being examined is already held and does not need to be stored again. The chunks could be file-level (Single Instance Storage, or SIS) or sub-file/block-level. The latter should be more space-efficient (which matters more for our purposes than, say, processing overhead) and would be my preferred option, but I could make SIS work too if I had to.
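
Roughly, this is the kind of API I have in mind, as a minimal C# sketch: fixed-size chunks hashed with SHA-256 and looked up in a global index. An in-memory Dictionary and a single seekable chunk store stand in for whatever persistent index and storage a real system would need; all names and constants are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

// Minimal sketch: fixed-size chunks, SHA-256 hashes, and a global index.
// The Dictionary stands in for a real persistent index, and chunkStore is
// assumed to be a single seekable, append-only stream.
class GlobalDedupSketch
{
    const int ChunkSize = 64 * 1024; // 64 KiB chunks (illustrative)

    // Global lookup table: chunk hash -> offset of the stored copy.
    readonly Dictionary<string, long> index = new Dictionary<string, long>();

    // Stores only chunks the index has not seen before and returns the
    // ordered list of chunk hashes (the "recipe" to rebuild the image).
    public List<string> Deduplicate(Stream input, Stream chunkStore)
    {
        var recipe = new List<string>();
        var buffer = new byte[ChunkSize];
        using (var sha = SHA256.Create())
        {
            int read;
            while ((read = ReadFull(input, buffer)) > 0)
            {
                string hash = Convert.ToBase64String(sha.ComputeHash(buffer, 0, read));
                if (!index.ContainsKey(hash))
                {
                    index[hash] = chunkStore.Position; // first copy: store it
                    chunkStore.Write(buffer, 0, read);
                }
                recipe.Add(hash);
            }
        }
        return recipe;
    }

    // Fills the buffer as far as possible, so chunk boundaries stay stable
    // even when Stream.Read returns short counts.
    static int ReadFull(Stream s, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int n = s.Read(buffer, total, buffer.Length - total);
            if (n == 0) break;
            total += n;
        }
        return total;
    }
}
```

Restoring an image would then be a matter of walking the returned hash list and reading each chunk back from the store at its recorded offset.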

I have now done a lot of reading about other people's software that does deduping, as I mentioned above. I won't cite examples here because I am not trying to emulate anyone else's approach specifically. Rather, I haven't been able to find a programmer's solution and want to know if there is anything like that available. The alternative would be to roll my own, but that would be a pretty big task, to put it mildly.

Thanks.

asked Nov 04 '22 by stifin


1 Answer

Global deduplication as you have described is typically handled outside of most typical virtual machine backup programs, because Changed Block Tracking (CBT) already tells you which blocks changed in a VM, so you don't have to take a full backup every time. Global dedupe also tends to be resource-intensive, so most folks would just get a Data Domain appliance instead and take advantage of hardware (SSDs) and software (custom filesystems, variable-length dedupe) that are dedicated, configured, and optimized for deduping. Conceivably, the backup program you are creating could take advantage of both CBT and Data Domain's offerings, in the way some commercially available backup software such as Veeam already does. Data Domain's dedupe strategy is built on variable-length segments.
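
If you do end up rolling your own block-level dedupe, variable-length segmentation is the part that makes it resilient to shifted data. Here is a minimal C# sketch of the idea, content-defined chunking with a simple polynomial rolling hash; the window size, multiplier, and mask are illustrative assumptions, not anything Data Domain publishes:

```csharp
using System;
using System.Collections.Generic;

// Minimal sketch of content-defined (variable-length) chunking. A rolling
// hash over a small sliding window picks boundaries from the data itself,
// so an insertion near the start of an image shifts only nearby boundaries
// instead of misaligning every fixed-size block after it. The polynomial
// hash and all constants are illustrative; real systems use Rabin
// fingerprints or Buzhash.
static class ContentDefinedChunker
{
    const int Window = 48;            // sliding-window size in bytes
    const uint Prime = 0x01000193;    // hash multiplier
    const uint BoundaryMask = 0x1FFF; // ~8 KiB average chunks (2^13)
    const int MinChunk = 2 * 1024;    // suppress tiny chunks
    const int MaxChunk = 64 * 1024;   // force a boundary eventually

    // Factor needed to remove the byte that falls out of the window.
    static readonly uint OutFactor = Pow(Prime, Window - 1);

    static uint Pow(uint b, int e)
    {
        uint r = 1;
        for (int i = 0; i < e; i++) r *= b; // wraps mod 2^32, by design
        return r;
    }

    public static IEnumerable<ArraySegment<byte>> Chunk(byte[] data)
    {
        int start = 0;
        uint hash = 0;
        for (int i = 0; i < data.Length; i++)
        {
            if (i - start >= Window)
                hash -= data[i - Window] * OutFactor; // drop outgoing byte
            hash = hash * Prime + data[i];            // add incoming byte

            int len = i - start + 1;
            if ((len >= MinChunk && (hash & BoundaryMask) == BoundaryMask)
                || len >= MaxChunk)
            {
                yield return new ArraySegment<byte>(data, start, len);
                start = i + 1;
                hash = 0; // restart the window at the new chunk
            }
        }
        if (start < data.Length) // emit the trailing partial chunk
            yield return new ArraySegment<byte>(data, start, data.Length - start);
    }
}
```

Each emitted chunk would then go through the same hash-and-index lookup as in the fixed-size case; the win is that chunk boundaries survive inserts and deletes elsewhere in the image.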

answered Nov 09 '22 by borgified