
Compressing a folder with many duplicated files [closed]

I have a pretty big folder (~10GB) that contains many duplicated files throughout its directory tree. Many of these files are duplicated up to 10 times. The duplicated files don't reside side by side, but within different sub-directories.

How can I compress the folder to make it small enough?

I tried to use WinRAR in "Best" mode, but it didn't compress it at all. (Pretty strange.)

Will zip, tar, cab, 7z, or any other compression tool do a better job?

I don't mind letting the tool work for a few hours - but not more.

I'd rather not do it programmatically myself.

user972014 asked Dec 13 '14 09:12





2 Answers

The best option in your case is 7-zip. Here are the options:

7za a -r -t7z -m0=lzma2 -mx=9 -mfb=273 -md=29 -ms=8g -mmt=off -mmtf=off -mqs=on -bt -bb3 archive_file_name.7z /path/to/files

a - add files to archive

-r - Recurse subdirectories

-t7z - Set type of archive (7z in your case)

-m0=lzma2 - Sets the compression method to LZMA2. LZMA is the default, general-purpose compression method of the 7z format. The main features of the LZMA method:

  • High compression ratio
  • Variable dictionary size (up to 4 GB)
  • Compressing speed: about 1 MB/s on 2 GHz CPU
  • Decompressing speed: about 10-20 MB/s on 2 GHz CPU
  • Small memory requirements for decompressing (depends on dictionary size)
  • Small code size for decompressing: about 5 KB
  • Supports multi-threading and P4's hyper-threading

-mx=9 - Sets the compression level. x=0 means Copy mode (no compression); x=9 means Ultra.

-mfb=273 - Sets number of fast bytes for LZMA. It can be in the range from 5 to 273. The default value is 32 for normal mode and 64 for maximum and ultra modes. Usually, a big number gives a little bit better compression ratio and slower compression process.

-md=29 - Sets the dictionary size for LZMA. You must specify the size in bytes, kilobytes, or megabytes. The maximum value for dictionary size is 1536 MB, but the 32-bit version of 7-Zip allows you to specify a dictionary of up to 128 MB. Default values for LZMA are 24 (16 MB) in normal mode, 25 (32 MB) in maximum mode (-mx=7) and 26 (64 MB) in ultra mode (-mx=9). If you do not specify any symbol from the set [b|k|m|g], the dictionary size is calculated as DictionarySize = 2^Size bytes, so -md=29 means 2^29 bytes = 512 MB. To decompress a file compressed by the LZMA method with dictionary size N, you need about N bytes of memory (RAM) available.
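
To make the 2^Size rule concrete, here is a small shell sketch that converts a suffix-less -md value into bytes and MiB. The function name dict_size_bytes is made up for illustration and is not part of 7-Zip.

```shell
#!/bin/sh
# Convert a 7-Zip -md exponent (given without a b/k/m/g suffix) into bytes.
# dict_size_bytes is a hypothetical helper, not a 7-Zip command.
dict_size_bytes() {
    echo $((1 << $1))
}

# Print the sizes for the default exponents mentioned above and for -md=29.
for size in 24 25 26 29; do
    bytes=$(dict_size_bytes "$size")
    echo "-md=$size -> $bytes bytes ($((bytes / 1024 / 1024)) MiB)"
done
```

The last line of output shows that -md=29 requests a 512 MiB dictionary.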

I use -md=29 because my server has only 16 GB of RAM. With these settings 7-zip takes only 5 GB, whatever the size of the directory being archived. If I use a bigger dictionary size, the system goes to swap.

-ms=8g - Sets the solid block size (here 8 GB); the -ms switch also enables or disables solid mode. The default mode is s=on. In solid mode, files are grouped together. Usually, compressing in solid mode improves the compression ratio. In your case it is very important to make the solid block as big as possible, so that the duplicated files end up in the same block.

Limiting the solid block size usually decreases the compression ratio. Updating solid .7z archives can be slow, since it can require some recompression.

-mmt=off - Sets multithreading mode to OFF. You need to switch it off because similar or identical files must be processed by the same 7-zip thread in one solid block. The drawback is slow archiving, no matter how many CPUs or cores your system has.

-mmtf=off - Set multithreading mode for filters to OFF.

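-myx=9 - Sets the level of file analysis to maximum: analysis of all files (Delta and executable filters). Note that this switch does not appear in the command above; add it there if you want it applied.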

-mqs=on - Sorts files by type in solid archives, so that identical files are stored together.

-bt - Shows execution time statistics

-bb3 - Sets the output log level

Ara Saahov answered Sep 30 '22 17:09



7-zip supports the 'WIM' file format which will detect and 'compress' duplicates. If you're using the 7-zip GUI then you simply select the 'wim' file format.
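
As a rough way to see how much a deduplicating format could save on a given tree, the sketch below groups byte-identical files by SHA-256 checksum. It only illustrates the idea of detecting duplicates by content and is not a description of how 7-zip or WIM works internally. find_duplicates is a hypothetical helper name, and the sketch assumes GNU coreutils (sha256sum, uniq -w) and file names without newlines or backslashes.

```shell
#!/bin/sh
# Print groups of byte-identical files under a directory, one group per
# blank-line-separated block, by comparing SHA-256 checksums.
# find_duplicates is a hypothetical helper; it needs GNU sha256sum and uniq.
find_duplicates() {
    find "$1" -type f -exec sha256sum {} + |
        sort |
        uniq -w 64 --all-repeated=separate   # compare only the 64-char hash
}
```

Running find_duplicates /path/to/files prints each checksum followed by a file path; files that have no identical copy elsewhere in the tree are not listed.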

If you're using command-line 7-zip, see this answer: https://serverfault.com/questions/483586/backup-files-with-many-duplicated-files
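
For the command-line route, one possibility is a two-step approach: pack the tree into a WIM container first, then compress that container. The sketch below is an assumption rather than the linked answer's exact recipe. It assumes a full 7z binary with WIM write support on the PATH (the standalone 7za may not include it), and wim_then_7z, RUN, and the archive names are hypothetical.

```shell
#!/bin/sh
# Two-step sketch: store the tree in a WIM container, then compress the
# container as 7z. Assumes a `7z` binary that can write WIM archives.
# Set RUN=echo to print the commands instead of executing them.
wim_then_7z() {
    src=$1    # directory to archive
    name=$2   # base name for the output archives (placeholder)
    ${RUN:-} 7z a -twim "$name.wim" "$src" &&
        ${RUN:-} 7z a -t7z -mx=9 "$name.wim.7z" "$name.wim"
}
```

Calling RUN=echo wim_then_7z /path/to/files backup shows the two commands without touching any files, which is a cheap way to check them before a multi-hour run.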

drojf answered Sep 30 '22 18:09
