Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

why is LZMA SDK (7-zip) so slow

Tags:

I found 7-zip great and I will like to use it on .net applications. I have a 10MB file (a.001) and it takes:

enter image description here

2 seconds to encode.

Now it will be nice if I could do the same thing on c#. I have downloaded http://www.7-zip.org/sdk.html LZMA SDK c# source code. I basically copied the CS directory into a console application in visual studio: enter image description here

Then I compiled and eveything compiled smoothly. So on the output directory I placed the file a.001 which is 10MB of size. On the main method that came on the source code I placed:

[STAThread]     static int Main(string[] args)     {         // e stands for encode         args = "e a.001 output.7z".Split(' '); // added this line for debug          try         {             return Main2(args);         }         catch (Exception e)         {             Console.WriteLine("{0} Caught exception #1.", e);             // throw e;             return 1;         }     } 

when I execute the console application the application works great and I get the output a.7z on the working directory. The problem is that it takes so long. It takes about 15 seconds to execute! I have also tried https://stackoverflow.com/a/8775927/637142 approach and it also takes very long. Why is it 10 times slower than the actual program ?

Also

Even if I set to use only one thread: enter image description here

It still takes much less time (3 seconds vs 15):


(Edit) Another Possibility

Could it be because C# is slower than assembly or C ? I notice that the algorithm does a lot of heavy operations. For example compare these two blocks of code. They both do the same thing:

C

#include <time.h> #include<stdio.h>  void main() {     time_t now;       int i,j,k,x;     long counter ;      counter = 0;      now = time(NULL);      /* LOOP  */     for(x=0; x<10; x++)     {         counter = -1234567890 + x+2;          for (j = 0; j < 10000; j++)                  for(i = 0; i< 1000; i++)                                 for(k =0; k<1000; k++)                 {                     if(counter > 10000)                         counter = counter - 9999;                     else                         counter= counter +1;                 }          printf (" %d  \n", time(NULL) - now); // display elapsed time     }       printf("counter = %d\n\n",counter); // display result of counter              printf ("Elapsed time = %d seconds ", time(NULL) - now);     gets("Wait"); } 

output

enter image description here

c#

static void Main(string[] args) {            DateTime now;      int i, j, k, x;     long counter;      counter = 0;      now = DateTime.Now;      /* LOOP  */     for (x = 0; x < 10; x++)     {         counter = -1234567890 + x + 2;          for (j = 0; j < 10000; j++)                         for (i = 0; i < 1000; i++)                                 for (k = 0; k < 1000; k++)                 {                     if (counter > 10000)                         counter = counter - 9999;                     else                         counter = counter + 1;                 }           Console.WriteLine((DateTime.Now - now).Seconds.ToString());                 }      Console.Write("counter = {0} \n", counter.ToString());     Console.Write("Elapsed time = {0} seconds", DateTime.Now - now);     Console.Read(); } 

Output

enter image description here

Note how much slower was c#. Both programs where run from outside visual studio on release mode. Maybe that is the reason why it takes so much longer in .net than on c++.

Also I got the same results. C# was 3 times slower just like on the example I just showed!


Conclusion

I cannot seem to know what is causing the problem. I guess I will use 7z.dll and invoke the necessary methods from c#. A library that does that is at: http://sevenzipsharp.codeplex.com/ and that way I am using the same library that 7zip is using as:

    // dont forget to add reference to SevenZipSharp located on the link I provided     static void Main(string[] args)     {         // load the dll         SevenZip.SevenZipCompressor.SetLibraryPath(@"C:\Program Files (x86)\7-Zip\7z.dll");          SevenZip.SevenZipCompressor compress = new SevenZip.SevenZipCompressor();          compress.CompressDirectory("MyFolderToArchive", "output.7z");       } 
like image 391
Tono Nam Avatar asked Sep 06 '12 03:09

Tono Nam


People also ask

Why does 7 Zip take so long?

7-Zip uses all CPU cores of your system. Compressing large files will cause your CPU to run constantly with 100% on all cores. Therefore the heat is accumulating and may overload your cooling system after a certain time which forces the CPU to run slower and slower...

What is Lzma SDK?

The LZMA SDK provides the documentation, samples, header files, libraries, and tools you need to develop applications that use LZMA compression.


2 Answers

I ran a profiler on the code, and the most expensive operation appears to be in searching for matches. In C#, it's searching a single byte at a time. There are two functions (GetMatches and Skip) in LzBinTree.cs which contain the following code snippet, and it spends something like 40-60% of its time on this code:

if (_bufferBase[pby1 + len] == _bufferBase[cur + len]) {     while (++len != lenLimit)         if (_bufferBase[pby1 + len] != _bufferBase[cur + len])             break; 

It's basically trying to find the match length a single byte at a time. I extracted that into its own method:

if (GetMatchLength(lenLimit, cur, pby1, ref len)) { 

And if you use unsafe code and cast the byte* to a ulong* and compare 8 bytes at a time instead of 1, the speed almost doubled for my test data (in a 64 bit process):

private bool GetMatchLength(UInt32 lenLimit, UInt32 cur, UInt32 pby1, ref UInt32 len) {     if (_bufferBase[pby1 + len] != _bufferBase[cur + len])         return false;     len++;      // This method works with or without the following line, but with it,     // it runs much much faster:     GetMatchLengthUnsafe(lenLimit, cur, pby1, ref len);      while (len != lenLimit         && _bufferBase[pby1 + len] == _bufferBase[cur + len])     {         len++;     }     return true; }  private unsafe void GetMatchLengthUnsafe(UInt32 lenLimit, UInt32 cur, UInt32 pby1, ref UInt32 len) {     const int size = sizeof(ulong);     if (lenLimit < size)         return;     lenLimit -= size - 1;     fixed (byte* p1 = &_bufferBase[cur])     fixed (byte* p2 = &_bufferBase[pby1])     {         while (len < lenLimit)         {             if (*((ulong*)(p1 + len)) == *((ulong*)(p2 + len)))             {                 len += size;             }             else                 return;         }     } } 
like image 64
Bryce Wagner Avatar answered Oct 12 '22 22:10

Bryce Wagner


This kind of binary-arithmetic and branching-heavy code is what C-compilers love and what the .NET JIT hates. The .NET JIT is not a very smart compiler. It is optimized for fast compilation. If Microsoft wanted to tune it for maximum performance they would plug in the VC++ backend, but then intentionally don't.

Also, I can tell by the speed you are getting with 7z.exe (6MB/s) that you are using multiple cores, probably using LZMA2. My fast core i7 can deliver 2MB/s per core so I guess 7z.exe is running multi-threaded for you. Try turning on threading in the 7zip-library if that is possible.

I recommend that instead of using the managed code LZMA-algorithm you either use a natively compiled library or call 7z.exe using Process.Start. The latter one should get you started very quickly with good results.

like image 38
usr Avatar answered Oct 12 '22 20:10

usr