This simply baffles me. I've just downloaded a 1.5GB tarball of the Chrome source code. The same code compiled compresses to about 50MB.
Why is there such a discrepancy between the size of the source code, and the size of the executable?
A list of things that could cause this.
The executable has no need of whitespace, comments or any of the nice formatting stuff. The source code might have TONS of documentation and whitespace just to make the code readable and all this takes up space.
The source code might bring along with it a LOT of other code to test the application. But this test code doesn't ever make it to the final application.
Documentation that is included with the code. Depending on the format, .doc or .docx files, the documentation might be huge.
Someone else mentioned that source control comments might be in the code as well. Including commit messages in the source files can make them larger too.
I don't know how or when you did the file comparison, but if you did it AFTER compiling, you might have included the build artifacts (the *.o files) in your calculation as well. So you might be perceiving the source code to be 1.5GB when it's really only 750MB (roughly speaking).
Depending on the compiler and how good it is, it might generate less assembly code and thus create a smaller file. Although I think most compilers today are reasonable and this shouldn't account for too much size variance (but I could be wrong, I'm not a compiler person).
If the application is statically linked with all of its libraries, it will be bigger because it has to contain its dependencies. However, if the libraries are dynamically linked/loaded, the executable itself might be drastically smaller, since it just links to the libraries at runtime and only loads them as needed.
Was the tarball 1.5GB or was the expanded tarball 1.5GB?
Anyway, lots of factors could be at play here.
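One way to see which of the factors above actually matters for a given checkout is to total the bytes on disk per file extension. The following is only a minimal sketch: the function name `bytes_by_extension` and the "(none)" label are made up for illustration, and it says nothing about how the Chrome tree is really laid out.

```python
import os
from collections import Counter


def bytes_by_extension(root):
    """Sum file sizes under root, grouped by lowercase file extension."""
    totals = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks so a target is not counted twice
            # Files without an extension are grouped under "(none)".
            ext = os.path.splitext(name)[1].lower() or "(none)"
            totals[ext] += os.path.getsize(path)
    return totals
```

Calling `bytes_by_extension("src").most_common(15)` on an unpacked tree would list the fifteen heaviest extensions. A large `.o` total would point to leftover build artifacts, while large image, test data or documentation totals would point to the non-code factors.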
There's an average of 1621 bytes of copyright/license text at the top of every source code file. Chromium (without any svn/git/object/image files) has 73,510 source files (for this purpose, I kept it to .cc, .h, .cpp, .idl, .m, .js, .c and .py).
That's 119,159,710 bytes of just copyright notices.
Or 116,366 kilobytes.
Or about 113 megabytes. Just. In. Copyright notices.
To make matters worse there are open bugs on Chromium indicating they may even be in violation of their own license since they intermix quite a few different flavors and versions of open (and some not so open) licenses. [1]
Sources:
[1] https://code.google.com/p/chromium/issues/detail?id=28291
[2] I work with the chromium source code:
Trevors-Mac:src trevor$ find . -name "*.cc" | wc -l
15941
Trevors-Mac:src trevor$ find . -name "*.h" | wc -l
26125
Trevors-Mac:src trevor$ find . -name "*.cpp" | wc -l
5191
Trevors-Mac:src trevor$ find . -name "*.idl" | wc -l
881
Trevors-Mac:src trevor$ find . -name "*.m" | wc -l
258
Trevors-Mac:src trevor$ find . -name "*.js" | wc -l
13528
Trevors-Mac:src trevor$ find . -name "*.c" | wc -l
7856
Trevors-Mac:src trevor$ find . -name "*.py" | wc -l
3988
Trevors-Mac:src trevor$
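As a sanity check on the arithmetic, here is a short sketch that recomputes the totals from the `find` counts listed above. The 1621-byte average and the 73,510 file count are taken from the answer as quoted; the variable names are only illustrative.

```python
# File counts copied from the find output above.
counts = {
    ".cc": 15941, ".h": 26125, ".cpp": 5191, ".idl": 881,
    ".m": 258, ".js": 13528, ".c": 7856, ".py": 3988,
}
AVG_NOTICE_BYTES = 1621  # average header size quoted in the answer
QUOTED_FILES = 73510     # file count quoted in the answer

listed_total = sum(counts.values())
notice_bytes = QUOTED_FILES * AVG_NOTICE_BYTES

print(listed_total)                       # 73768
print(listed_total - counts[".m"])        # 73510
print(notice_bytes)                       # 119159710
print(notice_bytes // 1024)               # 116366
print(round(notice_bytes / 1024**2, 1))   # 113.6
```

The listed counts add up to 73,768, and the quoted 73,510 matches that sum with the 258 .m files left out, so the quoted figure apparently excludes them. Either way the notices come to roughly 114 MiB.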
Well, put it this way: When you write assembly, you might spell out MOV 0,eax
(or whatever, I don't actually know assembly) and it gets compiled down to just a few bytes.
Higher level languages typically take up more space than their compiled-down machine code, because they need to be human readable. Another example: 2147483647 takes 10 bytes when spelled out in the source code, but only 4 when compiled as a 32-bit integer.
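The 10-versus-4 comparison can be reproduced with a small sketch. It assumes the value ends up stored as a 32-bit signed little-endian integer; real compilers may encode the constant differently depending on the instruction and target.

```python
import struct

literal = "2147483647"

# Size of the number as text in a source file (one byte per ASCII digit).
source_bytes = len(literal.encode("ascii"))

# Size of the same value packed as a 32-bit signed little-endian integer.
packed = struct.pack("<i", int(literal))

print(source_bytes, len(packed))  # 10 4
print(packed.hex())               # ffffff7f
```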