
How do popular source control systems differentiate binary files from text files

Looking for articles, documentation, or first-hand knowledge of how different source control systems differentiate (or detect) the type of a file (binary vs. text). Of particular interest is how Git does it vs. Mercurial.

Do they look at: File extensions? File signatures or content (e.g. is this file UTF-8)? A mix of things?

codenheim asked Aug 18 '11 16:08



2 Answers

SVN:

When you first add or import a file into Subversion, the file is examined to determine if it is a binary file. Currently, Subversion just looks at the first 1024 bytes of the file; if any of the bytes are zero, or if more than 15% are not ASCII printing characters, then Subversion calls the file binary. This heuristic might be improved in the future, however.

http://subversion.apache.org/faq.html#binary-files
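
The rule quoted from the FAQ can be sketched roughly as follows. This is an illustration of the description above, not Subversion's actual code: the function name, the `SNIFF_BYTES` constant, and the exact set of bytes counted as "printing" are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name; the 1024-byte limit comes from the FAQ text. */
#define SNIFF_BYTES 1024

/* Returns 1 if the buffer looks binary under the quoted heuristic, else 0. */
int looks_binary_svn_style(const unsigned char *buf, size_t len)
{
    size_t i, odd = 0;

    assert(buf != NULL || len == 0);
    if (len > SNIFF_BYTES)
        len = SNIFF_BYTES;      /* only the first 1024 bytes are examined */

    for (i = 0; i < len; i++) {
        unsigned char c = buf[i];

        if (c == 0)
            return 1;           /* any zero byte means binary */
        /* Assumption: "ASCII printing characters" means 0x20..0x7E plus
           tab, LF and CR; the FAQ does not list them. */
        if (!((c >= 0x20 && c <= 0x7E) || c == '\t' || c == '\n' || c == '\r'))
            odd++;
    }
    /* "more than 15%" checked with integers instead of floating point */
    return odd * 100 > len * 15;
}
```

Calling it on the first chunk read from a file would give the text/binary guess; an empty buffer is treated as text in this sketch.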

Git works in a similar way, although its check is simpler. Git usually guesses correctly whether a blob contains text or binary data by examining the beginning of the contents: it checks for any occurrence of a zero byte (NUL “character”) in the first 8000 bytes.

http://git-scm.com/docs/gitattributes

And from the Git source:

 #define FIRST_FEW_BYTES 8000
 int buffer_is_binary(const char *ptr, unsigned long size)
 {
         if (FIRST_FEW_BYTES < size)
                 size = FIRST_FEW_BYTES;
         return !!memchr(ptr, 0, size);
 }

http://git.kernel.org/?p=git/git.git;a=blob;f=xdiff-interface.c;h=0e2c169227ad29b5bf546c6c1b97e1a1d8ed7409;hb=HEAD
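
One consequence of a NUL-only check is that the encoding of the text matters. The sketch below repeats the same test under a hypothetical name (`has_nul_in_first_bytes`) so that it compiles on its own, and applies it to two hand-written sample buffers. The samples are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FIRST_FEW_BYTES 8000

/* Same idea as the Git snippet above, under a hypothetical name. */
int has_nul_in_first_bytes(const void *ptr, size_t size)
{
    assert(ptr != NULL || size == 0);
    if (size > FIRST_FEW_BYTES)
        size = FIRST_FEW_BYTES;
    return size != 0 && memchr(ptr, 0, size) != NULL;
}

/* "hé\n" in UTF-8: contains non-ASCII bytes, but no NUL byte. */
const unsigned char sample_utf8[] = { 'h', 0xC3, 0xA9, '\n' };

/* "hi" in UTF-16LE: each ASCII character carries a 0x00 high byte. */
const unsigned char sample_utf16le[] = { 'h', 0x00, 'i', 0x00 };
```

With these samples the UTF-8 buffer is classified as text and the UTF-16LE buffer as binary, even though both hold readable text. The check looks only for zero bytes; it does not validate any particular encoding.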

And @tonfa makes a good point: "Also note that the only place where it cares about a file being text vs. binary is for displaying diffs, and for doing merges. The storage format does not care about it."

manojlds answered Oct 13 '22 01:10


Mercurial looks for an occurrence of the null character (\0) in the content of the file. If there is one, the file is considered binary. Otherwise it is considered text, unless explicitly specified otherwise.

I guess Git uses the same approach.
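
A rough sketch of the check as described here, assuming the whole content is scanned rather than only a leading chunk. The function name is hypothetical and this is not Mercurial's own code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: reports 1 if a NUL byte appears anywhere in the data. */
int has_nul_anywhere(const void *data, size_t size)
{
    assert(data != NULL || size == 0);
    /* No size cap: every byte of the content is considered. */
    return size != 0 && memchr(data, 0, size) != NULL;
}
```

Under that assumption, a file whose only NUL byte sits at offset 9000 would be flagged as binary here, while a check limited to the first 8000 bytes would treat it as text.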

gizmo answered Oct 12 '22 23:10