How to determine if Git handles a file as binary or as text?

Tags: git

I know that Git somehow automatically detects if a file is binary or text and that .gitattributes can be used to set this manually if needed. But is there also a way to ask Git how it treats a file?

So let's say I have a Git repository with two files in it: an ascii.dat file containing plain text and a binary.dat file containing random binary data. Git handles the first .dat file as text and the second as binary. Now I want to write a Git web front end which has a viewer for text files and a special viewer for binary files (displaying a hex dump, for example). Sure, I could implement my own text/binary check, but it would be more useful if the viewer relied on how Git itself handles these files.

So how can I ask Git if it treats a file as text or binary?

kayahr asked May 25 '11 05:05


7 Answers

builtin_diff() [1] calls diff_filespec_is_binary(), which calls buffer_is_binary(), which checks for any occurrence of a zero byte (NUL “character”) in the first 8000 bytes (or the entire length if shorter).

I do not see that this “is it binary?” test is explicitly exposed in any command though.
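
For what it's worth, that heuristic is easy to approximate in the shell. This is only a sketch of the NUL-byte check described above, not something Git ships: looks_binary is a made-up name, and it assumes a head that supports -c (GNU and BSD versions do).

```shell
# Hypothetical helper, not part of Git: succeeds (exit 0) if a NUL byte
# occurs in the first 8000 bytes of the file, fails (exit 1) otherwise.
looks_binary() {
    # Byte count of the first 8000 bytes (or of the whole file if shorter).
    total=$(( $(head -c 8000 "$1" | wc -c) ))
    # Byte count of the same data with every NUL byte removed.
    without_nul=$(( $(head -c 8000 "$1" | LC_ALL=C tr -d '\000' | wc -c) ))
    # If the two counts differ, at least one NUL byte was present.
    [ "$total" -ne "$without_nul" ]
}

# Usage:
#   looks_binary file-to-test && echo binary || echo not binary
```

Note that this ignores .gitattributes entirely, so it only mirrors the content-based fallback. An empty or unreadable file comes out as "not binary".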

git merge-file directly uses buffer_is_binary(), so you may be able to make use of it:

git merge-file /dev/null /dev/null file-to-test

It seems to produce an error message like error: Cannot merge binary files: file-to-test and yield an exit status of 255 when given a binary file. I am not sure I would want to rely on this behavior, though.

Maybe git diff --numstat would be more reliable:

isBinary() {
    p=$(printf '%s\t-\t' -)
    t=$(git diff --no-index --numstat /dev/null "$1")
    case "$t" in "$p"*) return 0 ;; esac
    return 1
}
isBinary file-to-test && echo binary || echo not binary

For binary files, the --numstat output should start with - TAB - TAB, so we just test for that.


[1] builtin_diff() has strings like Binary files %s and %s differ that should be familiar.

Chris Johnsen answered Oct 03 '22 08:10


git grep -I --name-only --untracked -e . -- ascii.dat binary.dat ...

will return the names of files that git interprets as text files.

The trick here is in these two git grep parameters:

  • -I: Don’t match the pattern in binary files.
  • -e .: A regular expression that matches any single character, so any file with at least one non-empty line matches.

You can also use wildcards, e.g.:

git grep -I --name-only --untracked -e . -- *.ps1
cstork answered Oct 03 '22 10:10


I don't like this answer, but you can parse the output of git-diff-tree to see if it is binary. For example:

git diff-tree -p 4b825dc642cb6eb9a060e54bf8d69288fbee4904 HEAD -- MegaCli 
diff --git a/megaraid/MegaCli b/megaraid/MegaCli
new file mode 100755
index 0000000..7f0e997
Binary files /dev/null and b/megaraid/MegaCli differ

as opposed to:

git diff-tree -p 4b825dc642cb6eb9a060e54bf8d69288fbee4904 HEAD -- megamgr
diff --git a/megaraid/megamgr b/megaraid/megamgr
new file mode 100755
index 0000000..50fd8a1
--- /dev/null
+++ b/megaraid/megamgr
@@ -0,0 +1,78 @@
+#!/bin/sh
[…]

Oh, and BTW, 4b825d… is a magic SHA which represents the empty tree (it is the SHA for an empty tree, but git is specially aware of this magic).
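
If you do go this route, the parsing can be reduced to looking for the "Binary files ... differ" line. A minimal sketch, assuming the patch text arrives on stdin; patch_says_binary is a name I made up, not a Git command:

```shell
# Hypothetical helper: reads patch text on stdin and succeeds (exit 0)
# if it contains the "Binary files ... differ" marker line.
patch_says_binary() {
    grep -q '^Binary files .* differ$'
}

# Usage, with the empty-tree SHA and file from the example above:
#   git diff-tree -p 4b825dc642cb6eb9a060e54bf8d69288fbee4904 HEAD -- MegaCli \
#       | patch_says_binary && echo binary || echo not binary
```

Content lines in a patch are prefixed with +, - or a space, so a text file that merely contains the words "Binary files" should not trigger a false positive.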

Seth Robertson answered Oct 03 '22 09:10


Use git check-attr --all.

This works regardless of whether the file has been staged or committed.

Tested on git version 2.30.2.

Assuming you have this in .gitattributes:

package-lock.json binary

The output is:

git check-attr --all package-lock.json 
package-lock.json: binary: set
package-lock.json: diff: unset
package-lock.json: merge: unset
package-lock.json: text: unset

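For files with no attributes set, there is no output. In other words, this reports only what has been declared through attributes, not Git's content-based detection: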

git check-attr --all package.json
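
To use this from a script, the "path: attribute: value" lines can be filtered for one attribute. A minimal sketch, assuming the output format shown above; attr_value is a made-up helper name:

```shell
# Hypothetical helper: reads `git check-attr` output on stdin
# ("path: attribute: value" per line) and prints the value reported
# for the attribute named in $1. Prints nothing if it is not listed.
attr_value() {
    # The last two fields are used so that a path containing ": "
    # does not shift the attribute and value columns.
    awk -F': ' -v attr="$1" 'NF >= 3 && $(NF-1) == attr { print $NF }'
}

# Usage:
#   git check-attr --all -- package-lock.json | attr_value binary
```

With the package-lock.json output above this prints set for binary and unset for text; for package.json it prints nothing.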
thnee answered Oct 03 '22 10:10


# considered binary (or with bare CR) file
git ls-files --eol | grep -E '^(i/-text)'

# files that do not have any line-ending characters (including empty files) - unlikely that this is a true binary file ?
git ls-files --eol | grep -E '^(i/none)'

#                                                        via experimentation
#                                                      ------------------------
#    "-text"        binary (or with bare CR) file     : not    auto-normalized
#    "none"         text file without any EOL         : not    auto-normalized
#    "lf"           text file with LF                 : is     auto-normalized when gitattributes text=auto
#    "crlf"         text file with CRLF               : is     auto-normalized when gitattributes text=auto
#    "mixed"        text file with mixed line endings : is     auto-normalized when gitattributes text=auto
#                   (LF or CRLF, but not bare CR)

Source: https://git-scm.com/docs/git-ls-files#Documentation/git-ls-files.txt---eol https://github.com/git/git/commit/a7630bd4274a0dff7cff8b92de3d3f064e321359
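
If you need just the file names rather than the whole --eol line, the tab that separates the path from the rest can be used to split it. A small sketch, assuming the documented "i/<eolinfo> w/<eolinfo> attr/<eolattr> TAB <file>" layout; index_binary_paths is a made-up name:

```shell
# Hypothetical helper: reads `git ls-files --eol` output on stdin and
# prints the paths whose index content is classified as "-text".
index_binary_paths() {
    # Field 1 is the "i/... w/... attr/..." part, field 2 is the path.
    awk -F'\t' '$1 ~ /^i\/-text / { print $2 }'
}

# Usage:
#   git ls-files --eol | index_binary_paths
```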

Oh, by the way: be careful when setting the text attribute in .gitattributes, e.g. *.abc text. In that case all files matching *.abc will be normalized, even if they are binary (any CRLF inside the binary data would be normalized to LF). This is different from the text=auto behaviour.

Quential33 answered Oct 03 '22 08:10


At the risk of getting slapped for poor code quality, I'm listing a C utility, is_binary, built around the original buffer_is_binary() routine in the Git source. Please see the internal comments for how to build and run it. It is easy to modify:

/***********************************************************
 * is_binary.c 
 *
 * Usage: is_binary <pathname>
 *   Returns a 1 if a binary; return a 0 if non-binary
 * 
 * Thanks to Git and Stackoverflow developers for helping with these routines:
 * - the buffer_is_binary() routine from the xdiff-interface.c module 
 *   in git source code.
 * - the read-a-filename-from-stdin route
 * - the read-a-file-into-memory (fill_buffer()) routine
 *
 * To build:
 *    % gcc is_binary.c -o is_binary
 *
 * To build debuggable (to push a few messages to stdout):
 *    % gcc -DDEBUG=1 ./is_binary.c -o is_binary
 *
 * BUGS:
 *  Doesn't work with piped input, like 
 *    % cat foo.tar | is_binary 
 *  because fseek()/ftell() cannot measure a pipe (exit status 4).
 *  An empty file is reported as non-binary, which is what
 *  buffer_is_binary() itself returns for zero bytes.
 *
 * Revision 1.4
 *
 * Tue Sep 12 09:01:33 EDT 2017
***********************************************************/
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

#define FIRST_FEW_BYTES 8000

/* global, unfortunately */
char *source_blob_buffer;

/* From: https://stackoverflow.com/questions/14002954/c-programming-how-to-read-the-whole-file-contents-into-a-buffer */

/* From: https://stackoverflow.com/questions/1563882/reading-a-file-name-from-piped-command */

/* From: https://stackoverflow.com/questions/6119956/how-to-determine-if-git-handles-a-file-as-binary-or-as-text
*/

/* The key routine in this function is from libc: void *memchr(const void *s, int c, size_t n); */
/* Checks for any occurrence of a zero byte (NUL character) in the first 8000 bytes (or the entire length if shorter). */

int buffer_is_binary(const char *ptr, unsigned long size)
{
  if (FIRST_FEW_BYTES < size)
    size = FIRST_FEW_BYTES;
  return !!memchr(ptr, 0, size);
}

/* Reads the whole file into source_blob_buffer and closes it.
   Returns the number of bytes read, or -1 on failure. */
long fill_buffer(FILE *file_object_pointer) {
  if (fseek(file_object_pointer, 0, SEEK_END) != 0)
    return -1;  /* not seekable, e.g. a pipe */
  long fsize = ftell(file_object_pointer);
  if (fsize < 0)
    return -1;
  fseek(file_object_pointer, 0, SEEK_SET);  /* same as rewind() */
  source_blob_buffer = malloc(fsize + 1);
  if (source_blob_buffer == NULL)
    return -1;
  size_t bytes_read = fread(source_blob_buffer, 1, fsize, file_object_pointer);
  fclose(file_object_pointer);
  source_blob_buffer[bytes_read] = 0;
  return (long)bytes_read;
}

int main(int argc, char *argv[]) {

  /* Use argv[1] directly; copying it into a fixed-size array could overflow. */
  const char *pathname = "(stdin)";
  FILE *file_object_pointer;

  if (argc == 1) {
    file_object_pointer = stdin;
  } else {
    pathname = argv[1];
#ifdef DEBUG
    printf("pathname=%s.\n", pathname);
#endif
    file_object_pointer = fopen(pathname, "rb");
    if (file_object_pointer == NULL) {
      printf("I'm sorry, Dave, I can't do that--");
      printf("open the file '%s', that is.\n", pathname);
      exit(3);
    }
  }
  long fsize = fill_buffer(file_object_pointer);
  if (fsize < 0) {
    printf("Cannot read '%s' (not a regular file?)--sorry.\n", pathname);
    exit(4);
  }
  /* Pass the real byte count so the last byte is checked too
     and an empty file does not yield a negative size. */
  int result = buffer_is_binary(source_blob_buffer, (unsigned long)fsize);

#ifdef DEBUG
  if (result == 1) {
    printf("File '%s' is BINARY; size is %ld bytes.\n", pathname, fsize);
  }
  else {
    printf("File '%s' is NON-BINARY; size is %ld bytes.\n", pathname, fsize);
  }
#endif
  exit(result);
  /* easy check -- 'echo $?' after running */
}
yoder2000 answered Oct 03 '22 10:10


@bonh gave a working answer in a comment:

git diff --numstat 4b825dc642cb6eb9a060e54bf8d69288fbee4904 HEAD -- | grep "^-" | cut -f 3

It shows all files which git interprets as binaries.

Oli Dev answered Oct 03 '22 09:10