
How to match everything except a tab (for git diff --word-diff-regex)

Tags: git, regex, posix

I'm trying to do a word diff using git diff --word-diff-regex [1]. Basically, every match of this regex is considered a word. My document is a tab-delimited text file, and each column may contain space characters. So I tried a negated character class, --word-diff-regex='[^\t]+', which should match everything except a tab, one or more times. However, it doesn't work: the regex seems to match everything on the line.

For example, when diffing the text 20<\t>Hello, World against 20<\t>Hello, Diff (where <\t> denotes a tab character), git should show that the difference is the whole "Hello, {World,Diff}", not just "World" or "Diff" by itself. Using [^\t]+, however, causes git to show the entire line as a single changed word.

Upon further research, it seems like git internally uses POSIX's regex function. And in POSIX's infinite wisdom, it seems like I "can’t escape anything in character classes" as "[t]hey treat backslashes in character classes as literal characters" [2].
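To illustrate the behaviour described above, here is a minimal sketch using grep -E, on the assumption that it follows the same POSIX ERE bracket-expression rules as the regex functions git uses. The variable names (tab, line, escaped_matches, literal_matches) are only illustrative, and -o is a common grep extension (GNU and BSD) rather than part of POSIX.

```shell
# Build the sample line "20<TAB>Hello, World".
# Note that it contains neither a backslash nor a lowercase "t".
tab=$(printf '\t')
line="20${tab}Hello, World"

# In a POSIX bracket expression, '[^\t]' excludes a backslash and the
# letter "t", not a tab, so the tab is matched like any other character.
escaped_matches=$(printf '%s\n' "$line" | grep -oE '[^\t]+' | wc -l | tr -d ' ')

# With a real tab character inside the brackets, the tab is excluded
# and each column becomes its own match.
literal_matches=$(printf '%s\n' "$line" | grep -oE "[^${tab}]+" | wc -l | tr -d ' ')

echo "escaped: $escaped_matches match(es), literal tab: $literal_matches match(es)"
```

If the assumption holds, the first count covers the whole line as one match, while the second splits the line at the tab.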

Inspired by another StackOverflow answer [3], I currently work around this by using "Negated Shorthand Character Class", (\S| )+. This matches anything non-whitespace, plus the whitespace character itself. This actually allows me to do word-diff in my case, but my question still remains, as this regex will not match other whitespace characters.

So the question is: how can I match "everything except a tab" in a POSIX (extended) regex (or a GNU extension), with or without a character class, without spelling out all the other characters in the whitespace class? For example, I don't want (\S| |\n|\r|<other whitespace characters>)+.

[1] https://git-scm.com/docs/git-diff#Documentation/git-diff.txt---word-diff-regexltregexgt

[2] https://www.regular-expressions.info/charclass.html, section "Metacharacters Inside Character Classes"

[3] https://stackoverflow.com/a/3469155/9161044

Asked by Ratchanan Srirattanamet, Nov 06 '22 03:11


1 Answer

It looks like --word-diff-regex behaves a bit like grep and does not understand escape sequences "natively".

Some ways to make it work:

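  • use a Perl regexp: git diff accepts a -P | --perl-regexp option, but it is not documented for git diff, so check that it actually changes how --word-diff-regex is interpreted in your Git version:
    git diff -P --word-diff-regex='[^\t]+'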
  • tell your shell to insert a <tab> character:
    • (works in bash) use $'...' to apply ANSI-C quoting (see the bash reference):
      git diff --word-diff-regex=$'[^\t]+'
    • type Ctrl+V followed by <tab> to insert a literal <tab> character on your command line:
      git diff --word-diff-regex='[^<ctrl+V><tab>]+'
    • use $(...) and a command that prints a literal <tab> (e.g. printf):
      git diff --word-diff-regex="[^$(printf '\t')]+"
    • ...
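
As a rough end-to-end check of the printf variant above, the following sketch compares two throwaway files with git diff --no-index, so no repository or commit is needed. The file names (old.tsv, new.tsv) and variable names are made up for illustration; the sample content is the one from the question.

```shell
# A literal tab, produced by printf so that it works in any POSIX shell.
tab=$(printf '\t')

# Two one-line, tab-delimited files that differ only in the second column.
workdir=$(mktemp -d)
printf '20\tHello, World\n' > "$workdir/old.tsv"
printf '20\tHello, Diff\n'  > "$workdir/new.tsv"

# --no-index compares two plain files outside a repository; git exits
# non-zero when they differ, hence the "|| true".
word_diff=$(git diff --no-index --no-color --word-diff=plain \
    --word-diff-regex="[^${tab}]+" \
    "$workdir/old.tsv" "$workdir/new.tsv" || true)

# Show the result, then clean up the temporary files.
printf '%s\n' "$word_diff"
rm -rf "$workdir"
```

With the tab excluded from the word regex, the whole second column should be reported as one changed word instead of the entire line.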

Sources:

  • grep find lines that contains " \t"
  • How to grep for tabs without using literal tabs and why does \t not work? (on AskUbuntu)
Answered by LeGEC, Nov 14 '22 13:11