I'm trying to do a word diff using git diff --word-diff-regex [1]. Basically, any match of this regex is considered a word. My document is a tab-delimited text file, and each column may contain space characters. So I tried a negated character class, --word-diff-regex='[^\t]+', which should match one or more characters that are not a tab. However, it doesn't work: the regex seems to match everything on the line.
For example, with the text 20<\t>Hello, World diffed against 20<\t>Hello, Diff (where <\t> denotes a tab character), git should show that the difference is in the whole "Hello, {World,Diff}", not in "World" or "Diff" by itself. Using [^\t]+, however, causes git to show that the entire line is a single word that changed.
Upon further research, it seems like git internally uses POSIX's regex function. And in POSIX's infinite wisdom, it seems like I "can’t escape anything in character classes" as "[t]hey treat backslashes in character classes as literal characters" [2].
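The behaviour described above can be reproduced outside git. The following is a minimal sketch, assuming a grep that supports -E and -o (GNU and BSD grep both do); grep stands in for git's regex engine here, which is an assumption for illustration rather than a statement about git's internals.

```shell
# In a POSIX bracket expression, \t is two literal characters: '\' and 't'.
# So [^\t]+ means "one or more characters that are neither a backslash nor a t".
line="$(printf '20\tHello, World')"

# The sample line contains no backslash and no 't', so one match spans the tab too.
posix_match="$(printf '%s\n' "$line" | grep -oE '[^\t]+')"

# With an actual tab byte inside the brackets, matching stops at the tab.
tab="$(printf '\t')"
literal_match="$(printf '%s\n' "$line" | grep -oE "[^${tab}]+")"

printf 'escaped \\t in brackets:\n%s\n' "$posix_match"
printf 'literal tab in brackets:\n%s\n' "$literal_match"
```

The first pattern yields the whole line as a single match, which is consistent with git reporting the entire line as one changed word.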
Inspired by another StackOverflow answer [3], I currently work around this with a negated shorthand character class, (\S| )+. This matches any non-whitespace character plus the space character itself. It does let me do a word diff in my case, but my question still stands, as this regex will not match other whitespace characters.
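To make that limitation concrete, here is a small sketch that counts the "words" each pattern finds in a line whose first column contains a carriage return. It assumes GNU grep, since \S is a GNU extension rather than POSIX ERE, and the sample line is made up for illustration.

```shell
# GNU grep assumed: \S is a GNU extension, not part of POSIX ERE.
tab="$(printf '\t')"
line="$(printf 'a\rb\tc')"   # hypothetical row: first column contains a carriage return

# (\S| )+ stops at every whitespace character other than a plain space.
workaround_count="$(printf '%s\n' "$line" | grep -oE '(\S| )+' | wc -l | tr -d ' ')"

# A bracket expression holding an actual tab byte only stops at tabs.
literal_count="$(printf '%s\n' "$line" | grep -oE "[^${tab}]+" | wc -l | tr -d ' ')"

echo "workaround: $workaround_count words, literal tab: $literal_count words"
```

The workaround splits the first column at the carriage return, while the tab-only pattern keeps each column whole.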
So, the question is: how can I match "everything except a tab" in a POSIX (extended) regex (or with a GNU extension), with or without a character class, without spelling out every other character in the whitespace class? For example, I don't want (\S| |\n|\r|<other whitespace characters>)+.
[1] https://git-scm.com/docs/git-diff#Documentation/git-diff.txt---word-diff-regexltregexgt
[2] https://www.regular-expressions.info/charclass.html, section "Metacharacters Inside Character Classes"
[3] https://stackoverflow.com/a/3469155/9161044
It looks like --word-diff-regex behaves a bit like grep, and does not understand escape sequences "natively".
Some ways to make it work:

- git diff has an (undocumented?) -P | --perl-regexp option:
  git diff -P --word-diff-regex='[^\t]+'
- put a literal <tab> character in the regex, in one of these ways:
  - use $'...' to apply ANSI-C Quoting (bash reference):
    git diff --word-diff-regex=$'[^\t]+'
  - type ctrl+V followed by <tab> to insert a literal <tab> character in your command line:
    git diff --word-diff-regex='[^<ctrl+V <tab>>]+'
  - use $(...) and a command that prints a literal <tab> (e.g. printf):
    git diff --word-diff-regex="[^$(printf '\t')]+"
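As a quick check of the literal-tab approach, the sketch below rebuilds the example from the question in a temporary directory and compares the two files directly. The file names old.txt and new.txt are hypothetical, and --no-index and --word-diff=plain are used here only so the snippet runs outside a repository with predictable output.

```shell
# Recreate the two versions of the tab-delimited line from the question.
tmp="$(mktemp -d)"
printf '20\tHello, World\n' > "$tmp/old.txt"
printf '20\tHello, Diff\n'  > "$tmp/new.txt"

# Build the regex around a real tab byte instead of the two characters '\' and 't'.
tab="$(printf '\t')"

# --no-index compares two plain files; exit status 1 only means "files differ",
# so it is ignored with || true.
out="$(git diff --no-index --word-diff=plain \
        --word-diff-regex="[^${tab}]+" \
        "$tmp/old.txt" "$tmp/new.txt" || true)"

printf '%s\n' "$out"
rm -rf "$tmp"
```

If the regex behaves as intended, the second column should be reported as a single changed word, [-Hello, World-]{+Hello, Diff+}, rather than only World and Diff.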