
How to do an accent insensitive grep?

Is there a way to do an accent-insensitive search using grep, preferably keeping the --color option? By this I mean that grep --secret-accent-insensitive-option aei would match àei but also äēì and possibly æi.

I know I can use iconv -t ASCII//TRANSLIT to remove accents from a text, but I don't see how I can use it to match, since the text is transformed (it would work for grep -c or -l).

dargaud asked Jan 05 '14 19:01



1 Answer

You are looking for a whole bunch of POSIX regex equivalence classes:

14.3.6.2 Equivalence Class Operators ([= … =])

    Regex recognizes equivalence class expressions inside lists. A equivalence class expression is a set of collating elements which all belong to the same equivalence class. You form an equivalence class expression by putting a collating element between an open-equivalence-class operator and a close-equivalence-class operator. [= represents the open-equivalence-class operator and =] represents the close-equivalence-class operator. For example, if a and A were an equivalence class, then both [[=a=]] and [[=A=]] would match both a and A. If the collating element in an equivalence class expression isn’t part of an equivalence class, then the matcher considers the equivalence class expression to be a collating symbol.

I'm using carets on the next line to indicate what is actually colored. I also tweaked the test string to illustrate a point about case.

$ echo "I match àei but also äēì and possibly æi" | grep '[[=a=]][[=e=]][[=i=]]'
I match àei but also äēì and possibly æi
        ^^^          ^^^

This matches all words like aei. The fact that it does not match æi should stand as a reminder that you're beholden to whatever mapping exists in the regex library you're using (presumably gnulib, which is what I linked and quoted), though I figure it's quite likely that digraphs are beyond the reach of even the best equivalence class map.

You should not expect equivalence classes to be portable: they are arcane, and what each class contains depends on the regex library and on the locale's collation data.
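Where equivalence classes are unavailable, they can be approximated by hand. The following is only a minimal sketch, assuming Python 3 and assuming that "equivalent" can be read as "same base letter after Unicode canonical decomposition (NFD)". That is a swapped-in technique, not what a POSIX regex library does with its locale data. The helper names (base_letter, equiv_class, accent_insensitive) and the limit parameter are hypothetical.

```python
import re
import unicodedata


def base_letter(ch):
    """First code point of ch's canonical decomposition, e.g. 'à' -> 'a'."""
    return unicodedata.normalize("NFD", ch)[0]


def equiv_class(letter, limit=0x250):
    """Regex set of every character below `limit` whose base letter is `letter`.

    Case is ignored (an assumption of this sketch). The default limit is
    also an assumption: it only covers the Latin blocks.
    """
    members = [chr(cp) for cp in range(limit)
               if base_letter(chr(cp)).lower() == letter.lower()]
    if not members:          # nothing found below the limit: match the letter itself
        members = [letter]
    return "[" + "".join(re.escape(m) for m in members) + "]"


def accent_insensitive(word):
    """Compile a pattern where each character of `word` matches its accented variants."""
    return re.compile("".join(equiv_class(c) for c in word))


# NFC makes sure the sample uses precomposed characters (one code point each).
text = unicodedata.normalize("NFC", "I match àei but also äēì and possibly æi")
print(accent_insensitive("aei").findall(text))  # ['àei', 'äēì']
```

As with grep, æi is not matched, since æ has no canonical decomposition into a plus something else.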


Taking this a step further, if you want ONLY accented characters, things get far more complicated. Here I've changed your request for aei into [aei].

$ echo "I match àei but also äēì and possibly æi" | grep '[[=a=][=e=][=i=]]'
I match àei but also äēì and possibly æi
^  ^    ^^^     ^    ^^^ ^       ^     ^

Cleaning this up to avoid non-accent matches would require both equivalence classes and look-ahead/look-behind, and while BRE (basic POSIX regex) and ERE (extended POSIX regex) support the former, they both lack the latter. Libpcre (the C library for perl-compatible regex that grep -P and most others use) and perl support the latter but lack the former:

Try #1: grep with libpcre: failure

$ echo "I match àei but also äēì and possibly æi" \
    | grep -P '[[=a=][=e=][=i=]](?<![aei])'
grep: POSIX collating elements are not supported

Try #2: perl itself: failure

$ echo "I match àei but also äēì and possibly æi" \
    | perl -ne 'print if /[[=a=][=e=][=i=]](?<![aei])/'
POSIX syntax [= =] is reserved for future extensions in regex; marked by <-- HERE in m/[[=a=][=e= <-- HERE ][=i=]](?<![aei])/ at -e line 1.

Try #3: python (which has its own Perl-style regex engine rather than libpcre): (silent) failure

$ echo "I match àei but also äēì and possibly æi" \
    | python -c 'import re, sys; print(re.findall(r"[[=a=][=e=][=i=]]", sys.stdin.read()))'
[]
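
The failure is silent because Python's re does not reject the syntax; it reads it as ordinary bracket sets. A small sketch, assuming Python 3 (the sample strings are mine, chosen only to show how the pattern is parsed):

```python
import re
import warnings

with warnings.catch_warnings():
    # Newer Python versions may warn about a "possible nested set" here;
    # silence that for the demo.
    warnings.simplefilter("ignore", FutureWarning)
    pattern = re.compile(r"[[=a=][=e=][=i=]]")

# Python reads the pattern as three plain sets followed by a literal "]":
#   [[=a=]  -> one of "[", "=", "a"
#   [=e=]   -> one of "=", "e"
#   [=i=]   -> one of "=", "i"
#   ]       -> a literal closing bracket
print(pattern.findall("aei]"))  # ['aei]']
print(pattern.findall("aei"))   # []
```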

Wow, a regex feature that PCRE, python, and even perl don't support! There aren't too many of those. (Never mind that perl's complaint points at the second equivalence class; it still complains given just /[[=a=]]/.) Take this as further evidence that equivalence classes are arcane.

In fact, it appears that there aren't any PCRE libraries capable of equivalence classes; the section on equivalence classes at regular-expressions.info claims only the regex libraries implementing the POSIX standard actually have this support. GNU grep gets closest since it can do BRE, ERE, and PCRE, but it can't combine them.

So we'll do it in two parts.

Try #4: disgusting trickery: success

$ echo "I match àei but also äēì and possibly æi" \
    | grep --color=always '[[=a=][=e=][=i=]]' \
    | perl -pne "s/\e\[[0-9;]*m\e\[K(?i)([aei])/\$1/g"
I match àei but also äēì and possibly æi
        ^            ^^^

Code walk:

  • grep forces color on so that perl can key on the color codes to find the matches
  • \e\[[0-9;]*m\e\[K matches any SGR color code followed by the erase-to-end-of-line sequence, which is what grep emits at the start of each match
  • perl's s/// command matches that full color code and then the non-accented letter that we want to remove from the final results. It replaces all of that with the (uncolored) letter
  • Anything after (?i) in the perl regex is case-insensitive since [[=i=]] matches I
  • perl -p prints each line of its input upon completion of its -e execution
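
To see what the substitution does without running grep, here is a minimal Python 3 sketch of the same pattern. The START and END strings imitate what GNU grep emits around each match with its default color; those sample sequences and the helper name uncolor_plain are my assumptions for illustration.

```python
import re

START = "\x1b[01;31m\x1b[K"  # assumed match-start sequence (GNU grep default color)
END = "\x1b[m\x1b[K"         # assumed match-end sequence

# Python spelling of the perl pattern: any SGR code, erase-to-end-of-line,
# then an unaccented a/e/i in either case (captured so it can be kept).
PLAIN = re.compile(r"\x1b\[[0-9;]*m\x1b\[K((?i:[aei]))")


def uncolor_plain(line):
    """Remove the color-start sequence in front of unaccented a/e/i."""
    return PLAIN.sub(r"\1", line)


# Imitation of grep coloring each character of "àei" as its own match.
colored = "".join(START + ch + END for ch in "\u00e0ei")
cleaned = uncolor_plain(colored)

# Only the accented letter keeps its color-start sequence; the leftover
# END sequences are harmless resets.
print(cleaned == START + "\u00e0" + END + "e" + END + "i" + END)  # True
```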

For more on BRE vs ERE vs PCRE and others, see this StackExchange regex post or the POSIX regexps at regular-expressions.info. For more on per-language differences (including libpcre vs python's re vs perl), look to tools at regular-expressions.info.


2019 update: GNU grep now uses $GREP_COLORS, which can look like ms=1;41 and takes priority over the older $GREP_COLOR (which would look like 1;41). Extracting the color from it is harder, and juggling the two variables is awkward, so I modified the perl code in try #4 to seek out any SGR color code instead of keying on just the color that grep would add. See revision 2 of this answer for the previous code.
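
To illustrate why extracting the color is fiddly, here is a minimal sketch of the lookup, assuming Python 3. It assumes $GREP_COLORS is a colon-separated list of key=value pairs in which mt= or ms= carries the match color, and that 01;31 is the fallback. The function name grep_match_color is hypothetical, and GNU grep's real parsing handles more cases than this.

```python
import os


def grep_match_color(env=None):
    """Guess the SGR parameters grep would use for a match."""
    env = os.environ if env is None else env
    # The older variable is only a fallback; the documented default is 01;31.
    color = env.get("GREP_COLOR") or "01;31"
    for item in env.get("GREP_COLORS", "").split(":"):
        key, sep, value = item.partition("=")
        if sep and key in ("mt", "ms"):  # later entries override earlier ones
            color = value
    return color


print(grep_match_color({"GREP_COLORS": "ms=1;41", "GREP_COLOR": "1;32"}))  # 1;41
```

Matching any SGR code, as try #4 does, avoids this lookup altogether.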

I cannot currently verify whether BSD grep, which is used by Apple Mac OS X, supports POSIX regex equivalence classes.
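
If grep's equivalence classes cannot be relied on, the "accented characters only" search can also be approximated in one step. This is again just a sketch under assumptions: Python 3, "accented" taken to mean "decomposes under Unicode NFD into a base letter plus combining marks" (a swapped-in technique, not POSIX equivalence classes), only code points below a hypothetical limit are scanned, and the names accented_only and highlight are made up for illustration.

```python
import re
import unicodedata

RED, RESET = "\x1b[01;31m", "\x1b[m"  # hard-coded SGR codes (bold red, reset)


def accented_only(letters, limit=0x250):
    """Regex set of accented variants of `letters`, excluding the plain letters."""
    wanted = set(letters.lower())
    members = []
    for cp in range(limit):
        ch = chr(cp)
        decomposed = unicodedata.normalize("NFD", ch)
        # More than one code point means base letter + combining mark(s).
        if len(decomposed) > 1 and decomposed[0].lower() in wanted:
            members.append(ch)
    return "[" + "".join(members) + "]"


def highlight(pattern, line):
    """Wrap every match of `pattern` in the color codes."""
    return re.sub(pattern, lambda m: RED + m.group(0) + RESET, line)


# NFC makes sure the sample uses precomposed characters (one code point each).
text = unicodedata.normalize("NFC", "I match àei but also äēì and possibly æi")
print(highlight(accented_only("aei"), text))
```

On a terminal this colors à, ä, ē and ì and leaves the plain letters and æ alone.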

Adam Katz answered Oct 26 '22 12:10
