
An encoding-savvy grep replacement?

I am frustrated that grep fails to find a word like "hello" in my UTF-16 documents.

Can anyone recommend a version of grep that attempts to guess the file encoding and then properly handle it?

fish asked Mar 05 '09 00:03


2 Answers

ack as a Perl-based grep replacement

You'll definitely want to check out ack.

It is written in Perl, supports Unicode encodings, and works basically like grep, only better.

Try a matching Unicode locale with grep

If you are on Linux, Unix, etc., you may want to set your LANG environment variable to a locale whose encoding matches your documents.

Check your locale first. Here is what mine is set to by default on my MacBook Pro:

 $ locale 
 LANG="en_US.UTF-8"
 LC_COLLATE="en_US.UTF-8"
 LC_CTYPE="en_US.UTF-8"
 LC_MESSAGES="en_US.UTF-8"
 LC_MONETARY="en_US.UTF-8"
 LC_NUMERIC="en_US.UTF-8"
 LC_TIME="en_US.UTF-8" 
 LC_ALL=

For example, under bash, for a single command (replace "foo" with the locale you need):

$ LANG="foo" grep 'gotta be found now' file.name

Or something a little more permanent (be careful with this, since it changes the locale for every later command in the shell session):

$ export LANG="foo"
$ grep 'bar' mitz.vah
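
Before picking a value for LANG, it may help to see which locales the system actually provides, since a name that is not installed will not do anything useful. The following is a minimal sketch, assuming the standard `locale -a` utility is available; the helper names `list_utf8_locales` and `have_locale` are made up for illustration and are not part of any tool.

```shell
# Hypothetical helper: list installed locales whose name mentions UTF-8.
# The spelling varies between systems (UTF-8, utf8), hence the pattern.
list_utf8_locales() {
  locale -a | grep -iE 'utf-?8'
}

# Hypothetical helper: succeed only if the exact locale name is installed.
# -x matches the whole line, -F treats the name as a fixed string.
have_locale() {
  locale -a | grep -xF "$1" >/dev/null
}
```

A possible use is `have_locale en_US.UTF-8 && LANG=en_US.UTF-8 grep 'bar' mitz.vah`, which only overrides LANG when that locale exists on the machine.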

popcnt answered Nov 15 '22 02:11


Perl has a much more powerful regex syntax than grep, and it supports both UTF-8 and UTF-16, though I'm not sure how good it is at guessing the encoding. If you tell it which encoding to use, however, it can read these files without any issues and run regexes over them. You'll have to write yourself a tiny Perl program for that (your own micro-grep in Perl, so to speak), but that isn't too hard. Perl is available for all major operating systems.

Mecki answered Nov 15 '22 02:11