I am frustrated that grep fails to find a word like "hello" in my UTF-16 documents.
Can anyone recommend a version of grep that attempts to guess the file encoding and then properly handle it?
You'll definitely want to check out ack.
It supports Unicode encodings, and is basically grep, but better.
If you are on Linux, Unix, etc., you may want to set your LANG environment variable to an encoding that matches your documents.
Check your locale first. Here is what mine is set to by default on my MacBook Pro:
$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
To change it for a single command, say under bash (foo is a placeholder for the locale name you need):
$ LANG="foo" grep 'gotta be found now' file.name
Or, for something a little more permanent that lasts for the rest of the shell session (be careful with this):
$ export LANG="foo"
$ grep 'bar' mitz.vah
Perl has a much more powerful regex syntax than grep, and it supports UTF-8 and UTF-16, but I'm not sure how good it is at guessing the encoding. If you tell it which encoding to use, though, it can read these files without any issues and run regexes over them. You'll have to write yourself a tiny Perl program for that (your own micro-grep implementation in Perl, so to speak), but that isn't too hard. Perl is available for all major operating systems.
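The "tell it which encoding to use" idea can also be sketched without writing any Perl. The snippet below is not from the answers above: it swaps in iconv to decode the file to UTF-8 and then hands the result to plain grep. The function name grep_enc is made up for illustration, and the sketch assumes that iconv is installed and that your terminal locale is UTF-8.

```shell
# grep_enc is a hypothetical helper name, not a standard command.
# Usage: grep_enc ENCODING PATTERN FILE
grep_enc() {
    # Decode FILE from the stated encoding to UTF-8, then let grep match.
    # Assumption: iconv is available and knows the encoding name given.
    iconv -f "$1" -t UTF-8 "$3" | grep -e "$2"
}
```

Used like this:

$ grep_enc UTF-16 'hello' file.name

This does not guess the encoding; you still have to name it yourself, the same as with the Perl approach.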