Standard grep
/pcregrep
etc. can conveniently be used with binary files for ASCII or UTF8 data - is there a simple way to make them try UTF16 too (preferably simultaneously, but instead will do)?
Data I'm trying to get is all ASCII anyway (references in libraries etc.), it just doesn't get found as sometimes there's 00 between any two characters, and sometimes there isn't.
I don't see any way to get it done semantically, but these 00s should do the trick, except I cannot easily use them on command line.
UTF-16 (16- bit Unicode Transformation Format) is a standard method of encoding Unicode character data. Part of the Unicode Standard version 3.0 (and higher-numbered versions), UTF-16 has the capacity to encode all currently defined Unicode characters.
UTF-16 is an encoding of Unicode in which each character is composed of either one or two 16-bit elements. Unicode was originally designed as a pure 16-bit encoding, aimed at representing all modern scripts.
UTF-8/16/32 are simply different ways to encode this. In brief, UTF-32 uses 32-bit values for each character. That allows them to use a fixed-width code for every character. UTF-16 uses 16-bit by default, but that only gives you 65k possible characters, which is nowhere near enough for the full Unicode set.
UTF-16 allows all of the basic multilingual plane (BMP) to be represented as single code units. Unicode code points beyond U+FFFF are represented by surrogate pairs. The interesting thing is that Java and Windows (and other systems that use UTF-16) all operate at the code unit level, not the Unicode code point level.
The easiest way is to just convert the text file to utf-8 and pipe that to grep:
iconv -f utf-16 -t utf-8 file.txt | grep query
I tried to do the opposite (convert my query to utf-16) but it seems as though grep doesn't like that. I think it might have to do with endianness, but I'm not sure.
It seems as though grep will convert a query that is utf-16 to utf-8/ascii. Here is what I tried:
grep `echo -n query | iconv -f utf-8 -t utf-16 | sed 's/..//'` test.txt
If test.txt is a utf-16 file this won't work, but it does work if test.txt is ascii. I can only conclude that grep is converting my query to ascii.
EDIT: Here's a really really crazy one that kind of works but doesn't give you very much useful info:
hexdump -e '/1 "%02x"' test.txt | grep -P `echo -n Test | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "%02x"'`
How does it work? Well it converts your file to hex (without any extra formatting that hexdump usually applies). It pipes that into grep. Grep is using a query that is constructed by echoing your query (without a newline) into iconv which converts it to utf-16. This is then piped into sed to remove the BOM (the first two bytes of a utf-16 file used to determine endianness). This is then piped into hexdump so that the query and the input are the same.
Unfortunately I think this will end up printing out the ENTIRE file if there is a single match. Also this won't work if the utf-16 in your binary file is stored in a different endianness than your machine.
EDIT2: Got it!!!!
grep -P `echo -n "Test" | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "x%02x"' | sed 's/x/\\\\x/g'` test.txt
This searches for the hex version of the string Test
(in utf-16) in the file test.txt
I found the below solution worked best for me, from https://www.splitbits.com/2015/11/11/tip-grep-and-unicode/
Grep does not play well with Unicode, but it can be worked around. For example, to find,
Some Search Term
in a UTF-16 file, use a regular expression to ignore the first byte in each character,
S.o.m.e. .S.e.a.r.c.h. .T.e.r.m
Also, tell grep to treat the file as text, using '-a', the final command looks like this,
grep -a 'S.o.m.e. .S.e.a.r.c.h. .T.e.r.m' utf-16-file.txt
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With