
Stripping hex bytes with sed - no match

Tags: regex, macos, sed, hex

I have a text file with two non-ASCII bytes (0xFF and 0xFE):

??58832520.3,ABC
348384,DEF

The hex for this file is:

FF FE 35 38 38 33 32 35 32 30 2E 33 2C 41 42 43 0A 33 34 38 33 38 34 2C 44 45 46

It's coincidental that FF and FE happen to be the leading bytes (they exist throughout my file, although seemingly always at the beginning of a line).

I am trying to strip these bytes out with sed, but nothing I do seems to match them.

$ sed 's/[^a-zA-Z0-9\,]//g' test.csv 
??588325203,ABC
348384,DEF

$ sed 's/[a-zA-Z0-9\,]//g' test.csv 
??.

Main question: How do I strip these bytes?
Bonus question: The two regexes above are direct negations of each other, so logically one of them has to filter out these bytes, right? Why does neither regex match the 0xFF and 0xFE bytes?

Update: the direct approach of stripping out a range of hex bytes (suggested by two answers below) seems to strip out the first "legit" byte from each line and leave the bytes I'm trying to get rid of:

$ sed 's/[\x80-\xff]//' test.csv
??8832520.3,ABC
48384,DEF

FF FE 38 38 33 32 35 32 30 2E 33 2C 41 42 43 0A 34 38 33 38 34 2C 44 45 46 0A

Notice the missing "5" and "3" from the beginning of each line, and the new 0A added to the end of the file.

Bigger Update: This problem seems to be system-specific. I observed it on OS X, but the suggestions (including my original sed statements above) work as I expect them to on NetBSD.

A solution: This same task seems easy enough via Perl:

$ perl -pe 's/^\xFF\xFE//' test.csv
58832520.3,ABC
348384,DEF

However, I'll leave this question open since this is only a workaround, and doesn't explain what the problem was with sed.

G__ asked Aug 08 '10 17:08


4 Answers

sed 's/[^ -~]//g'

or, as the other answer implies,

sed 's/[\x80-\xff]//g'

See section 3.9 of the sed info pages, the chapter entitled "Escapes".
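A note on why the \x form may misbehave, offered as a plausible reading rather than a confirmed diagnosis: the \xHH escapes are a GNU sed extension. Assuming a sed that lacks them, POSIX treats a backslash inside a bracket expression as a literal character, so [\x80-\xff] becomes the set containing \, x, 8, f and the range 0 through \ (ASCII 48 to 92). That range covers the digits and uppercase letters, which would account for an ordinary leading character being removed instead of the high bytes. The arithmetic can be checked in the shell; the helper name code is made up for illustration.

```shell
# Hypothetical helper: numeric value of one ASCII character.
# POSIX printf prints the character code when the argument starts with a quote.
code() { printf '%d' "'$1"; }

lo=$(code 0)      # 48, start of the accidental range 0-\
hi=$(code '\')    # 92, end of the accidental range 0-\

# '5' and '3' are the first characters of the two sample lines.
for ch in 5 3; do
  c=$(code "$ch")
  if [ "$c" -ge "$lo" ] && [ "$c" -le "$hi" ]; then
    printf "'%s' (%s) lies inside the literal range %s-%s\n" "$ch" "$c" "$lo" "$hi"
  fi
done
```

Without the g flag only the first match on each line is removed, so each line would lose exactly one leading character.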

Edit for OS X: the native LANG setting is en_US.UTF-8.

Try:

LANG='' sed 's/[^ -~]//g' myfile

This works on an OS X machine here; I'm not entirely sure why it does not work in a UTF-8 locale.
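A likely explanation, stated with some caution: the bytes 0xFF and 0xFE can never occur in valid UTF-8, so in a UTF-8 locale sed may be unable to decode them as characters at all, and then neither a bracket expression nor its negation matches them. In the C locale every byte counts as one character. The sketch below recreates the sample data from the question and uses LC_ALL rather than LANG, on the assumption that LC_ALL or LC_CTYPE might also be set in the environment (LC_ALL takes precedence over both).

```shell
# Sample data from the question: bytes FF FE, then plain ASCII text.
# Octal escapes are used because POSIX printf does not define \xHH.
# [^ -~] matches anything outside the printable ASCII range (space to tilde).
cleaned=$(printf '\377\37658832520.3,ABC\n348384,DEF\n' | LC_ALL=C sed 's/[^ -~]//g')
printf '%s\n' "$cleaned"
```

Note that this pattern also removes tabs and other control characters, not only the high bytes.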

deinst answered Nov 20 '22 05:11


This will strip out the specific byte pair FF FE wherever it occurs (the lines themselves are kept):

sed -e 's/\xff\xfe//g' hexquestion.txt

The reason that your negated regexes aren't working is that [] specifies a character class, and sed interprets a character class according to a particular character set (probably the one your locale selects). The bytes in your file aren't 7-bit ASCII characters: both begin with hex F, so their high bit is set, and sed doesn't know how to deal with them. The solution above doesn't use character classes, so it should be more portable between platforms and character sets.
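One caveat on portability: the \xff notation inside a sed script is a GNU extension, so a sed without it would look for literal text instead. A minimal sketch of a workaround, assuming only POSIX printf and sed, is to let printf produce the raw bytes and splice them into the script. The variable name bom is chosen for illustration, and the ^ anchor reflects the question's remark that the bytes appear at the start of a line.

```shell
# Build the two raw bytes with octal escapes (\377 = 0xFF, \376 = 0xFE).
bom=$(printf '\377\376')

# LC_ALL=C makes sed treat each byte as one character; ^ limits the
# substitution to a pair at the start of a line.
result=$(printf '\377\37658832520.3,ABC\n348384,DEF\n' | LC_ALL=C sed "s/^${bom}//")
printf '%s\n' "$result"
```

Dropping the ^ and adding the g flag would remove the pair anywhere in a line, as in the command above.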

Gary answered Nov 20 '22 04:11


The FF and FE bytes at the beginning of your file are what is called a "byte order mark" (BOM). It can appear at the start of Unicode text streams to indicate the endianness of the text. FF FE indicates little-endian UTF-16.
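To check whether a stream really starts with such a signature, the first two bytes can be inspected directly. This is only a sketch: the helper name first_two_bytes is invented here, head -c is assumed to be available (it is common but not required by POSIX), and only the two UTF-16 signatures are distinguished.

```shell
# Hypothetical helper: print the first two bytes of stdin as lowercase hex.
first_two_bytes() { head -c 2 | od -An -tx1 | tr -d ' \n'; }

sig=$(printf '\377\37658832520.3,ABC\n' | first_two_bytes)

case "$sig" in
  fffe) echo "starts with FF FE (UTF-16 little-endian signature)" ;;
  feff) echo "starts with FE FF (UTF-16 big-endian signature)" ;;
  *)    echo "no UTF-16 signature found" ;;
esac
```

A matching signature does not prove the rest of the data is UTF-16; in the question's hex dump the bytes that follow are single-byte ASCII.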

Here's an excerpt from the FAQ:

Q: How I should deal with BOMs?

A: Here are some guidelines to follow:

  1. A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.
  2. Some protocols allow optional BOMs in the case of untagged text. In those cases,
    • Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything.
    • Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.
  3. Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.
  4. Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used.

References

  • unicode.org/FAQ/UTF BOM

See also

  • Wikipedia/Byte order mark
  • Wikipedia/Endianness

Related questions

  • Why would I use a Unicode Signature Byte-Order-Mark (BOM)?
  • Difference between Big Endian and little Endian Byte order
polygenelubricants answered Nov 20 '22 05:11


To show that this isn't an issue of the Unicode BOM, but an issue of eight-bit versus seven-bit characters and tied to the locale, try this:

Show all the bytes:

$ printf '123 abc\xff\xfe\x7f\x80' | hexdump -C
00000000  31 32 33 20 61 62 63 ff  fe 7f 80                 |123 abc....|

Have sed remove characters that aren't alphanumeric in the user's locale. Notice that the space and 0x7f are removed, but 0xff, 0xfe and 0x80 remain:

$ printf '123 abc\xff\xfe\x7f\x80'|sed 's/[^[:alnum:]]//g' | hexdump -C
00000000  31 32 33 61 62 63 ff fe  80                       |123abc...|

Have sed remove characters that aren't alphanumeric in the C locale. Notice that only "123abc" remains:

$ printf '123 abc\xff\xfe\x7f\x80'|LANG=C sed 's/[^[:alnum:]]//g' | hexdump -C
00000000  31 32 33 61 62 63                                 |123abc|
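The same locale point seems to carry over to other byte-oriented tools. As a sketch of an alternative, assuming the goal is simply to drop every byte with the high bit set, tr can delete the range 0x80 to 0xff when run in the C locale. The sample bytes match the ones above but are written in octal, since POSIX printf does not define \xHH.

```shell
# \377 = 0xff, \376 = 0xfe, \177 = 0x7f, \200 = 0x80
# tr -d removes every byte from 0x80 through 0xff; the remaining bytes are
# shown as hex so that nothing depends on how a terminal renders them.
kept=$(printf '123 abc\377\376\177\200' | LC_ALL=C tr -d '\200-\377' | od -An -tx1 | tr -d ' \n')
printf '%s\n' "$kept"
```

Unlike the [^[:alnum:]] pattern, this keeps the space and 0x7f, because both are below 0x80.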
Dennis Williamson answered Nov 20 '22 05:11