Remove hex values from data.table in R

I have a large data table called Site (300,000 rows, 100 columns). Throughout the data table are hex values, for example: "\x96" or "\xc9". I want all of these values to be removed. They follow the format of "\x" followed by two characters (numbers or letters).

Here is the function that replaces values. I can do each individually as shown below, but I want a general command that will get rid of all hex values in the table.

Site<- as.data.table(apply(Site, 2, function(x) gsub("\x8e", "", x)))

I tried to use regular expression syntax, "\x..", but got this error:

Error: '\x' used without hex digits in character string starting ""\x"

How can I remove these hex values? Any help is greatly appreciated!

Here is a reproducible example:

dt <- data.table(A = c("Th\xa1is","is","the","first\x12"), B = c("This","\x45is","the","second"))

I want "\xa1", "\x12", and "\x45" removed so the table looks like:

       A      B
1:  This   This
2:    is     is
3:   the    the
4: first second

Michael Berk asked Dec 01 '17 19:12


1 Answer

You are confused. And so am I. And so are most of us. With characters, their encoding and their display.

The relevant sections of the help are hard to locate, but ?Quotes gives us a piece of the puzzle: "\x" must be followed by one or two hexadecimal digits (0-9, a-f or A-F). "\x" on its own, or followed by anything else, doesn't even make sense to the R parser, which is where your error comes from.
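As a small sketch of that parser behaviour: wrapping the offending literal in parse(text = ...) inside tryCatch() is only my way of triggering the error without stopping the script, it is not something from the question.

```r
# The text handed to the parser is the literal "\x.." including its quotes.
# R rejects it while parsing, before gsub() or any regex engine is involved.
res <- tryCatch(
  parse(text = '"\\x.."'),
  error = function(e) conditionMessage(e)
)
is.character(res)   # TRUE: parsing failed and we captured the error message

# Doubling the backslash gives an ordinary string that the parser accepts:
# a literal backslash, "x", and two dots.
ok <- "\\x.."
nchar(ok)           # 4
```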

Between "\x01" and "\x7f" you'll find the "traditional" ASCII table. identical("\x30", "0"), identical("\x39", "9"), identical("\x41", "A"), identical("\x5A", "Z"), for instance, are all TRUE.
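To make this concrete for the example in the question, here is a quick check (base R only) that "\x45" and "\x12" are plain ASCII characters by the time any function sees them:

```r
# "\x45" is turned into the single character "E" when the code is parsed,
# so "\x45is" and "Eis" are the same string; no "hex value" is left to strip.
s <- "\x45is"
identical(s, "Eis")   # TRUE
charToRaw(s)          # 45 69 73
nchar(s)              # 3

# "\x12" is also in the ASCII range: a single control character (code 18).
utf8ToInt("\x12")     # 18
```

So the "\x45is" entry in column B is really just "Eis", and a pattern that targets non-ASCII bytes will leave both it and "\x12" alone.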

Then in the 128 other values allowed by this notation, between "\x80" and "\xff", you'll find the rest of the so-called "Latin 1" table.

Then there is Unicode for all other characters, and the ubiquitous UTF-8 encoding.
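A minimal sketch of how the same byte looks under the two encodings, assuming the data really is Latin-1 (the question doesn't say, so treat that as a guess):

```r
x <- "Th\xa1is"                       # byte 0xa1 sits between "h" and "i"
charToRaw(x)                          # 54 68 a1 69 73

# Assumption: the byte is meant as Latin-1, where 0xa1 is the inverted
# exclamation mark. In UTF-8 that character takes two bytes, c2 a1.
y <- iconv(x, from = "latin1", to = "UTF-8")
charToRaw(y)                          # 54 68 c2 a1 69 73
```

If the "hex values" are just Latin-1 text shown in a UTF-8 session, converting like this may be preferable to deleting them.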

So when you say "remove all hex values", one can only assume those between "\x80" and "\xff" are the characters that trouble you. Maybe there's a problem with the way those characters are displayed. Or an encoding problem. Or some of them are just control characters. But let's just remove them all as you asked:

dt[, lapply(.SD, gsub, pattern = "[\x80-\xff]", replacement = "")]

should do. Or if you want to be even more radical, and remove everything that is not ASCII: dt[, lapply(.SD, gsub, pattern = "[^\x01-\x7f]", replacement = "")].
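How regex character classes treat bytes above "\x7f" can depend on the locale and the regex engine, so here is an alternative sketch that filters the raw bytes directly. strip_bytes and its drop_control argument are hypothetical names of mine, not part of data.table or base R:

```r
# Hypothetical helper: drop unwanted bytes from each string by working on
# the raw bytes, so no regex engine or locale setting is involved.
strip_bytes <- function(x, drop_control = FALSE) {
  vapply(x, function(s) {
    if (is.na(s)) return(NA_character_)
    b <- as.integer(charToRaw(s))
    keep <- b < 128L                                        # ASCII only
    # Optionally keep printable ASCII only (this also drops tabs and newlines)
    if (drop_control) keep <- keep & b >= 32L & b != 127L
    rawToChar(as.raw(b[keep]))
  }, character(1), USE.NAMES = FALSE)
}

A <- c("Th\xa1is", "is", "the", "first\x12")
strip_bytes(A)                        # "\x12" is ASCII, so it survives here
strip_bytes(A, drop_control = TRUE)   # "This" "is" "the" "first"
```

On a data.table whose columns are all character, this could presumably be applied like the gsub call above, i.e. dt[, lapply(.SD, strip_bytes, drop_control = TRUE)]. Note that it drops every byte of a multi-byte UTF-8 character too, and that "\x45is" stays "Eis" whatever you do.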

Also noteworthy: R (unlike Python) doesn't have raw strings, and I suspect that's where the initial confusion in the comments stems from. Where in Python you can write either "\\x" or r"\x" to get an actual backslash in a string, in R you can't. Your only option is to escape it: "\\". In the regex101 example given, there is Th\xa1is in the test string, that is, a literal backslash followed by xa1. This is different from what you have in R when you type "Th\xa1is", where the escape becomes a single byte.
(Edit: Since R version 4.0, we now have raw strings: r"(Th\xa1is)" gives [1] "Th\\xa1is")
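If the data actually contains the literal text (backslash, x, two hex digits), as in the regex101 test string, something like the following sketch should remove it. The pattern is my own and untested against the real Site table:

```r
escaped <- "Th\\xa1is"            # literal backslash, "x", "a", "1": 8 characters
nchar(escaped)                    # 8
identical(escaped, r"(Th\xa1is)") # TRUE (raw-string syntax needs R >= 4.0)

# The regex \\x[0-9A-Fa-f]{2} matches a literal backslash, "x", and two hex
# digits; each backslash is doubled again inside the R string.
gsub("\\\\x[0-9A-Fa-f]{2}", "", escaped)   # "This"
```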

Aurèle answered Sep 19 '22 12:09