This is the error that I receive when I try to run tolower() on a character vector from a file that cannot be changed (at least, not manually; it is too large).
Error in tolower(m) : invalid multibyte string X
The problem seems to be French company names containing the É character, although I have not checked all of them (that is also not feasible to do manually).
It's strange, because I expected encoding issues to be identified during read.csv(), rather than during operations after the fact.
Is there a quick way to remove these multibyte strings? Or, perhaps a way to identify and convert? Or even just ignore them entirely?
If you try to run the code below and receive error messages such as "invalid multibyte string", this indicates a character encoding issue that you will most likely need to resolve using one of the imperfect steps above.
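A multibyte string is one in which at least some characters take more than one byte to store (typically a Unicode encoding such as UTF-8). The error means R found a byte sequence that is not valid in the encoding it assumed.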
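To find which elements are affected without going through them by hand, something like the following might help. It is only a minimal sketch: the vector x is a made-up sample (not the asker's data), it assumes the rest of the data is meant to be UTF-8, and validUTF8() needs R 3.3.0 or later.

```r
# Made-up sample: "\xc9" and "\xe9" are É and é in Windows-1252,
# but on their own they are not valid UTF-8 byte sequences.
x <- c("Acme Ltd", "\xc9lectricit\xe9 SA")

# TRUE for every element that is NOT well-formed UTF-8
bad <- !validUTF8(x)

# Positions of the problematic elements
which(bad)
```

Once you know which rows are affected, you can decide whether to convert them, drop them, or leave them alone.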
Here's how I solved my problem:
First, I opened the raw data in a text editor (Geany, in this case), clicked Properties and identified the encoding.
After that, I used the iconv() function.
x <- iconv(x,"WINDOWS-1252","UTF-8")
To be more specific, I did this for every character column of the data.frame from the imported CSV. It is important to note that I set stringsAsFactors=FALSE in my read.csv() call.
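# Convert every character column; lapply() keeps this working even if there is only one such column
chr <- sapply(dat, is.character)
dat[chr] <- lapply(dat[chr], iconv, from = "WINDOWS-1252", to = "UTF-8")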
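For anyone who wants to try this on something small first, here is a rough sketch on a made-up vector. The source encoding (Windows-1252) is an assumption carried over from the answer above; yours may differ. The second option is for when you cannot work out the source encoding and just want the offending bytes gone. It is lossy, and how iconv() behaves when converting from and to the same encoding can vary by platform.

```r
# Made-up sample: bytes 0xC9 and 0xE9 are É and é in Windows-1252
x <- c("Acme Ltd", "\xc9lectricit\xe9 SA")

# Option 1: convert from the (assumed) source encoding to UTF-8
y <- iconv(x, from = "WINDOWS-1252", to = "UTF-8")
validUTF8(y)   # check that the converted strings are well-formed
tolower(y)     # should no longer raise "invalid multibyte string"

# Option 2: source encoding unknown, so drop any bytes that are not valid UTF-8
# (lossy: the accented letters are removed rather than converted)
cleaned <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "")
```

Setting sub to a visible marker such as "?" instead of "" makes it easier to see afterwards where bytes were removed.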
I was getting the same error. However, in my case it didn't appear while reading the file, but a bit later when processing it. I realised that I was getting the error because the file hadn't been read with the correct encoding in the first place.
I found a much simpler solution (at least for my case) and wanted to share. I simply added encoding as below and it worked.
read.csv(<path>, encoding = "UTF-8")
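One caveat, as far as I understand the read.csv() documentation: encoding = only marks the strings as being in that encoding and does not convert anything, so it helps when the file really is UTF-8. If the file is actually in something else (Windows-1252, say), fileEncoding = re-encodes the text as it is read. Here is a small sketch using a throwaway temp file; the file, its contents and the column name are made up for illustration.

```r
# Write a tiny CSV whose bytes are Windows-1252 (0xC9 = É, 0xE9 = é)
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(c(charToRaw("name\n"),
           as.raw(0xc9), charToRaw("lectricit"), as.raw(0xe9),
           charToRaw(" SA\n")), con)
close(con)

# fileEncoding tells R what the file is encoded in, so it can re-encode on read
dat <- read.csv(tmp, fileEncoding = "WINDOWS-1252", stringsAsFactors = FALSE)

# In a UTF-8 session the column should now pass this check
validUTF8(dat$name)

unlink(tmp)  # remove the temp file
```

The result depends on your session's locale, so it is worth running validUTF8() on your own data after reading.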