
Error in tolower() invalid multibyte string

Tags:

r

This is the error that I receive when I try to run tolower() on a character vector from a file that cannot be changed (at least, not manually - too large).

Error in tolower(m) : invalid multibyte string X

The problem seems to be French company names containing the É character, although I have not checked all of them (that is also not feasible to do manually).

It's strange, because I expected encoding issues to be caught during read.csv(), rather than during operations after the fact.
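A minimal sketch of why the failure can show up late, assuming a UTF-8 session; the string m below is made up for illustration and is not taken from the file. Storing the bytes raises nothing, and the problem only surfaces once a function has to interpret them as characters.

```r
# Hypothetical input: 0xC9 is "É" in Windows-1252/Latin-1, but on its own
# it is not a valid UTF-8 sequence.
m <- "\xc9cole"

# Holding the bytes in a character vector is fine; only an explicit
# validity check notices the problem.
validUTF8(m)

# tolower() must decode the bytes, which is where a UTF-8 session is
# expected to fail. tryCatch() keeps the sketch running either way and
# returns the error message if there is one.
res <- tryCatch(tolower(m), error = function(e) conditionMessage(e))
res
```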

Is there a quick way to remove these multibyte strings? Or, perhaps a way to identify and convert? Or even just ignore them entirely?

asked Nov 02 '12 00:11 by Brandon Bertelsen




2 Answers

Here's how I solved my problem:

First, I opened the raw data in a text editor (Geany, in this case), clicked Properties and identified the file's encoding.

I then used the iconv() function to convert from that encoding to UTF-8:

x <- iconv(x,"WINDOWS-1252","UTF-8")

To be more specific, I did this for every character column of the data.frame from the imported CSV. Note that I set stringsAsFactors=FALSE in my read.csv() call, so the text columns are character vectors rather than factors.

# Select the character columns, then convert each one to UTF-8.
is_chr <- sapply(dat, is.character)
dat[is_chr] <- lapply(dat[is_chr], iconv, from = "WINDOWS-1252", to = "UTF-8")
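For the "identify and convert" part of the question, here is a rough sketch, assuming R 3.3.0 or later (where validUTF8() is available) and assuming the offending entries really are Windows-1252. The vector x stands in for one column of dat, and the names bad and fixed are illustrative only.

```r
# Hypothetical sample: the second element holds the single byte 0xC9,
# which is "É" in Windows-1252 but not a valid UTF-8 sequence.
x <- c("acme ltd", "\xc9cole de Paris")

# validUTF8() returns FALSE for elements whose bytes are not valid UTF-8.
bad <- !validUTF8(x)
which(bad)

# Convert only the flagged elements; the rest are left untouched.
fixed <- x
fixed[bad] <- iconv(x[bad], from = "WINDOWS-1252", to = "UTF-8")

# Every element should now be valid UTF-8, so tolower() has clean input.
all(validUTF8(fixed))
tolower(fixed)
```

If dropping the unreadable bytes is acceptable, iconv() also takes a sub argument (for example sub = "") that replaces bytes it cannot convert instead of returning NA.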
answered Sep 18 '22 15:09 by Brandon Bertelsen


I was getting the same error. In my case it did not appear while reading the file, but a bit later when processing it. I realised that the error occurred because the file had not been read with the correct encoding in the first place.

I found a much simpler solution (at least for my case) and wanted to share. I simply added encoding as below and it worked.

read.csv(<path>, encoding = "UTF-8")
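One caveat worth checking against the read.table documentation: as far as it describes the arguments, encoding only marks the strings as UTF-8 or Latin-1 and does not re-encode them, whereas fileEncoding declares the encoding of the file on disk and re-encodes on input. So if the file is not actually UTF-8, fileEncoding may be the argument that helps. A minimal sketch, assuming a session running in a UTF-8 locale and a made-up one-column file written to a temporary path:

```r
# Write a small hypothetical CSV whose "É" is stored as the single byte
# 0xC9, as a Windows-1252 file would store it.
path <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("name\n"), as.raw(0xC9), charToRaw("cole\n")), path)

# fileEncoding tells read.csv() what the bytes on disk are, so they are
# re-encoded to the session's native encoding while reading.
dat <- read.csv(path, fileEncoding = "WINDOWS-1252",
                stringsAsFactors = FALSE)

# In a UTF-8 session the column should now hold valid UTF-8 text.
validUTF8(dat$name)

# Clean up the temporary file.
unlink(path)
```

"WINDOWS-1252" here is only the encoding from the other answer; the right value depends on how the file was actually saved.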

answered Sep 18 '22 15:09 by Onur Ece