Invalid multibyte string in read.csv

Q: What is a multibyte string in R?

A multibyte string is one in which a single character may take more than one byte to store, as in UTF-8 and other Unicode encodings.
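A minimal sketch of what "more than one byte per character" looks like in R. The two-character Japanese string is only an illustration and does not come from the question; it is written with `\u` escapes so that the source stays ASCII.

```r
# Two Japanese characters, written with Unicode escapes.
s <- "\u65e5\u672c"

n_chars <- nchar(s, type = "chars")  # number of characters
n_bytes <- nchar(s, type = "bytes")  # number of bytes used to store them

print(n_chars)
print(n_bytes)
print(Encoding(s))  # how R has marked the string
```

Here the string holds two characters but occupies six bytes, because each of these characters takes three bytes in UTF-8.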

Q: What is a multibyte string?

A null-terminated multibyte string (NTMBS), or "multibyte string", is a sequence of nonzero bytes followed by a byte with value zero (the terminating null character). Each character stored in the string may occupy more than one byte.

Q: How do I read a csv file in R?

To load a .csv file into the current session and work with it, use the read.csv() function in base R. The result is returned as a data frame, with rows numbered by integers starting at 1.
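As a small sketch of that behaviour, the example below passes made-up CSV content through the `text=` argument instead of a file path, so it runs without any file on disk. The column names and values are hypothetical.

```r
# Hypothetical CSV content, supplied inline rather than from a file.
csv_text <- "name,score\nalice,10\nbob,20"

df <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

str(df)              # structure of the resulting data frame
print(rownames(df))  # default row names, numbered from 1
```

Replacing `text = csv_text` with a file path or URL reads from that source instead.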

Q: What package is read_csv in R?

Before you can use the read_csv function, you have to load readr, the R package that houses read_csv.

Tags:

r

read.csv

I am trying to import a csv that is in Japanese. This code:

url <- 'http://www.mof.go.jp/international_policy/reference/itn_transactions_in_securities/week.csv'
x <- read.csv(url, header=FALSE, stringsAsFactors=FALSE)

returns the following error:

Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) :  invalid multibyte string at '<91>ΊO<8b>y<82>ёΓ<e0><8f>،<94><94><84><94><83><8c>_<96>񓙂̏󋵁@(<8f>T<8e><9f><81>E<8e>w<92><e8><95>񍐋@<8a>փx<81>[<83>X<81>j'

I tried changing the encoding (Encoding(url) <- 'UTF-8' and also to latin1) and tried removing the read.csv parameters, but received the same "invalid multibyte string" message in each case. Is there a different encoding that should be used, or is there some other problem?

766

asked Jan 16 '13 16:01

jaredwoodard

1 Answer

Encoding sets the encoding of a character string. It doesn't set the encoding of the file represented by the character string, which is what you want.

This worked for me, after trying "UTF-8":

x <- read.csv(url, header=FALSE, stringsAsFactors=FALSE, fileEncoding="latin1")
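The difference between marking a string and declaring a file's encoding can be sketched without network access. The example below is an illustration under assumptions: it writes a tiny temporary file whose bytes are latin1 (made-up data, not the file from the question) and reads it back with `fileEncoding`. How the non-ASCII character is displayed depends on the locale of the R session.

```r
# Write a small file whose bytes are latin1 (hypothetical stand-in data).
# 0xE9 is "e with acute accent" in latin1; on its own it is not valid UTF-8.
tmp <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("caf"), as.raw(0xE9), charToRaw(",1\n")), tmp)

# Encoding() only marks an R string. Here that string is the file path,
# so the bytes inside the file are untouched.
path <- tmp
Encoding(path) <- "latin1"

# fileEncoding declares how the file contents are encoded, so they are
# converted to the session's native encoding while reading.
x <- read.csv(tmp, header = FALSE, stringsAsFactors = FALSE,
              fileEncoding = "latin1")

print(x)
unlink(tmp)
```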

And you may want to skip the first 16 lines, and read in the headers separately. Either way, there's still quite a bit of cleaning up to do.

x <- read.csv(url, header=FALSE, stringsAsFactors=FALSE,
  fileEncoding="latin1", skip=16)
# get started with the clean-up
x[,1] <- gsub("\u0081|`", "", x[,1])    # get rid of odd characters
x[,-1] <- as.data.frame(lapply(x[,-1],  # convert to numbers
  function(d) type.convert(gsub(d, pattern=",", replace=""))))
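The clean-up steps above can be tried on a small in-memory data frame. This is a sketch under assumptions: the column names and values are invented stand-ins for the downloaded data, and `as.is = TRUE` is passed to type.convert() so that the result does not depend on the R version's default.

```r
# Hypothetical stand-in for the data after read.csv: a label column
# followed by numeric columns stored as text with thousands separators.
x <- data.frame(V1 = c("week`1", "week 2"),
                V2 = c("1,234", "56"),
                V3 = c("7,890,123", "4"),
                stringsAsFactors = FALSE)

# Drop the stray backtick from the label column.
x[, 1] <- gsub("`", "", x[, 1], fixed = TRUE)

# Strip commas, then let type.convert() choose a numeric type per column.
x[, -1] <- lapply(x[, -1], function(d) {
  type.convert(gsub(",", "", d, fixed = TRUE), as.is = TRUE)
})

str(x)
```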

188

answered Oct 22 '22 01:10

Joshua Ulrich
