Encoding issue with write.xlsx (openxlsx)

I use the write.xlsx() function (from the openxlsx package) to turn a list object into an Excel spreadsheet, where each element of the list becomes a sheet of the Excel file. This function has been incredibly useful, and until now I had never encountered any issues with it. It is my understanding that this package, and this function in particular, does not depend on Java being installed or updated on the computer.

However, I recently discovered that the function is producing an error. This is what the console shows when I run write.xlsx() on the list:

Error in gsub("&", "&amp;", v, fixed = TRUE) : 
  input string 5107 is invalid UTF-8

I've identified the data frames that cause the issue, but I am not sure how to identify which part of a given data frame triggers the error.

I even went ahead and applied enc2utf8() to all of the columns in the data frame in question, but I still encounter the error. I've also used substr() on the data frame itself, which shows me the first n characters of each column, but I do not see any obvious issues in the output.

I also used install.packages() to reinstall the openxlsx package, in case there were any updates.

Does anyone know how I would go about identifying the cause of the error? Is it the function as it is written in the package? If the problem is in the encoding of the data itself, does enc2utf8() not suffice to resolve the issue?

Thanks!

im2wddrf asked Sep 12 '18 16:09


2 Answers

I just had this same problem. Building on this question, you could remove every character outside the printable ASCII range (space through ~) from the data frame with either:

library(dplyr)
df %>%
  mutate_if(is.character, ~gsub('[^ -~]', '', .))

for only character columns, or:

df %>%
  mutate_all(~gsub('[^ -~]', '', .))  

for all columns (note that gsub() returns character vectors, so this version turns every column into character), and then export to XLSX with write.xlsx().
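
If stripping every non-ASCII character is too aggressive (it also removes valid accented letters), a narrower option is to drop only the bytes that cannot be decoded as UTF-8. Below is a minimal base R sketch of that idea; the data frame and the helper name strip_invalid_utf8() are made up for illustration, and it assumes the rest of the text really is UTF-8:

```r
# Hypothetical example data: "caf" followed by a lone Latin-1 byte (0xE9),
# which is not valid UTF-8 on its own.
bad <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
df <- data.frame(
  id   = 1:3,
  name = c("alpha", bad, "gamma"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: remove only the bytes that are not valid UTF-8.
# Non-character columns are returned unchanged.
strip_invalid_utf8 <- function(x) {
  if (!is.character(x)) return(x)
  iconv(x, from = "UTF-8", to = "UTF-8", sub = "")
}

df_clean <- df
df_clean[] <- lapply(df_clean, strip_invalid_utf8)

# Every string should now pass the UTF-8 validity check.
all(validUTF8(df_clean$name))
```

Unlike the mutate_all() version, this leaves numeric columns as they are. If the stray bytes actually come from another encoding such as Latin-1, converting with iconv(x, from = "latin1", to = "UTF-8") instead would keep the characters rather than dropping them.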

sbha answered Oct 24 '22 18:10


As for finding the error, the number in the message points you to the problem (in your case, 5107). It appears to count the strings in the order they are written to the file. To find the particular data point that is causing the issue, this approach worked for me:

Let's assume our data frame has 20 variables and 10 of them are character type.

  • If you are writing the column headers, subtract the number of variables, because every header is a string: 5107 - 20 = 5087.
  • Divide the remainder by the number of character variables per observation: 5087 / 10 = 508.7. Rounding up, the problem is in row 509, because the headers and the first 508 rows account for 20 + 5080 = 5100 strings.
  • The 7th character variable in the 509th row will be your problem child (5107 - 5100 = 7).
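
The same search can also be done directly, without counting strings, by testing each character column with validUTF8() (base R, available since R 3.3.0). The sketch below uses a made-up data frame and a hypothetical helper named find_invalid_utf8(); the last lines repeat the arithmetic from the steps above, which assumes the strings are numbered row by row after the headers:

```r
# Hypothetical data frame: 3 rows, 2 character columns, one bad cell.
bad <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))  # "caf" + lone byte 0xE9
df <- data.frame(
  id   = 1:3,
  name = c("alpha", "beta", "gamma"),
  city = c("Oslo", bad, "Lima"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each character column, report the row numbers
# whose value is not valid UTF-8.
find_invalid_utf8 <- function(data) {
  chr_cols <- names(data)[vapply(data, is.character, logical(1))]
  hits <- lapply(chr_cols, function(col) {
    rows <- which(!validUTF8(data[[col]]))
    data.frame(row = rows, column = rep(col, length(rows)),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, hits)
}

find_invalid_utf8(df)

# The arithmetic from the steps above, using the answer's example numbers.
string_index <- 5107  # index reported in the error message
n_vars <- 20          # total columns (one header string each)
n_chr <- 10           # character columns per row
remaining <- string_index - n_vars                    # 5087
row_number <- ceiling(remaining / n_chr)              # 509
chr_position <- remaining - (row_number - 1) * n_chr  # 7
```

Running find_invalid_utf8() on each data frame in the list before calling write.xlsx() should point at the offending row and column.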

Dov Rosenberg answered Oct 24 '22 18:10