Read a UTF-8 text file with BOM

I have a text file with a byte order mark (U+FEFF) at the beginning. I am trying to read the file in R. Is it possible to avoid the byte order mark?

The function fread (from the data.table package) reads the file, but adds ļ»æ at the beginning of the first variable name:

> names(frame_pers)[1]
[1] "ļ»æreg_date"

The same happens with the read.csv function.

Currently I use a function that removes the BOM from the first column name, but I believe there should be a way to strip the BOM automatically.

remove.BOM <- function(x) setnames(x, 1, substring(names(x)[1], 4))

> names(frame_pers)[1]
[1] "ļ»æreg_date"
> remove.BOM(frame_pers)
> names(frame_pers)[1]
[1] "reg_date"

I am using the native encoding for the R session:

> options("encoding" = "")
> options("encoding")
$encoding
[1] ""
djhurio asked Feb 07 '14 10:02



People also ask

How do I check whether a file has a UTF-8 BOM?

To check whether the BOM is present, open the file in Notepad++ and look at the bottom right corner. If it says UTF-8-BOM, the file starts with a BOM.

Does UTF-8 have BOM?

The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8, so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8, or that it was converted to UTF-8 from a stream that contained an optional BOM.

How do I add a BOM to a UTF-8 file?

To add a BOM to a UTF-8 file, write the Unicode character \ufeff at the very beginning of the file. In UTF-8 that character is encoded as the three bytes 0xEF, 0xBB, 0xBF, so writing those bytes directly has the same effect.
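
As a rough illustration in R, the three bytes can be written and then checked with base functions alone. This is only a sketch: `has_utf8_bom` is a hypothetical helper name, and the temporary files and their contents are made up for the example.

```r
# The UTF-8 encoding of U+FEFF.
bom <- as.raw(c(0xEF, 0xBB, 0xBF))

# Hypothetical helper: TRUE if the file starts with the three BOM bytes.
has_utf8_bom <- function(path) {
  first_bytes <- readBin(path, what = "raw", n = 3L)
  identical(first_bytes, bom)
}

# Write one file that starts with the BOM ...
with_bom <- tempfile(fileext = ".csv")
con <- file(with_bom, open = "wb")
writeBin(bom, con)
writeBin(charToRaw("reg_date\n2014-02-07\n"), con)
close(con)

# ... and one without it, for comparison.
without_bom <- tempfile(fileext = ".csv")
writeLines(c("reg_date", "2014-02-07"), without_bom)

print(has_utf8_bom(with_bom))
print(has_utf8_bom(without_bom))
```

Checking the raw bytes avoids any dependence on the session encoding, which is what makes the BOM show up as different characters in different locales.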




2 Answers

Have you tried read.csv(..., fileEncoding = "UTF-8-BOM")? The help page ?file says:

As from R 3.0.0 the encoding ‘"UTF-8-BOM"’ is accepted and will remove a Byte Order Mark if present (which it often is for files and webpages generated by Microsoft applications).
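
A minimal, self-contained sketch of this suggestion follows. The temporary file and its contents are invented for the example; only the column name reg_date is borrowed from the question.

```r
# Create a small CSV that starts with the UTF-8 BOM bytes (EF BB BF).
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("reg_date,value\n2014-02-07,1\n"), con)
close(con)

# "UTF-8-BOM" decodes the file as UTF-8 and drops a leading BOM if present.
frame_pers <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
first_name <- names(frame_pers)[1]
print(first_name)
```

With this encoding the first column name should come back as plain reg_date, so no post-processing of the names is needed.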

hadley answered Sep 29 '22 13:09



In data.table, fread was fixed to handle the BOM between versions 1.9.6 and 1.9.8 (with this commit); update your data.table installation to get the fix.

Once done, you can just use fread:

fread("file_name.csv")
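
To try this on a file known to contain a BOM, something like the sketch below could be used. It assumes data.table 1.9.8 or later is installed and skips the read otherwise; the temporary file and the `fread_first_name` variable are made up for the example.

```r
# Create a small CSV that starts with the UTF-8 BOM bytes (EF BB BF).
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("reg_date,value\n2014-02-07,1\n"), con)
close(con)

# Stays NA if data.table is missing or older than 1.9.8.
fread_first_name <- NA_character_

if (requireNamespace("data.table", quietly = TRUE) &&
    packageVersion("data.table") >= "1.9.8") {
  frame_pers <- data.table::fread(tmp)
  fread_first_name <- names(frame_pers)[1]
}

print(fread_first_name)
```

If the printed name still carries extra leading characters, the installed data.table version is the first thing to check.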
MichaelChirico answered Sep 29 '22 13:09
