Importing unquoted strings as factors using read_csv from the readr package in R

I have a .csv datafile with many columns. Unfortunately, string values do not have quotation marks (i.e., apples instead of "apples"). When I use read_csv from the readr package, the string values are imported as characters:

library(readr)

mydat = data.frame(first = letters, numbers = 1:26, second = sample(letters, 26))
write.csv(mydat, "mydat.csv", quote = FALSE, row.names = FALSE)

read_csv("mydat.csv")

results in:

Parsed with column specification:
cols(
  first = col_character(),
  numbers = col_integer(),
  second = col_character()
)
# A tibble: 26 x 3
   first numbers second
   <chr>   <int>  <chr>
1      a       1      r
2      b       2      n
3      c       3      m
4      d       4      z
5      e       5      p
6      f       6      j
7      g       7      u
8      h       8      l
9      i       9      e
10     j      10      h
# ... with 16 more rows

Is there a way to force read_csv to import the string values as factors instead of characters?

Importantly, my datafile has so many columns (string and numeric variables) that, AFAIK, there is no way to make this work by providing column specifications with the col_types argument.

Alternative solutions (e.g. using read.csv to import the data, or dplyr code to change all character variables in a dataframe to factors) are appreciated too.

Update: I learned that whether the values in the csv file are quoted makes no difference for read.csv or read_csv. read.csv will import these values as factors (in R versions before 4.0.0, where stringsAsFactors defaults to TRUE); read_csv will import them as characters. I prefer to use read_csv because it's considerably faster than read.csv.
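
A minimal sketch of the read.csv alternative mentioned above. It requests factors explicitly so the result does not depend on the default of the installed R version. The temporary file path and the use of rev(letters) in place of a random sample are assumptions made for illustration:

```r
# Write the example data to a temporary file (illustrative path from tempfile()).
path <- tempfile(fileext = ".csv")
mydat <- data.frame(first = letters, numbers = 1:26, second = rev(letters))
write.csv(mydat, path, quote = FALSE, row.names = FALSE)

# Ask for factors explicitly instead of relying on the version-dependent default.
dat <- read.csv(path, stringsAsFactors = TRUE)
str(dat)
```

This keeps the numeric column as an integer and turns only the string columns into factors, at the cost of read.csv's slower parsing.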

user2363777 asked Nov 01 '16 19:11





2 Answers

This function uses dplyr to convert all character columns in a tbl_df or data frame to factors:

char.to.factors <- function(df) {
  # Takes a tbl_df or data frame and returns it with every character
  # column converted to a factor; all other columns are left unchanged.
  # Uses across(), which needs dplyr >= 1.0.0 and replaces the
  # deprecated mutate_each_() and funs().
  require(dplyr)

  mutate(df, across(where(is.character), as.factor))
}
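
For readers who would rather not depend on dplyr, the same conversion can be sketched in base R. The helper name chars_to_factors and the sample data are hypothetical and not part of the original answer:

```r
# Hypothetical helper: convert every character column of a data frame to a factor.
chars_to_factors <- function(df) {
  # Logical vector marking which columns are character vectors.
  is_chr <- vapply(df, is.character, logical(1))
  # Only touch the data frame when there is something to convert.
  if (any(is_chr)) {
    df[is_chr] <- lapply(df[is_chr], factor)
  }
  df
}

# Illustrative data with character columns, similar to what read_csv returns.
dat <- data.frame(first = letters, numbers = 1:26, second = rev(letters),
                  stringsAsFactors = FALSE)
converted <- chars_to_factors(dat)
str(converted)
```

Non-character columns such as numbers pass through unchanged.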
Sean Mullane answered Oct 24 '22 07:10



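I like alistaire's mutate_if() solution in the comments above, but for completeness, there is another solution worth mentioning. You can use unclass() to strip the tibble class and rebuild the object with data.frame(), which converts character columns to factors when stringsAsFactors = TRUE (the default before R 4.0.0, so pass it explicitly on newer versions). You'll see this in a lot of code that uses readr.

df <- data.frame(unclass(df), stringsAsFactors = TRUE)

or

df <- df %>% unclass %>% data.frame(stringsAsFactors = TRUE)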
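
The mutate_if() solution mentioned above is not reproduced on this page. A minimal sketch of what such a call could look like, assuming dplyr is installed; the data and variable names are illustrative only:

```r
library(dplyr)

# Illustrative data; stringsAsFactors = FALSE mimics the character columns read_csv returns.
dat <- data.frame(first = letters, numbers = 1:26, second = rev(letters),
                  stringsAsFactors = FALSE)

# Apply as.factor() to every column for which is.character() returns TRUE.
dat_factors <- mutate_if(dat, is.character, as.factor)
str(dat_factors)
```

Unlike the unclass() route, this leaves the object's class (tibble or data frame) as it was.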
Steve Rowe answered Oct 24 '22 07:10
