Override column types when importing data using readr::read_csv() when there are many columns

I am trying to read a CSV file using readr::read_csv in R. The file has about 150 columns; I am including only the first few in the example. I want to override the type of the second column, which read_csv parses as a date by default, to character or to another date format.

  GIS Join Match Code  Data File Year  State Name  State Code  County Name     County Code  Area Name                Persons: Total
  G0100010             2008-2012       Alabama     1           Autauga County  1            Autauga County, Alabama  54590

  df <- data.frame("GIS Join Match Code" = "G0100010",
                   "Data File" = "2008-2012",
                   "State" = "Alabama",
                   "County" = "Autauga County",
                   "Population" = 54590)

The issue is that with readr::read_csv, it seems I have to list every variable in col_types when overriding (see the error below). That is, I would need to specify all 150 columns individually. Is there a way to override the col_type of just specific columns, or to pass a named list of objects? In my case, I only need to override the column "Data File Year".

I understand that any omitted columns will be parsed automatically, which is fine for my analysis. It gets more complicated because the column names in the file I downloaded contain spaces (e.g., "Data File Year", "State Code").

  tempdata <- read_csv(df, col_types = "cc")
  Error: You have 135 column names, but 2 columns

The other option, I guess, if possible, is to skip reading the second column altogether?

asked Jul 22 '15 16:07 by rajvijay


2 Answers

Here is a more general answer for anyone who finds this question later. Relying on "skip" to jump over columns is less advisable, because it stops working if the structure of the imported data source changes.

It could be easier in your example to simply set a default column type, and then define any columns that differ from the default.

For example, if most columns are "d" (double) but the column named date should be "D" (date), load the data as follows:

  read_csv(df, col_types = cols(.default = "d", date = "D")) 

Or, if column date should be "D" and column xxx should be "i" (integer), do so as follows:

  read_csv(df, col_types = cols(.default = "d", date = "D", xxx = "i")) 

The ".default" argument above is powerful when you have many columns and only a few exceptions (such as "date" and "xxx").
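For column names that contain spaces, as in the question, the same approach can be sketched with backtick-quoted names. This is a minimal example under stated assumptions: csv_path is a hypothetical temporary file standing in for the real 150-column download, and only the first four columns from the question are reproduced. Any column not named in cols() is still guessed automatically.

```r
library(readr)

# Hypothetical stand-in for the downloaded file (first four columns only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "GIS Join Match Code,Data File Year,State Name,State Code",
  "G0100010,2008-2012,Alabama,1"
), csv_path)

# Override a single column; backticks are needed because the name has spaces.
# All other columns fall back to the default, col_guess().
tempdata <- read_csv(
  csv_path,
  col_types = cols(`Data File Year` = col_character())
)

class(tempdata$`Data File Year`)
```

The compact string form (col_types = "cc") needs one letter per column, which is why it fails on a wide file; cols() matches by name, so only the exceptions have to be listed.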

answered Sep 21 '22 23:09 by Nick


Yes. For example, to force numeric data to be treated as characters:

  examplecsv = "a,b,c\n1,2,a\n3,4,d"
  read_csv(examplecsv)
  # A tibble: 2 x 3
  #      a     b     c
  #  <int> <int> <chr>
  #1     1     2     a
  #2     3     4     d

  read_csv(examplecsv, col_types = cols(b = col_character()))
  # A tibble: 2 x 3
  #      a     b     c
  #  <int> <chr> <chr>
  #1     1     2     a
  #2     3     4     d

Choices are:

  col_character()
  col_date()
  col_time()
  col_datetime()
  col_double()
  col_factor()   # to enforce, will never be guessed
  col_integer()
  col_logical()
  col_number()
  col_skip()     # to force skip column
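The question also asks about skipping a column entirely. A minimal sketch of two ways to do that by name, assuming the same three-column toy data written to a hypothetical temporary file (csv_path): col_skip() drops one named column, while cols_only() keeps only the columns listed.

```r
library(readr)

# Hypothetical temporary file holding the toy data from the example above
csv_path <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,2,a", "3,4,d"), csv_path)

# Drop column b by name; the remaining columns are guessed as usual
no_b <- read_csv(csv_path, col_types = cols(b = col_skip()))
names(no_b)

# Keep only column b, read as character; every other column is dropped
only_b <- read_csv(csv_path, col_types = cols_only(b = col_character()))
names(only_b)
```

Because both forms refer to columns by name rather than by position, they keep working if columns are added or reordered in the source file.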

More: http://readr.tidyverse.org/articles/readr.html

answered Sep 18 '22 23:09 by Lukasz