
How can you read a CSV file in R with different number of columns

I have a sparse data set in CSV format in which the number of fields varies from row to row. Here is a sample of the file text.

12223, University
12227, bridge, Sky
12828, Sunset
13801, Ground
14853, Tranceamerica
14854, San Francisco
15595, shibuya, Shrine
16126, fog, San Francisco
16520, California, ocean, summer, golden gate, beach, San Francisco

When I use

read.csv("data.txt", header = F) 

R interprets the data set as having 3 columns, because the number of columns is determined from the first 5 rows. Is there any way to force R to put the data in more columns?

CompChemist asked Sep 20 '13 17:09



2 Answers

Deep in the ?read.table documentation there is the following:

The number of data columns is determined by looking at the first five lines of input (or the whole file if it has less than five lines), or from the length of col.names if it is specified and is longer. This could conceivably be wrong if fill or blank.lines.skip are true, so specify col.names if necessary (as in the ‘Examples’).

Therefore, let's define col.names to be length X (where X is the max number of fields in your dataset), and set fill = TRUE:

dat <- textConnection("12223, University
12227, bridge, Sky
12828, Sunset
13801, Ground
14853, Tranceamerica
14854, San Francisco
15595, shibuya, Shrine
16126, fog, San Francisco
16520, California, ocean, summer, golden gate, beach, San Francisco")

read.table(dat, header = FALSE, sep = ",",
  col.names = paste0("V", seq_len(7)), fill = TRUE)

     V1             V2             V3      V4           V5     V6             V7
1 12223     University
2 12227         bridge            Sky
3 12828         Sunset
4 13801         Ground
5 14853  Tranceamerica
6 14854  San Francisco
7 15595        shibuya         Shrine
8 16126            fog  San Francisco
9 16520     California          ocean  summer  golden gate  beach  San Francisco

If the maximum number of fields is unknown, you can use the nifty utility function count.fields (which I found in the read.table example code):

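# A text connection such as `dat` is exhausted once it has been read,
# so count the fields from the file itself (or from a fresh connection).
n_fields <- count.fields("data.txt", sep = ",")
n_fields
# [1] 2 3 2 2 2 2 3 3 7
max(n_fields)
# [1] 7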
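The two steps can be combined so that the column count is never hard-coded. The following is a minimal sketch, not part of the original answer: it assumes the sample rows are first written to a temporary file, and the names `path`, `n_col`, and `result` as well as the `strip.white = TRUE` setting are additions for illustration. A file path, unlike a text connection, can be read more than once.

```r
# Write the sample rows to a temporary file so it can be read repeatedly.
# (Assumption: in practice `path` would point at the real file, e.g. "data.txt".)
path <- tempfile(fileext = ".txt")
writeLines(c(
  "12223, University",
  "12227, bridge, Sky",
  "12828, Sunset",
  "13801, Ground",
  "14853, Tranceamerica",
  "14854, San Francisco",
  "15595, shibuya, Shrine",
  "16126, fog, San Francisco",
  "16520, California, ocean, summer, golden gate, beach, San Francisco"
), path)

# count.fields() returns one field count per line; the maximum is the
# number of columns needed to hold the widest row.
n_col <- max(count.fields(path, sep = ","))

# Supplying col.names of that length overrides the five-line guess.
# fill = TRUE pads short rows; strip.white = TRUE drops the space after each comma.
result <- read.table(path, header = FALSE, sep = ",",
                     col.names = paste0("V", seq_len(n_col)),
                     fill = TRUE, strip.white = TRUE,
                     stringsAsFactors = FALSE)

dim(result)  # 9 rows, 7 columns
```

Scanning the file twice costs an extra pass, which is usually acceptable; for very large files it may be cheaper to pass a known upper bound to `col.names` instead.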

Possibly helpful related reading: Only read limited number of columns in R

Blue Magister answered Oct 12 '22 01:10


You could read the data like this:

dat <- textConnection("12223, University
12227, bridge, Sky
12828, Sunset
13801, Ground
14853, Tranceamerica
14854, San Francisco
15595, shibuya, Shrine
16126, fog, San Francisco
16520, California, ocean, summer, golden gate, beach, San Francisco")

dat <- readLines(dat)
dat <- strsplit(dat, ",")

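This results in a list with one character vector per line. Every field after the first keeps the leading space that followed its comma.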
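If a rectangular result is needed from such a list, one option is to pad each element with NA up to the longest length and bind the rows. This is a minimal sketch rather than part of the original answer: the names `lines`, `fields`, `width`, `padded`, and `result` are illustrative, only three of the sample rows are used, and `trimws()` is added to drop the whitespace around each field.

```r
# A few of the sample rows, held as a character vector (one element per line).
lines <- c(
  "12223, University",
  "12227, bridge, Sky",
  "16520, California, ocean, summer, golden gate, beach, San Francisco"
)

# Split each line on commas and trim the surrounding whitespace.
fields <- lapply(strsplit(lines, ",", fixed = TRUE), trimws)

# The longest row determines the number of columns.
width <- max(lengths(fields))

# Assigning a longer length to a vector pads it with NA.
padded <- lapply(fields, function(x) {
  length(x) <- width
  x
})

# Bind the padded rows into a character matrix, then convert to a data frame.
result <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)

dim(result)  # 3 rows, 7 columns
```

All columns come out as character here, including the ID column; `type.convert(result, as.is = TRUE)` can be applied afterwards if numeric columns are wanted.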

Roland answered Oct 12 '22 01:10