Read.CSV not working as expected in R

I am stumped. Normally, read.csv works as expected, but I have come across an issue where the behavior is unexpected. It most likely is user error on my part, but any help will be appreciated.

Here is the URL for the file

http://nces.ed.gov/ipeds/datacenter/data/SFA0910.zip

Here is my code to get the file, unzip, and read it in:

 URL <- "http://nces.ed.gov/ipeds/datacenter/data/SFA0910.zip"
 download.file(URL, destfile="temp.zip")
 unzip("temp.zip")
 tmp <- read.table("sfa0910.csv", 
                   header=TRUE, stringsAsFactors=FALSE, sep=",", row.names=NULL)

Here is my problem. When I open the CSV data in Excel, the data look as expected. When I read the data into R, the first column is actually named row.names. R is reading in one extra column of data, but I can't figure out where the "error" occurs that is causing row.names to be a column. Put simply, it looks like the data shifted over by one column.

However, what is strange is that the last column in R does appear to contain the proper data.

Here are a few rows from the first few columns:

tmp[1:5,1:7]
  row.names UNITID XSCUGRAD SCUGRAD XSCUGFFN SCUGFFN XSCUGFFP
1    100654      R     4496       R     1044       R       23
2    100663      R    10646       R     1496       R       14
3    100690      R      380       R        5       R        1
4    100706      R     6119       R      774       R       13
5    100724      R     4638       R     1209       R       26

Any thoughts on what I could be doing wrong?

asked Aug 15 '12 23:08

Btibert3

2 Answers

My tip: use count.fields() as a quick diagnostic when delimited files do not behave as expected.

First, count the fields on each line and tabulate the counts with table():

table(count.fields("sfa0910.csv", sep = ","))
# 451  452 
#   1 6852

That tells you that all but one of the lines contains 452 fields. So which is the aberrant line?

which(count.fields("sfa0910.csv", sep = ",") != 452)
# [1] 1

The first line is the problem. On inspection, all lines except the first are terminated by 2 commas.
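One way to do that inspection without opening the file in an editor is to look at the raw line endings with readLines(). The sketch below uses a small made-up file (not the IPEDS data) in which each data line carries a single trailing comma; the file contents and column names are assumptions for illustration only, so check the real file's endings directly.

```r
# Hypothetical three-line file standing in for sfa0910.csv
path <- tempfile(fileext = ".csv")
writeLines(c("id,flag,count",
             "100654,R,4496,",
             "100663,R,10646,"), path)

raw_lines <- readLines(path)

# Show the last five characters of each line
line_tails <- substring(raw_lines, pmax(nchar(raw_lines) - 4, 1))
print(line_tails)

# Flag the lines that end with a separator
ends_with_sep <- grepl(",$", raw_lines)
print(which(ends_with_sep))
```

On the real file, the same two checks on the first handful of lines should be enough to see how the header and the data lines differ.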

The question now is: what does that mean? Is there supposed to be an extra field in the header row which was omitted? Or were the 2 commas appended to the other lines in error? It may be best to contact whoever generated the data, if possible, to clarify the ambiguity.
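To connect this back to the symptom in the question: when the header has exactly one field fewer than the data lines, read.table() assumes the first data column holds row names, and row.names=NULL keeps that column under the name row.names. The sketch below reproduces this on a tiny made-up file; the file contents and column names are assumptions, not the IPEDS data.

```r
# Hypothetical file: header has 3 fields, data lines have 4
# (the 4th is the empty field after the trailing comma)
path <- tempfile(fileext = ".csv")
writeLines(c("id,flag,count",
             "100654,R,4496,",
             "100663,R,10646,"), path)

field_counts <- count.fields(path, sep = ",")
print(field_counts)

# Same call shape as in the question
shifted <- read.table(path, header = TRUE, stringsAsFactors = FALSE,
                      sep = ",", row.names = NULL)

# The header names are shifted one column to the right
print(names(shifted))
print(shifted)
```

In this toy case the values that belong under id end up under row.names, and the last named column holds only the empty trailing field.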

answered Oct 03 '22 20:10

neilfws

I have a possible fix, based on mnel's comments:

dat <- readLines("sfa0910.csv")
# Count the commas on each line
ncommas <- lengths(regmatches(dat, gregexpr(",", dat, fixed = TRUE)))
head(ncommas)
# [1] 450 451 451 451 451 451

All lines after the first have an extra separator, which Excel ignores.

# Drop the last comma from every line after the header
dat[-1] <- sub('(.*),', '\\1', dat[-1])
writeLines(dat, 'temp.csv')

tmp <- read.table('temp.csv', header=TRUE, stringsAsFactors=FALSE, sep=",")

tmp[1:5,1:7]
  UNITID XSCUGRAD SCUGRAD XSCUGFFN SCUGFFN XSCUGFFP SCUGFFP
1 100654        R    4496        R    1044        R      23
2 100663        R   10646        R    1496        R      14
3 100690        R     380        R       5        R       1
4 100706        R    6119        R     774        R      13
5 100724        R    4638        R    1209        R      26

The moral of the story: listen to Joshua Ulrich ;)

Quick fix: open the file in Excel and save it. This will also delete the extra separators.

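Alternatively, read the header line separately, give read.table a dummy name for the extra trailing field, and then drop that column: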

# Read the header line and split it into column names
dat <- readLines("sfa0910.csv", n=1)
dum.names <- unlist(strsplit(dat, ','))
# Pad the names with a dummy 'XXXX' for the extra trailing field
tmp <- read.table("sfa0910.csv",
                   header=FALSE, stringsAsFactors=FALSE,
                   col.names=c(dum.names, 'XXXX'), sep=",", skip=1)
# Drop the dummy last column
tmp1 <- tmp[, -ncol(tmp)]
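The same idea can be wrapped in a small helper that also checks the extra column really is empty before discarding it. This is only a sketch: the function name read_csv_trailing_sep and the placeholder column name extra_field are made up here, the demo file is synthetic, and the header split assumes unquoted names with no embedded commas.

```r
# Hypothetical helper: read a CSV whose data lines have one more
# field than the header because of a trailing separator.
read_csv_trailing_sep <- function(path, sep = ",") {
  # Split the header line into column names
  header <- unlist(strsplit(readLines(path, n = 1), sep, fixed = TRUE))

  # Read the data with one placeholder name for the trailing field
  padded <- read.table(path, header = FALSE, skip = 1, sep = sep,
                       col.names = c(header, "extra_field"),
                       stringsAsFactors = FALSE)

  # Refuse to drop the column if it holds real data
  extra <- padded[["extra_field"]]
  if (!all(is.na(extra) | extra == "")) {
    stop("trailing field is not empty; the header may be missing a name")
  }

  padded[, -ncol(padded), drop = FALSE]
}

# Demo on a made-up file (not the IPEDS data)
path <- tempfile(fileext = ".csv")
writeLines(c("id,flag,count",
             "100654,R,4496,",
             "100663,R,10646,"), path)

fixed <- read_csv_trailing_sep(path)
print(fixed)
```

The emptiness check addresses the ambiguity raised in the other answer: if the trailing field ever contains values, the header is more likely missing a name, and silently dropping the column would lose data.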

answered Oct 03 '22 21:10

shhhhimhuntingrabbits
