
R: Why does read.table stop reading a file?

Tags:

r

I have a file, called genes.txt, which I'd like to become a data.frame. It's got a lot of lines, each line has three, tab delimited fields:

mike$ wc -l genes.txt
   42476 genes.txt

I'd like to read this file into a data.frame in R. I use the command read.table, like this:

genes = read.table(
    genes_file, 
    sep="\t", 
    na.strings="-", 
    fill=TRUE,
    col.names=c("GeneSymbol","synonyms","description")
)

This seems to work fine (genes_file points at genes.txt). However, the number of rows in my data.frame is significantly lower than the number of lines in my text file:

> nrow(genes)
[1] 27896

and things I can find in the text file:

mike$ grep "SELL" genes.txt 
SELL    CD62L|LAM1|LECAM1|LEU8|LNHR|LSEL|LYAM1|PLNHR|TQ1    selectin L

don't seem to be in the data.frame:

> grep("SELL",genes$GeneSymbol)
integer(0)

It turns out that

genes = read.delim(
    genes_file,
    header=FALSE,
    na.strings="-",
    fill=TRUE,
    col.names=c("GeneSymbol","synonyms","description")
)

works just fine. Why does read.delim work when read.table does not?

If it's of use, you can recreate genes.txt by running the following commands from a command line:

curl -O ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
gzip -cd gene_info.gz | awk -Ft '$1==9606{print $3 "\t" $5 "\t" $9}' > genes.txt

Be warned, though, that gene_info.gz is about 101 MB.

Mike Dewar asked Jun 10 '10 16:06



1 Answer

With read.table, the default quote characters are both the double quote and the single quote (quote = "\"'"). I'm guessing you have some unmatched single quotes (apostrophes) in your description field, so everything between one single quote and the next, including the line breaks in between, is being pooled together into one entry.

With read.delim the default quote character is the double quote only, and thus this isn't a problem.
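To see the effect in isolation, here is a minimal sketch using a made-up three-line file (the gene names and descriptions are hypothetical, not taken from genes.txt). Two of the descriptions contain an apostrophe, so with the default quote setting read.table should treat everything between them as a single quoted field and report fewer rows than read.delim does.

```r
# Hypothetical sample data: lines 1 and 3 each contain one apostrophe.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "A\ta1\t5'-nucleotidase",
  "B\tb1\tplain description",
  "C\tc1\t3'-phosphatase"
), tmp)

cols <- c("GeneSymbol", "synonyms", "description")

# read.table: default quote = "\"'", so the apostrophe opens a quoted field
via_table <- read.table(tmp, sep = "\t", fill = TRUE, col.names = cols)

# read.delim: default quote = "\"", so the apostrophe is ordinary text
via_delim <- read.delim(tmp, header = FALSE, col.names = cols)

nrow(via_table)  # expected to be smaller than the number of lines
nrow(via_delim)  # one row per line
```
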

Specify your quote character and you should be all set.

> genes<-read.table("genes.txt",sep="\t",quote="\"",na.strings="-",fill=TRUE, col.names=c("GeneSymbol","synonyms","description"))
> nrow(genes)
[1] 42476
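
If the file never uses quoting at all, one alternative (an assumption on my part, not something the question states) is to switch quote handling off entirely with quote = "". The sketch below also sets comment.char = "", because read.table treats "#" as a comment character by default while read.delim does not. The sample file is again hypothetical, and the final check compares the row count against the number of lines, much like the wc -l check in the question.

```r
# Hypothetical sample data with an apostrophe, a "-" placeholder and a "#".
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "A\ta1\t5'-nucleotidase",
  "B\t-\tplain description",
  "C\tc1\tlocus #3"
), tmp)

genes <- read.table(
  tmp,
  sep = "\t",
  quote = "",          # treat " and ' as ordinary characters
  comment.char = "",   # do not drop text after "#"
  na.strings = "-",
  fill = TRUE,
  col.names = c("GeneSymbol", "synonyms", "description"),
  stringsAsFactors = FALSE
)

# Sanity check: one data.frame row per line of the file
n_lines <- length(readLines(tmp))
stopifnot(nrow(genes) == n_lines)
```
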
Brian answered Oct 01 '22 13:10
