I have a file called genes.txt which I'd like to read into a data.frame. It has a lot of lines, and each line has three tab-delimited fields:
mike$ wc -l genes.txt
42476 genes.txt
I'd like to read this file into a data.frame in R. I use the command read.table, like this:
genes = read.table(
genes_file,
sep="\t",
na.strings="-",
fill=TRUE,
col.names=c("GeneSymbol","synonyms","description")
)
This seems to work fine, where genes_file points at genes.txt. However, the number of rows in my data.frame is significantly smaller than the number of lines in my text file:
> nrow(genes)
[1] 27896
and things I can find in the text file:
mike$ grep "SELL" genes.txt
SELL CD62L|LAM1|LECAM1|LEU8|LNHR|LSEL|LYAM1|PLNHR|TQ1 selectin L
don't seem to be in the data.frame:
> grep("SELL",genes$GeneSymbol)
integer(0)
It turns out that
genes = read.delim(
genes_file,
header=FALSE,
na.strings="-",
fill=TRUE,
col.names=c("GeneSymbol","synonyms","description")
)
works just fine. Why does read.delim work when read.table does not?
If it's of use, you can recreate genes.txt using the following commands, which you should run from a command line:
curl -O ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
gzip -cd gene_info.gz | awk -Ft '$1==9606{print $3 "\t" $5 "\t" $9}' > genes.txt
Be warned, though, that gene_info.gz is about 101 MB.
With read.table, one of the default quote characters is the single quote (the default is quote="\"'"). I'm guessing you have some unmatched single quotes in your description field, so all the data between one single quote and the next is being pooled together into one entry.
With read.delim the default quote character is only the double quote, so this isn't a problem.
Specify your quote character and you should be all set.
> genes<-read.table("genes.txt",sep="\t",quote="\"",na.strings="-",fill=TRUE, col.names=c("GeneSymbol","synonyms","description"))
> nrow(genes)
[1] 42476
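To see the effect in isolation, here is a minimal sketch on a made-up three-line file. The gene names and descriptions are hypothetical and not taken from genes.txt; the only assumption is that some descriptions contain a lone apostrophe, as guessed above.

```r
# Hypothetical tab-delimited input: two descriptions contain a lone apostrophe.
lines <- c(
  "GENE1\tA1|A2\t5'-nucleotidase",
  "GENE2\tB1\tplain description",
  "GENE3\tC1\t3'-phosphatase"
)
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

cols <- c("GeneSymbol", "synonyms", "description")

# read.table defaults to quote = "\"'", so an apostrophe opens a quoted
# string that runs on (across line breaks) until the next apostrophe.
with_default <- read.table(tmp, sep = "\t", fill = TRUE, col.names = cols)

# Restricting the quote set to the double quote keeps one row per line.
with_dquote <- read.table(tmp, sep = "\t", quote = "\"", fill = TRUE,
                          col.names = cols)

nrow(with_default)  # expected to be fewer than the 3 lines in the file
nrow(with_dquote)   # one row per line
```

If the file uses no quoting at all, quote="" disables quote handling entirely.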