How to prevent 'read.table' from changing underscores and hyphens to dots?

I have a bunch of files which I'm merging into one data frame. The file names look like this: unc.edu.b6530750-0410-43ec-bb79-f862ca3424a6.1918120.rsem.genes.results

And I want the file names to be the column names. I'm using the following code:

for (file in file_list) {

  if (!exists("dataset")) {
    dataset <- read.table(file, header = TRUE,
                          colClasses = c(rep("character", 2), rep("NULL", 2)),
                          col.names = c("gene_id", deparse(substitute(file)), "NULL", "NULL"),
                          sep = "\t")
    print(deparse(substitute(file)))
  }

  if (exists("dataset")) {
    temp_dataset <- read.table(file, header = TRUE,
                               colClasses = c(rep("character", 2), rep("NULL", 2)),
                               col.names = c("gene_id", deparse(substitute(file)), "NULL", "NULL"),
                               sep = "\t")
    print(deparse(substitute(file)))
    dataset <- merge(dataset, temp_dataset, by = "gene_id")
    rm(temp_dataset)
  }
}

All goes well, except that the column names now have hyphens (and other special characters) replaced by dots.

colnames(dataset)

[1] "gene_id"                                                                       
[2] "X...unc.edu.02cb8dbe.ef56.471c.b52d.41c29219fd95.1794854.rsem.genes.results..x"
[3] "X...unc.edu.02cb8dbe.ef56.471c.b52d.41c29219fd95.1794854.rsem.genes.results..y"
[4] "X...unc.edu.02f5dcba.bdcc.4424.aed4.195a8d551325.2085643.rsem.genes.results."  

Any explanation as to what causes this would be helpful because I will need to change these names, using another file, later on.

paul_dg asked Aug 24 '14 12:08


1 Answer

As @akrun stated in the comments, read.table(file, ..., check.names=FALSE) will solve the immediate problem.
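As for what causes it: by default read.table has check.names = TRUE, which passes the column names through make.names(), and that turns every character that is not valid in a syntactic R name (hyphens, quotes, slashes) into a dot. Here is a minimal base R sketch of the effect; the shortened file name and the two-column temporary file are made up for illustration.

```r
# A hyphenated name in the style of the question (shortened, hypothetical).
nm <- "unc.edu.b6530750-0410.rsem.genes.results"

# The sanitising step that read.table() applies when check.names = TRUE.
sanitised <- make.names(nm)
print(sanitised)

# A tiny tab-separated file in a temporary location (illustrative data only).
tmp <- tempfile(fileext = ".tsv")
writeLines(c("gene_id\tvalue", "TNF\t1.5"), tmp)

# Default behaviour: the supplied column name is sanitised.
default_names <- names(read.table(tmp, header = TRUE, sep = "\t",
                                  col.names = c("gene_id", nm)))

# With check.names = FALSE the name is kept exactly as supplied.
kept_names <- names(read.table(tmp, header = TRUE, sep = "\t",
                               col.names = c("gene_id", nm),
                               check.names = FALSE))
print(default_names)
print(kept_names)
```

Note that a column whose name contains a hyphen has to be referred to with backticks or as a string, e.g. df[["unc.edu.b6530750-0410.rsem.genes.results"]].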

However, there are now neater ways to achieve what you're trying to do using some of the tidyverse packages.

First let's load packages and generate some sample data:

library(purrr)
library(readr)
# Three small tab-separated tables, held as literal strings
data <- c("gene_id\tresult\trandom_a\trandom_b
TNF\t1e-8\t1.7\t4.3
IL8\t0.4\t-0.3\t8.6",
"gene_id\tresult\trandom_a\trandom_b
TNF\t2.4e-7\t1.7\t4.3
IL8\t0.9\t0.8\t8.3",
"gene_id\tresult\trandom_a\trandom_b
TNSF8\t0.003\t2.1\t9.7
IL8\t0.02\t1.9\t4.6")
file_list <- sprintf("file_%d.tsv", 1:3)
# I() marks each string as literal data rather than a file path
walk2(data, file_list, ~write_tsv(read_tsv(I(.x)), .y))

Now here's the actual bit that reads and merges the data:

library(purrr)
library(readr)
library(dplyr)
dataset <- file_list %>%
  map(~read_tsv(.x, col_types = "cc__", col_names = c("gene_id", .x), skip = 1)) %>%
  reduce(full_join, by = "gene_id")

This uses map to read each file in turn, skipping the first row (presumably the header) and dropping the third and fourth columns. The two remaining columns are read as character and named gene_id and the name of the file. The resulting data frames are then joined one after another with dplyr::full_join via purrr::reduce.
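If you would rather stay in base R, the same read-then-join pattern can be sketched with lapply and Reduce. This is only an illustration under assumptions, not the code from the question: the sample file names, the read_one helper and the drop_1/drop_2 placeholder names are all made up here. It reads each file exactly once, whereas the loop in the question runs both if blocks on the first iteration and so merges the first file with itself, which is where the ..x and ..y columns come from.

```r
# Create two small sample files with hyphenated names (hypothetical data).
tmp_dir <- tempfile("rsem_")
dir.create(tmp_dir)
contents <- list(
  "sample-a.tsv" = c("gene_id\tresult\trandom_a\trandom_b",
                     "TNF\t1e-8\t1.7\t4.3",
                     "IL8\t0.4\t-0.3\t8.6"),
  "sample-b.tsv" = c("gene_id\tresult\trandom_a\trandom_b",
                     "TNF\t2.4e-7\t1.7\t4.3",
                     "IL8\t0.9\t0.8\t8.3")
)
for (nm in names(contents)) {
  writeLines(contents[[nm]], file.path(tmp_dir, nm))
}
file_list <- file.path(tmp_dir, names(contents))

# Hypothetical helper: keep the first two columns as character, drop the
# last two, and name the value column after the file without sanitising it.
read_one <- function(path) {
  read.table(path, header = TRUE, sep = "\t",
             colClasses = c("character", "character", "NULL", "NULL"),
             col.names = c("gene_id", basename(path), "drop_1", "drop_2"),
             check.names = FALSE)
}

# Read every file once, then join the pieces pairwise on gene_id.
# all = TRUE keeps genes that are missing from some files (a full join).
merged <- Reduce(function(x, y) merge(x, y, by = "gene_id", all = TRUE),
                 lapply(file_list, read_one))
print(names(merged))
```

Using basename(path) keeps directory separators out of the column names; pass the full path instead if the files are only unique by directory.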

Although this question was asked a long time ago, this type of task is common, so I thought a tidyverse-based answer would still be useful. (And it's still in the 'unanswered questions with votes' filter.)

Nick Kennedy answered Oct 15 '22 08:10