I have a bunch of files which I'm merging into one data frame. The file names look like this: unc.edu.b6530750-0410-43ec-bb79-f862ca3424a6.1918120.rsem.genes.results
I want the file names to be the column names. I'm using the following code:
for (file in file_list){
if (!exists("dataset")){
dataset <- read.table(file, header=TRUE, colClasses = c(rep("character", 2), rep("NULL", 2)), col.names = c("gene_id", deparse(substitute(file)), "NuLL", "NULL"), sep="\t")
print(deparse(substitute(file)))
}
if (exists("dataset")){
temp_dataset <-read.table(file, header=TRUE, colClasses = c(rep("character", 2), rep("NULL", 2)), col.names = c("gene_id", deparse(substitute(file)), "NuLL", "NULL"), sep="\t")
print(deparse(substitute(file)))
dataset<-merge(dataset, temp_dataset, by = "gene_id")
rm(temp_dataset)
}
}
All goes well except that the hyphens in the column names have been replaced by dots.
colnames(data)
[1] "gene_id"
[2] "X...unc.edu.02cb8dbe.ef56.471c.b52d.41c29219fd95.1794854.rsem.genes.results..x"
[3] "X...unc.edu.02cb8dbe.ef56.471c.b52d.41c29219fd95.1794854.rsem.genes.results..y"
[4] "X...unc.edu.02f5dcba.bdcc.4424.aed4.195a8d551325.2085643.rsem.genes.results."
Any explanation as to what causes this would be helpful because I will need to change these names, using another file, later on.
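As @akrun stated in the comments, read.table(file, ..., check.names=FALSE) will solve the immediate problem. By default, read.table passes the column names through make.names, which replaces any character that is not valid in a syntactic R name (such as a hyphen) with a dot.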
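To illustrate what happens to the names, here is a minimal sketch using inline text instead of the real files. The file name is the one from the question, the two-row table is invented for the example, and the "./" prefix in the last line is only a guess at what the question's file_list contained.

```r
# File name taken from the question; the data below is made up for illustration.
fname <- "unc.edu.b6530750-0410-43ec-bb79-f862ca3424a6.1918120.rsem.genes.results"

# make.names() is what read.table() applies to column names when
# check.names = TRUE (the default): hyphens become dots.
make.names(fname)

txt <- "gene_id\tresult\nTNF\t1e-8\nIL8\t0.4"

# Default behaviour: the supplied column name is sanitised.
default <- read.table(text = txt, header = TRUE, sep = "\t",
                      col.names = c("gene_id", fname))

# With check.names = FALSE the name is kept exactly as given.
kept <- read.table(text = txt, header = TRUE, sep = "\t",
                   col.names = c("gene_id", fname), check.names = FALSE)

names(default)[2]
names(kept)[2]

# A name that starts with an invalid character (for example a quote mark
# left over from deparse()) gets an "X" prefix as well.
make.names("\"./a-b\"")
```

This would also account for the leading "X..." and the trailing dot in the question's output, if the names passed to col.names still carried the quote marks added by deparse and a "./" path prefix.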
However, there are now neater ways to achieve what you're trying to do using some of the tidyverse packages.
First let's load packages and generate some sample data:
library(purrr)
library(readr)
data <- c("gene_id\tresult\trandom_a\trandom_b
TNF\t1e-8\t1.7\t4.3
IL8\t0.4\t-0.3\t8.6",
"gene_id\tresult\trandom_a\trandom_b
TNF\t2.4e-7\t1.7\t4.3
IL8\t0.9\t0.8\t8.3",
"gene_id\tresult\trandom_a\trandom_b
TNSF8\t0.003\t2.1\t9.7
IL8\t0.02\t1.9\t4.6")
file_list <- sprintf("file_%d.csv", 1:3)
walk2(data, file_list, ~write_tsv(read_tsv(.x), .y))
Now here's the actual bit that reads and merges the data:
library(purrr)
library(readr)
library(dplyr)
dataset <- file_list %>%
map(~read_tsv(.x, col_types = "cc__", col_names = c("gene_id", .x), skip = 1)) %>%
reduce(full_join, by = "gene_id")
This uses map to read in each file one by one, skipping the first row (presumably the header) and dropping the third and fourth columns. The two remaining columns are named gene_id and the name of the file. The resulting data frames are then joined sequentially on gene_id using dplyr::full_join and purrr::reduce.
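For anyone who prefers to stay in base R, the same read-then-join pattern can be sketched with lapply, Reduce and merge. This is only a rough equivalent under some assumptions: the two sample files and their names are invented, read_one is a hypothetical helper, and all = TRUE is used so that merge keeps genes missing from some files, in the spirit of full_join.

```r
# Write two small, made-up tab-separated files to a temporary directory.
tmp <- tempfile("rsem_")
dir.create(tmp)
contents <- list(
  "a-1.results" = "gene_id\tresult\tx\ty\nTNF\t1e-8\t1.7\t4.3\nIL8\t0.4\t-0.3\t8.6",
  "b-2.results" = "gene_id\tresult\tx\ty\nTNF\t2.4e-7\t1.7\t4.3\nIL8\t0.9\t0.8\t8.3"
)
for (nm in names(contents)) writeLines(contents[[nm]], file.path(tmp, nm))
file_list <- list.files(tmp, full.names = TRUE)

# Hypothetical helper: read the first two columns of one file and name the
# second column after the file; check.names = FALSE keeps the hyphens.
read_one <- function(path) {
  read.table(path, header = TRUE, sep = "\t",
             colClasses = c("character", "character", "NULL", "NULL"),
             col.names = c("gene_id", basename(path), "skip_1", "skip_2"),
             check.names = FALSE)
}

# Join the per-file data frames one after another on gene_id.
dataset <- Reduce(function(x, y) merge(x, y, by = "gene_id", all = TRUE),
                  lapply(file_list, read_one))
colnames(dataset)
```

Because every file is read into its own list element before merging, there is no need for the exists("dataset") check, and no file gets merged with itself.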
Although this question was asked a long time ago, this type of task is common, so I thought a tidyverse-based answer would still be useful. (And it's still in the 'unanswered questions with votes' filter.)