I use R for most of my statistical analysis. However, cleaning/processing data, especially when dealing with sizes of 1 GB+, is quite cumbersome. So I use common UNIX tools for that. But my question is, is it possible to, say, run them interactively in the middle of an R session? An example: let's say file1 is the output dataset from an R process, with 100 rows. From this, for my next R process, I need a specific subset of columns 1 and 2, file2, which can easily be extracted through cut and awk.
So the workflow is something like:
Some R process => file1
cut --fields=1,2 <file1 | awk something something >file2
Next R process using file2
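Inside an R session, I imagine the glue looking something like this (just a sketch; some_r_process and the awk expression are placeholders):
# rough sketch of the workflow I have in mind, run from within R;
# some_r_process and the awk expression are placeholders
write.table(some_r_process, "file1", row.names = FALSE, col.names = FALSE)
system("cut --fields=1,2 <file1 | awk 'something something' >file2")
next_input <- read.table("file2")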
Apologies in advance if this is a foolish question.
Try this (adding other read.table arguments if needed):
# 1
DF <- read.table(pipe("cut --fields=1,2 < data.txt | awk something_else"))
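Just to make the placeholder concrete, a hypothetical awk step that keeps only rows whose second field is positive might look like this (the filter is made up purely for illustration):
# 1a (hypothetical filter, for illustration only)
DF <- read.table(pipe("cut --fields=1,2 < data.txt | awk '$2 > 0'"))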
or in pure R:
# 2
DF <- read.table("data.txt")[1:2]
or, to avoid reading the unwanted fields at all (assuming there are 4 fields):
# 3
DF <- read.table("data.txt", colClasses = c(NA, NA, "NULL", "NULL"))
The last line could be modified for the case where we know we want the first two fields but don't know how many other fields there are:
# 3a
n <- count.fields("data.txt")[1]
read.table("data.txt", header = TRUE, colClasses = c(NA, NA, rep("NULL", n-2)))
The sqldf package can be used. In this example we assume a csv file, data.csv, and that the desired fields are called a and b. If it's not a csv file then use appropriate arguments to read.csv.sql to specify another separator, etc.:
# 4
library(sqldf)
DF <- read.csv.sql("data.csv", sql = "select a, b from file")
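For instance, for a tab-separated file something like this should work (a sketch; I'm assuming the sep argument and that the fields are still called a and b):
# 4a (sketch for a tab-separated file)
library(sqldf)  # already loaded above, repeated for completeness
DF <- read.csv.sql("data.txt", sql = "select a, b from file", sep = "\t")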