I have a large data set (125,000 rows, ~20 MB) from which rows containing a certain string need to be deleted, and only some columns need to be selected while the file is being read.
First, I discovered that the grepl function does not work properly, since fread reads the data as a single column, as also noted in this question.
The example data can be found here (following @akrun's advice), and the head of the data looks like this:
head(sum_data)
TRIAL : 1 3331 9091
TRIAL : 2 1384786531 278055555
2 0.10 0.000E+00 -0.0047 -0.0168 -0.9938 -0.0087 -0.0105 -0.9709 0.0035 0.0079 -0.9754 0.0081 0.0023 0.9997 -0.135324E-09 0.278754E-01
2 0.20 0.000E+00 -0.0121 0.0002 -0.9898 -0.0364 -0.0027 -0.9925 -0.0242 -0.0050 -0.9929 0.0029 -0.0023 0.9998 -0.133521E-09 0.425567E-01
2 0.30 0.000E+00 0.0193 -0.0068 -0.9884 0.0040 0.0139 -0.9782 -0.0158 0.0150 -0.9814 0.0054 -0.0008 0.9997 -0.134103E-09 0.255356E-01
2 0.40 0.000E+00 -0.0157 0.0183 -0.9879 -0.0315 -0.0311 -0.9908 -0.0314 -0.0160 -0.9929 0.0040 0.0010 0.9998 -0.134819E-09 0.257300E-01
2 0.50 0.000E+00 -0.0402 0.0300 -0.9832 -0.0093 0.0269 -0.9781 -0.0326 0.0247 -0.9802 0.0044 -0.0010 0.9997 -0.131515E-09 0.440350E-01
I attempted to read the data with fread and used grepl to remove the rows:
files <-dir(pattern = "*sum.txt",full.names = FALSE)
library(data.table)
fread_files <- function(files){
sum_data_read <- fread(files,skip=2, sep="\t", ) #seperation is tab.
df_grep <- sum_vgm_read [!grepl("TRI",sum_vgm_read$V1),] # for removing the lines that contain "TRIAL" letter in V1 column. But so far there is no V1 column is recognized!!
df <- bind_rows(df_grep) #binding rows after removing
write.table(as.data.table(df),file = gsub("(.*)(\\..*)", "\\1_new\\2", files),row.names = FALSE,col.names = TRUE)
}
and finally applied the function to every file with lapply:
lapply(files, fread_files)
When I perform this, only one row of data is created as output, so something is going wrong, but I don't know what. Thanks in advance for your help!
"Firstly, I discovered that grepl function does not work properly since fread makes the data as one column indicated also in this question."
But that question's accepted answer says that problem was fixed in v1.9.6. Which version are you using? That's why we ask you to please state the version number up front, to save time answering.
It is a great example file and the question is great.
I would not try to reinvent the wheel, as operations like these have long been implemented as command line tools, which you can use directly together with fread. The advantage is that you won't churn through R memory: you can leave the filtering to the command line tool, which can be much more efficient. For example, if you load all the lines as lines into R, those strings will be cached in R's global string cache (at least temporarily). Doing that filter outside R first will save that cost.
I downloaded your great file and tested the following, which works.
> fread("grep -v TRIAL sum_data.txt")
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
1: 2 0.1 0 -0.0047 -0.0168 -0.9938 -0.0087 -0.0105 -0.9709 0.0035 0.0079 -0.9754 0.0081 0.0023 0.9997 -1.35324e-10 0.0278754
2: 2 0.2 0 -0.0121 0.0002 -0.9898 -0.0364 -0.0027 -0.9925 -0.0242 -0.0050 -0.9929 0.0029 -0.0023 0.9998 -1.33521e-10 0.0425567
3: 2 0.3 0 0.0193 -0.0068 -0.9884 0.0040 0.0139 -0.9782 -0.0158 0.0150 -0.9814 0.0054 -0.0008 0.9997 -1.34103e-10 0.0255356
4: 2 0.4 0 -0.0157 0.0183 -0.9879 -0.0315 -0.0311 -0.9908 -0.0314 -0.0160 -0.9929 0.0040 0.0010 0.9998 -1.34819e-10 0.0257300
5: 2 0.5 0 -0.0402 0.0300 -0.9832 -0.0093 0.0269 -0.9781 -0.0326 0.0247 -0.9802 0.0044 -0.0010 0.9997 -1.31515e-10 0.0440350
---
124247: 250 49.5 0 -0.0040 0.0141 0.9802 -0.0152 0.0203 -0.9877 -0.0015 0.0123 -0.9901 0.0069 0.0003 0.9997 -1.30220e-10 0.0213215
124248: 250 49.6 0 -0.0006 0.0284 0.9819 0.0021 0.0248 -0.9920 0.0264 0.0408 -0.9919 0.0028 -0.0028 0.9997 -1.30295e-10 0.0284142
124249: 250 49.7 0 0.0378 0.0305 0.9779 -0.0261 0.0232 -0.9897 -0.0236 0.0137 -0.9928 0.0102 -0.0023 0.9997 -1.29890e-10 0.0410760
124250: 250 49.8 0 0.0569 -0.0203 0.9800 -0.0028 -0.0009 -0.9906 -0.0139 -0.0169 -0.9918 0.0039 -0.0017 0.9997 -1.31555e-10 0.0513482
124251: 250 49.9 0 0.0234 -0.0358 0.9840 -0.0340 0.0114 -0.9873 -0.0255 0.0134 -0.9888 0.0006 0.0009 0.9997 -1.30862e-10 0.0334976
>
The -v makes grep return all lines except lines containing the string TRIAL. Given the number of high quality engineers that have looked at the command line tool grep over the years, it is most likely as fast as you can get, as well as being correct, convenient, well documented online, and easy to learn and to search for solutions to specific tasks. If you need more complicated string filters (e.g. strings at the beginning or the end of the lines), grep syntax is very powerful. Learning its syntax is a skill that transfers to other languages and environments.
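As an illustration of a filter anchored to the start of a line, here is a minimal sketch. It assumes grep is on the PATH and a data.table version recent enough to have the cmd argument of fread; the temporary file and its four-column contents are made up to stand in for sum_data.txt.

```r
library(data.table)

# Small stand-in file (hypothetical contents mimicking sum_data.txt)
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "TRIAL : 1 3331 9091",
  "TRIAL : 2 1384786531 278055555",
  "2 0.10 0.000E+00 -0.0047",
  "2 0.20 0.000E+00 -0.0121"
), tmp)

# -v inverts the match; '^TRIAL' matches only lines that START with TRIAL
dt <- fread(cmd = paste("grep -v '^TRIAL'", shQuote(tmp)))
print(dt)
```

Passing the command through cmd makes it explicit that the string is a shell command rather than a file name, which is probably the safer habit when file names are built programmatically.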
For further examples on the use of command line tools in fread, you may check the article Convenience features of fread. Please note that "On Windows we recommend Cygwin (run one .exe to install) which includes the command line tools such as grep".
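Since the question also asks for column selection and for processing several files, the same idea could be wrapped roughly as follows. This is only a sketch: filter_file is a hypothetical helper name, the kept column positions are arbitrary, and the demo file is invented so the example runs on its own (grep on the PATH and the cmd argument of fread are assumed).

```r
library(data.table)

# Hypothetical helper: drop TRIAL lines with grep, keep only the requested
# columns while reading, and write the result next to the input as <name>_new.<ext>
filter_file <- function(path, keep_cols) {
  dt <- fread(cmd = paste("grep -v TRIAL", shQuote(path)), select = keep_cols)
  out <- sub("(\\.[^.]*)$", "_new\\1", path)
  fwrite(dt, out, sep = "\t")
  out
}

# Demo directory with one made-up input file
demo_dir <- tempfile()
dir.create(demo_dir)
writeLines(c(
  "TRIAL : 1 3331 9091",
  "2 0.10 0.000E+00 -0.0047",
  "2 0.20 0.000E+00 -0.0121"
), file.path(demo_dir, "a_sum.txt"))

# The pattern is a regular expression (not a glob): names ending in "sum.txt"
files <- list.files(demo_dir, pattern = "sum\\.txt$", full.names = TRUE)
outs <- vapply(files, filter_file, character(1), keep_cols = c(1, 2, 4))
res <- fread(outs[[1]])
print(res)
```

The select argument means the unwanted columns are never materialised in R, which fits the goal of choosing columns during the read rather than afterwards.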
To read a file and remove rows based on a string criterion, you could use the readLines function and filter the result. I use the stringr package for string manipulation.
library(stringr)
# Read your file by lines
DT <- readLines("sum_data")
length(DT)
#> [1] 124501
# Detect which lines contain TRIAL
trial_lines <- str_detect(DT, "TRI")
head(trial_lines)
#> [1] TRUE TRUE FALSE FALSE FALSE FALSE
# Remove those lines
DT <- DT[!trial_lines]
length(DT)
#> [1] 124251
# Rewrite your file by line
writeLines(DT, "new_file")
If you have performance issues, you could try read_lines from the readr package instead of base readLines.
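If the goal is a table rather than a rewritten text file, the filtered lines can probably be handed straight to fread through its text argument, which avoids the shell entirely (useful where grep is not installed). A minimal sketch with invented sample lines; base readLines is used here, and readr::read_lines could be swapped in if it is available:

```r
library(data.table)

# Small stand-in file (hypothetical contents mimicking sum_data)
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "TRIAL : 1 3331 9091",
  "TRIAL : 2 1384786531 278055555",
  "2 0.10 0.000E+00 -0.0047",
  "2 0.20 0.000E+00 -0.0121"
), tmp)

# Read raw lines, then drop every line containing the literal string TRIAL
lines <- readLines(tmp)
kept <- lines[!grepl("TRIAL", lines, fixed = TRUE)]

# Parse the remaining lines into a data.table without touching the disk again
dt <- fread(text = kept)
print(dt)
```

This keeps all lines in R memory at once, so it trades the efficiency of an external filter for portability.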