 

Subsetting Data with GREP

I have a very large text file (16 GB) that I want to subset as fast as possible. Here is a sample of the data involved:

0   M   4   0   
0   0   Q   0   10047345    3080290,4098689 50504886,4217515    9848058,1084315 50534229,4217515    50591618,4217515    26242582,2597528    34623075,3279130    68893581,5149883    50628761,4217517    32262001,3142702    35443881,3339757
0   108 C   0   50628761
0   1080    C   0   50628761
1   M   7   0
1   0   Q   0   17143989    
2   M   15  1   
2   0   Q   0   17143989    4219157,1841361,853923,1720163,1912374,1755325,4454730  65548702,4975721    197782,39086    54375043,4396765    31589696,3091097    6876504,851594  3374640,455375  13274885,1354902    31585771,3091016    61234218,4723345    31583582,3091014
2   27  C   0   31589696

The first number on every line is a sessionID and any line with an 'M' denotes the start of a session (data is grouped by session). The number following an M is a Day and the second number is a userID, a user can have multiple sessions.
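The grouping can be seen by pulling out just the 'M' lines (a quick illustrative awk pass; `sample.txt` stands in for the real file):

```shell
# Print the session/day/user mapping from the 'M' lines only.
# Field positions are taken from the sample above:
# $1 = sessionID, $2 = record type, $3 = day, $4 = userID.
printf '0 M 4 0\n0 0 Q 0 10047345\n1 M 7 0\n2 M 15 1\n' > sample.txt
awk '$2=="M"{print "session", $1, "day", $3, "user", $4}' sample.txt
# -> session 0 day 4 user 0
#    session 1 day 7 user 0
#    session 2 day 15 user 1
```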

I want to extract all lines related to a specific user which for each session include all of the lines up until the next 'M' line is encountered (can be any number of lines). As a second task I also want to extract all session lines related to a specific day.

For example with the above data, to extract the records for userid '0' the output would be:

0   M   4   0   
0   0   Q   0   10047345    3080290,4098689 50504886,4217515    9848058,1084315 50534229,4217515    50591618,4217515    26242582,2597528    34623075,3279130    68893581,5149883    50628761,4217517    32262001,3142702    35443881,3339757
0   108 C   0   50628761
0   1080    C   0   50628761
1   M   7   0
1   0   Q   0   17143989    

To extract the records for day 7 the output would be:

1   M   7   0
1   0   Q   0   17143989    

I believe there is a much more elegant and simple solution to what I have achieved so far and it would be great to get some feedback and suggestions. Thank you.

What I have tried

I tried to use pcregrep -M to apply a multiline pattern directly (matching everything between two 'M' lines) but struggled to get it working across the line breaks. I still suspect this may be the fastest option, so any guidance on whether it is possible would be great.

The next part is quite scattered and it is not necessary to read on if you already have an idea for a better solution!

Failing the above, I split the problem into two parts:

  • Part 1: Isolating all 'M' lines to obtain a list of sessions that belong to that user/day

    1. The grep method is fast, but I still need to figure out how to use its output; note that with GNU grep, \t is only treated as a tab under -P (Perl regexes), and -c prints only a count of matching lines:

      time grep -P "M\t.*\t$user_id" trainSample.txt >> sessions.txt

    2. The awk method to create an array is slow; note that $user_id is not expanded inside single quotes, so it has to be passed in with -v, and an array assignment needs ( ):

      time myarr=($(awk -v uid="$user_id" '$2=="M" && $4==uid {print $1}' trainSample.txt))

  • Part 2: Extracting all lines belonging to a session on the list created in part 1

    1. Continuing from the awk method, I ran grep once per session ID, but this is WAY too slow (it would take days to complete on the 16 GB file); again -P is needed so \t matches a tab:

      for i in "${!myarr[@]}"
      do
        grep -P "^${myarr[$i]}\t" trainSample.txt >> sessions.txt
        echo -ne "Session $i\r"
      done
      
    2. Instead of running grep once per session ID as above, putting them all in one grep command as an alternation is MUCH faster (I ran it with 8 session IDs in a (1|2|3|..|8) pattern and it took the same time as a single ID did, i.e. 8X faster). However, I then need to figure out how to build that pattern dynamically.
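For what it's worth, the alternation can be built dynamically from the ID list with standard tools (a sketch; the file names and contents below are placeholders for the real ones):

```shell
# Build one grep alternation from a list of session IDs.
# sessions.txt: one session ID per line; trainSample.txt: the data file.
printf '0\n1\n' > sessions.txt
printf '0\tM\t4\t0\n1\tM\t7\t0\n2\tM\t15\t1\n' > trainSample.txt

# paste -s joins all lines into one, -d'|' sets the separator,
# so the pattern becomes e.g. ^(0|1)\t
pattern="^($(paste -sd'|' sessions.txt))\t"
grep -P "$pattern" trainSample.txt > user_sessions.txt
```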

Update

I have actually established a working solution which takes only seconds to complete, but it is somewhat messy and inflexible bash code which I have yet to extend to the second case (isolating by day).

asked Mar 09 '26 13:03 by user2071737

1 Answer

I want to extract all lines related to a specific user which for each session include all of the lines up until the next 'M' line is encountered (can be any number of lines).

$ awk '$2=="M"{p=$4==0}p' file
0   M   4   0   
0   0   Q   0   10047345    3080290,4098689 50504886,4217515    9848058,1084315 50534229,4217515    50591618,4217515    26242582,2597528    34623075,3279130    68893581,5149883    50628761,4217517    32262001,3142702    35443881,3339757
0   108 C   0   50628761
0   1080    C   0   50628761
1   M   7   0
1   0   Q   0   17143989 

As a second task I also want to extract all session lines related to a specific day.

$ awk '$2=="M"{p=$3==7}p' file
1   M   7   0
1   0   Q   0   17143989 
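Both one-liners can be generalized by passing the target value in with awk's -v instead of hard-coding it (a sketch against a tab-separated copy of the sample data; uid and d are illustrative variable names):

```shell
# Recreate the sample data (tab-separated fields).
printf '0\tM\t4\t0\n0\t108\tC\t0\t50628761\n1\tM\t7\t0\n1\t0\tQ\t0\t17143989\n2\tM\t15\t1\n2\t27\tC\t0\t31589696\n' > file

# On each 'M' line, turn printing on or off depending on the match;
# the bare p prints every line while the flag is set.
awk -v uid=0 '$2=="M"{p=($4==uid)}p' file   # all sessions for user 0
awk -v d=7   '$2=="M"{p=($3==d)}p'  file    # all sessions on day 7
```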
answered Mar 12 '26 03:03 by user000001