
AWK Split File every n-th Row but group IDs together

Tags: unix, split, awk

Let's assume I have the following file text.txt:

@something
@somethingelse
@anotherthing
1
2
2
3
3
3
4
4
4
5
5
6
7
7
8
9
9
9
10
11
11
11
14
15

I want to split this into multiple files after every 5th data row, but if the next row has the same number it should still end up in the same file. The header should be in every file, but it could also be dropped and reintroduced later.
This means something like this:

text.txt.1
@something
@somethingelse
@anotherthing
1
2
2
3
3
3

text.txt.2
@something
@somethingelse
@anotherthing
4
4
4
5
5

text.txt.3
@something
@somethingelse
@anotherthing
6
7
7
8
9
9
9

text.txt.4
@something
@somethingelse
@anotherthing
10
11
11
11
14

text.txt.5
@something
@somethingelse
@anotherthing
15

So I was thinking about something like this:

awk 'NR%5==1 && $1!=prev{i++;prev=$1}{print > FILENAME"."i}' text.txt

Each condition works on its own, but they do not work together. Is this possible using awk?

Borderline asked Jul 20 '21 13:07


3 Answers

Nice question.
With your example, this would work:

awk 'BEGIN{i=1;}/^@/{header= header == ""? $0 : header "\n" $0; next}c>=5 && $1!=prev{i++;c=0;}{if(!c) print header>FILENAME"."i; print > FILENAME"."i;c++;prev=$1;}' text.txt

You need to strip the header out and keep your own counter (c above). NR is just the current line number of the input, so NR%5 stops matching the file boundaries as soon as the header lines are counted or a group of identical IDs pushes a file past 5 rows.
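To see why NR cannot drive the split, here is a minimal sketch that prints NR%5 next to a separate counter that skips the header lines. The temporary file, the shortened input, and the variable name nr_demo are illustrative assumptions, not part of the original command:

```shell
# Illustrative only: a shortened copy of the sample input in a temp file.
tmp=$(mktemp)
printf '%s\n' @something @somethingelse @anotherthing 1 2 2 3 3 3 4 > "$tmp"

# For each data row print: the value, NR%5, and c (a counter that ignores headers).
nr_demo=$(awk '/^@/{next} {c++; print $1, NR%5, c}' "$tmp")
printf '%s\n' "$nr_demo"
rm -f "$tmp"
```

NR%5 is 1 on the third data row (NR is 6 because of the three header lines), which is in the middle of the group of 2s and not a fifth row at all.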

Break it up and improve a tiny bit:

awk 'BEGIN{i=1;}
  /^@/{header= header == ""? $0 : header ORS $0; next}
  c>=5 && $1!=prev{i++;c=0;}
  !c {print header>FILENAME"."i;}
  {print > FILENAME"."i;c++;prev=$1;}
  ' text.txt

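To address the potential problems mentioned in the comments, this version builds the output file name once in a variable (so the redirection target is unambiguous) and closes each file before opening the next: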

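awk 'BEGIN{i=1}
  # header lines start with @: collect them and skip to the next line
  /^@/{header= header == ""? $0 : header ORS $0; next}
  # 5 or more rows written and the ID changed: move on to the next file
  c>=5 && $1!=prev{i++;c=0}
  # first row of a file: close the previous file, then write the header
  !c {close(f);f=(FILENAME"."i);print header>f}
  {print>f;c++;prev=$1}
  ' text.txt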

Or see Ed's answer, which is more precise and portable across platforms and awk versions.

Tiw answered Oct 24 '22 17:10


Using any awk in any shell on every Unix box:

$ cat tst.awk
/^@/ {
    hdr = hdr $0 ORS
    next
}
( (++numLines) % 5 ) == 1 {
    if ( $0 == prev ) {
        --numLines
    }
    else {
        close(out)
        out = FILENAME "." (++numBlocks)
        printf "%s", hdr > out
        numLines = 1
    }
}
{
    print > out
    prev = $0
}

$ awk -f tst.awk text.txt

$ head text.txt.*
==> text.txt.1 <==
@something
@somethingelse
@anotherthing
1
2
2
3
3
3

==> text.txt.2 <==
@something
@somethingelse
@anotherthing
4
4
4
5
5

==> text.txt.3 <==
@something
@somethingelse
@anotherthing
6
7
7
8
9
9
9

==> text.txt.4 <==
@something
@somethingelse
@anotherthing
10
11
11
11
14

==> text.txt.5 <==
@something
@somethingelse
@anotherthing
15
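
The block size is hard-coded as 5 in tst.awk. A minimal sketch of the same script with the size passed in as a variable follows. The variable name n and the scratch-directory setup are assumptions added for illustration, not part of the original script:

```shell
# Illustrative setup: recreate the sample input in a scratch directory.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '%s\n' @something @somethingelse @anotherthing \
    1 2 2 3 3 3 4 4 4 5 5 6 7 7 8 9 9 9 10 11 11 11 14 15 > text.txt

# Same logic as tst.awk, with the hard-coded 5 replaced by n (hypothetical name).
awk -v n=5 '
/^@/ { hdr = hdr $0 ORS; next }           # collect header lines
( (++numLines) % n ) == 1 {               # first row past a full block of n rows
    if ( $0 == prev ) {
        --numLines                        # same ID: stay in the current file
    }
    else {
        close(out)
        out = FILENAME "." (++numBlocks)
        printf "%s", hdr > out
        numLines = 1
    }
}
{ print > out; prev = $0 }
' text.txt

wc -l text.txt.*
```

In this sketch n must be at least 2, because a remainder modulo 1 is never 1 and no output file would ever be opened.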

Ed Morton answered Oct 24 '22 16:10


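With your shown samples, please try the following awk program. It was written and tested in GNU awk and relies on PROCINFO["sorted_in"], which is gawk-specific; it also holds all data rows in memory and writes the groups in ascending numeric order.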

awk '
BEGIN{
  count=1
}
/^@/{
  header=(header?header ORS:"")$0          # collect the header lines
  next
}
{
  arr[$0]=(arr[$0]?arr[$0] ORS:"")$0       # group identical rows by value
}
END{
  PROCINFO["sorted_in"] = "@ind_num_asc"   # gawk only: visit groups in numeric order
  for(i in arr){
    if(len==0){                            # open a new file only when a group needs it
      outFile=(FILENAME "." count++)
      print header > outFile
    }
    print arr[i] > outFile
    len+=split(arr[i],arr2,ORS)            # add the number of rows in this group
    if(len>=5){                            # 5 or more rows written: finish this file
      close(outFile)
      len=0
    }
  }
}
' text.txt
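
Whichever variant is used, the grouping requirement can be checked afterwards. Below is a rough sketch, assuming the output files are named text.txt.1, text.txt.2, ... as in the question. The two small files created here are stand-ins for real output, and the names seen and split_ids are hypothetical:

```shell
# Stand-in output files for illustration; a real run would already have them.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '%s\n' @hdr 1 2 2 3 3 3 > text.txt.1
printf '%s\n' @hdr 4 4 4 5 5   > text.txt.2

# Report every ID that shows up in more than one output file.
split_ids=$(awk '
/^@/ { next }                              # ignore header lines
($0 in seen) && seen[$0] != FILENAME {     # ID was last seen in a different file
    print $0 " is in both " seen[$0] " and " FILENAME
}
{ seen[$0] = FILENAME }
' text.txt.*)

if [ -z "$split_ids" ]; then
    echo "no ID is split across files"
else
    printf '%s\n' "$split_ids"
fi
```

An empty split_ids means every ID ended up in exactly one file.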

RavinderSingh13 answered Oct 24 '22 17:10