AWK Split File every n-th Row but group IDs together

Question

Let's assume I have the following file, text.txt:

@something
@somethingelse
@anotherthing
1
2
2
3
3
3
4
4
4
5
5
6
7
7
8
9
9
9
10
11
11
11
14
15

I want to split this into multiple files after every 5th data row, but if the next row has the same number as the current one, it should still end up in the same file. The header should appear in every file, though it could also be dropped and reintroduced later.
The result should look like this:

text.txt.1
@something
@somethingelse
@anotherthing
1
2
2
3
3
3

text.txt.2
@something
@somethingelse
@anotherthing
4
4
4
5
5

text.txt.3
@something
@somethingelse
@anotherthing
6
7
7
8
9
9
9

text.txt.4
@something
@somethingelse
@anotherthing
10
11
11
11
14

text.txt.5
@something
@somethingelse
@anotherthing
15

So I was thinking about something like this:

awk 'NR%5==1 && $1!=prev{i++;prev=$1}{print > FILENAME"."i}' text.txt

Each condition works by itself, but not the two together. Is this possible using awk?

Tiw · Accepted Answer

Nice question.
With your example, this would work:

awk 'BEGIN{i=1;}/\@/{header= header == ""? $0 : header "\n" $0; next}c>=5 && $1!=prev{i++;c=0;}{if(!c) print header>FILENAME"."i; print > FILENAME"."i;c++;prev=$1;}' test.txt

You need to strip the header out and keep your own counter (c above). NR is just the current line number of the input, so it will not meet your needs when the chunks do not end on exact multiples of 5.
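To see why NR alone is not enough, here is a minimal sketch. The scratch file demo.txt and its contents are assumptions made up for illustration, not part of the question. It prints NR next to a counter c that only counts data rows, so the offset introduced by the header lines becomes visible:

```shell
# demo.txt is a hypothetical scratch file with three header lines
printf '%s\n' '@h1' '@h2' '@h3' 1 2 2 3 3 3 4 > demo.txt

# Skip header lines, then show NR next to a counter of data rows only.
awk '/^@/{next} {c++; print "NR=" NR, "c=" c, "value=" $1}' demo.txt
```

The first data row is reported with NR=4 and c=1, so a test like NR%5==1 would fire on the wrong rows even before any group of identical IDs stretches a chunk past five rows.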

Broken up and improved a tiny bit:

awk 'BEGIN{i=1;}
  /\@/{header= header == ""? $0 : header ORS $0; next}
  c>=5 && $1!=prev{i++;c=0;}
  !c {print header>FILENAME"."i;}
  {print > FILENAME"."i;c++;prev=$1;}
  ' test.txt

To solve the potential problems mentioned in the comment:

awk 'BEGIN{i=1}
  /\@/{header= header == ""? $0 : header ORS $0; next}
  c>=5 && $1!=prev{i++;c=0}
  !c {close(f);f=(FILENAME"."i);print header>f}
  {print>f;c++;prev=$1}
  ' test.txt

Or check Ed's answer, which is more precise and compatible with different platforms and awk versions.

Ed Morton · Answer

Using any awk in any shell on every Unix box:

$ cat tst.awk
/^@/ {
    hdr = hdr $0 ORS
    next
}
( (++numLines) % 5 ) == 1 {
    if ( $0 == prev ) {
        --numLines
    }
    else {
        close(out)
        out = FILENAME "." (++numBlocks)
        printf "%s", hdr > out
        numLines = 1
    }
}
{
    print > out
    prev = $0
}

$ awk -f tst.awk text.txt

$ head text.txt.*
==> text.txt.1 <==
@something
@somethingelse
@anotherthing
1
2
2
3
3
3

==> text.txt.2 <==
@something
@somethingelse
@anotherthing
4
4
4
5
5

==> text.txt.3 <==
@something
@somethingelse
@anotherthing
6
7
7
8
9
9
9

==> text.txt.4 <==
@something
@somethingelse
@anotherthing
10
11
11
11
14

==> text.txt.5 <==
@something
@somethingelse
@anotherthing
15
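A possible variation, offered only as a sketch: the chunk size can be passed in with -v instead of being hard-coded. The variable names n and cnt and the scratch file sample.txt are assumptions for illustration and do not come from the answer above. The splitting rule is the same: start a new file only once at least n data rows have been written and the ID changes.

```shell
# sample.txt is a hypothetical scratch file for illustration
printf '%s\n' '@something' 1 2 2 3 3 3 4 4 4 5 5 6 > sample.txt

awk -v n=3 '
/^@/ { hdr = hdr $0 ORS; next }   # collect header lines
cnt >= n && $0 != prev {          # chunk is full and the ID changed
    close(out)
    cnt = 0
}
cnt == 0 {                        # first data row of a new chunk
    out = FILENAME "." (++numBlocks)
    printf "%s", hdr > out
}
{ print > out; cnt++; prev = $0 }
' sample.txt
```

With n=3 this should leave four files, sample.txt.1 through sample.txt.4, each starting with the header line. Running it with -v n=5 against the file from the question should give the same five-file split shown above.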

RavinderSingh13 · Answer

With your shown samples, please try the following awk program. It was written and tested in GNU awk, and it relies on PROCINFO["sorted_in"], which is specific to GNU awk.

awk '
BEGIN{
  outFile="test.txt"
  count=1
}
/@/{
  header=(header?header ORS:"")$0
  next
}
{
  arr[$0]=(arr[$0]?arr[$0] ORS:"")$0
}
END{
  PROCINFO["sorted_in"] = "@ind_num_asc"
  print header > (outFile count)
  for(i in arr){
    num=split(arr[i],arr2,"\n")
    print arr[i] > (outFile count)
    len+=num
    if(len>=5){ len=0 }
    if(len==0){
      close(outFile count)
      count++
      print header > (outFile count)
    }
  }
}
'  Input_file
