
Sort lines in text file between patterns

Tags: python, sed, awk

I am trying to sort the lines between patterns in Bash or in Python. Within each block, the lines should be sorted numerically on the second field, with "," as the delimiter.

Given the following text input file:

Sample1
T1,64,0.65  MEDIUM
T2,60,0.45  LOW
T3,301,0.68  MEDIUM
T4,65,0.75  HIGH
T5,59,0.72  MEDIUM
T6,51,0.82  HIGH
Sample2
T1,153,0.77  HIGH
T2,152,0.61  MEDIUM
T3,154,0.67  MEDIUM
T4,283,0.66  MEDIUM
T5,161,0.65  MEDIUM
Sample3
T1,147,0.71  MEDIUM
T2,154,0.63  MEDIUM
T3,45,0.63  MEDIUM
T4,259,0.77  HIGH

I expect as output:

Sample1
T6,51,0.82  HIGH
T5,59,0.72  MEDIUM
T2,60,0.45  LOW
T1,64,0.65  MEDIUM
T4,65,0.75  HIGH
T3,301,0.68  MEDIUM
Sample2
T2,152,0.61  MEDIUM
T1,153,0.77  HIGH
T3,154,0.67  MEDIUM
T5,161,0.65  MEDIUM
T4,283,0.66  MEDIUM
Sample3
T3,45,0.63  MEDIUM
T1,147,0.71  MEDIUM
T2,154,0.63  MEDIUM
T4,259,0.77  HIGH

I tried to adapt this suggestion by glenn jackman from another post, but as far as I have tested it only works for two patterns:

> gawk -v cmd="sort -k2" -v p=1 '
>     /^PATTERN2/ {          # when we see the 2nd marker:
>         close(cmd, "to");
>         while ((cmd |& getline line) >0) print line
>         p=1
>     }
>     p  {print}             # if p is true, print the line
>     !p {print |& cmd}      # if p is false, send the line to `sort`
>     /^PATTERN1/ {p=0}      # when we see the first marker, turn off printing
> ' FILE
Lfm_ asked Nov 14 '19 10:11



2 Answers

You can do this with GNU awk in the following way:

$ awk 'BEGIN{PROCINFO["sorted_in"]="@val_num_asc"; FS=","}
       /PATTERN/{
         for(i in a) print i
         delete a
         print; next
       }
       { a[$0]=$2 }
       END{ for(i in a) print i }' file

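Setting PROCINFO["sorted_in"]="@val_num_asc" tells GNU awk to traverse arrays in ascending numerical order of their values. The array a uses the full line as key and the second field as value, so each for(i in a) loop prints the lines of the current block in sorted order (identical lines within a block collapse into one entry). PATTERN stands for a regex matching the header lines; for the sample input that would be ^Sample. The second field is not used as the key because it may contain duplicates. It can still serve as the key if lines sharing the same value are chained together with ORS and the array is traversed in numerical order of its indices ("@ind_num_asc") instead of its values:

$ awk 'BEGIN{PROCINFO["sorted_in"]="@ind_num_asc"; FS=","}
       /PATTERN/{
         for(i in a) print a[i]
         delete a
         print; next
       }
       ($2 in a){ a[$2]=a[$2] ORS $0; next }
       { a[$2] = $0 }
       END{ for(i in a) print a[i] }' file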
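
For readers more at home in Python, the idea of using the second field as the key and chaining duplicates can be sketched roughly as below. This is only an illustration, not part of the awk solution: the function name sort_block_by_field_key is hypothetical, and the row T9,60,0.10  LOW is made up to show what happens when two rows share the same second field.

```python
def sort_block_by_field_key(rows):
    """Key = second field, value = all rows of the block that share it."""
    by_field = {}
    for row in rows:
        field = row.split(",")[1]
        if field in by_field:
            # Chain rows with the same second field, like a[$2] = a[$2] ORS $0 in awk.
            by_field[field] = by_field[field] + "\n" + row
        else:
            by_field[field] = row
    # Visit the keys in ascending numeric order (the job PROCINFO["sorted_in"] does in gawk).
    return [by_field[key] for key in sorted(by_field, key=float)]


# Hypothetical block: the T9 row is invented to create a duplicate second field.
rows = ["T1,64,0.65  MEDIUM", "T2,60,0.45  LOW", "T9,60,0.10  LOW"]
print("\n".join(sort_block_by_field_key(rows)))
```

Rows that share a second field stay in their original relative order, because each new row is appended to the end of the chained value.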
kvantour answered Oct 16 '22 13:10



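Please see the function below.

from itertools import groupby

def sort_lines_by_second_field(source_filename: str, destination_filename: str):
    with open(source_filename) as source:
        lines = source.readlines()
    with open(destination_filename, "w") as destination:
        # Consecutive lines are grouped into data rows (contain a comma) and header lines.
        for is_data, group in groupby(lines, key=lambda row: ',' in row):
            if is_data:  # sort only the data rows, numerically on the second field
                group = sorted(group, key=lambda row: int(row.split(',')[1]))
            destination.writelines(group)

It reads all lines, splits them into runs of header lines (no comma) and data lines, sorts each run of data lines by the second field cast to an integer, and writes the result to the target file. Header lines such as Sample1 are written unchanged, so every block stays under its own header.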
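
To try the per-block sorting without touching any files, a minimal in-memory sketch could look like this. It assumes that header lines start with Sample, as in the question's input; sort_blocks is a hypothetical helper name, and the sample list is a shortened excerpt of the question's data.

```python
def sort_blocks(lines):
    """Sort data rows numerically on the second comma-separated field, block by block."""
    result, block = [], []

    def flush():
        # Emit the rows collected so far in ascending order of field 2.
        result.extend(sorted(block, key=lambda row: int(row.split(",")[1])))
        block.clear()

    for line in lines:
        if line.startswith("Sample"):  # assumption: headers look like "Sample1"
            flush()
            result.append(line)
        else:
            block.append(line)
    flush()  # the last block is not followed by another header
    return result


sample = [
    "Sample1", "T1,64,0.65  MEDIUM", "T2,60,0.45  LOW", "T3,301,0.68  MEDIUM",
    "Sample2", "T1,153,0.77  HIGH", "T2,152,0.61  MEDIUM",
]
print("\n".join(sort_blocks(sample)))
```

Casting the field to int matters here: a plain string sort would place 301 before 60.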

Piotr Grzybowski answered Oct 16 '22 11:10
