Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to remove a subcolumn from a nested-csv file?

Given a space separated file as such:

0.0 1:0.000000 2:1.000000 3:0.000000 4:0.000000 5:1.000000 6:0.000000 7:1.000000 8:1.000000 9:1.000000 10:1.000000 11:1.000000 12:1.000000 13:1.000000 14:1.000000 15:0.919033 16:1.000000 17:1.000000 18:1.000000 19:1.000000 20:0.000000 21:0.037771
0.0 1:0.000000 2:1.000000 3:0.000000 4:0.000000 5:1.000000 6:0.000000 7:1.000000 8:0.800000 9:0.666667 10:1.000000 11:0.800000 12:0.666667 13:1.000000 14:0.875000 15:0.874574 16:0.848662 17:0.901802 18:0.938795 19:0.903077 20:0.333332 21:0.196682
0.0 1:1.098612 2:1.000000 3:1.000000 4:0.000000 5:1.000000 6:0.000000 7:1.000000 8:0.800000 9:0.500000 10:0.000000 11:0.800000 12:0.500000 13:0.000000 14:0.909091 15:0.780985 16:0.792052 17:0.865396 18:0.863982 19:0.832962 20:0.000000 21:0.069470
0.0 1:0.000000 2:1.000000 3:0.000000 4:0.000000 5:1.000000 6:0.000000 7:1.000000 8:0.923077 9:0.909091 10:0.888889 11:0.923077 12:0.909091 13:0.888889 14:0.943396 15:0.923562 16:0.923871 17:0.949357 18:0.950790 19:0.944919 20:0.142857 21:0.140054

The first columns are all 0.0 and we want to throw that first column away. Then for each column, there's a colon separating the key from its value. And the goal is only to keep the value.

I can do it as such in python:

with io.open(infile, 'r') as fin:
    for line in fin:
        line = line.split()[1:]
        line = '\t'.join([i.split(':')[1] for i in line])
        print line

[out]:

0.000000    1.000000    0.000000    0.000000    1.000000    0.000000    1.000000    1.000000    1.000000    1.000000    1.0000001.000000    1.000000    1.000000    0.919033    1.000000    1.000000    1.000000    1.000000    0.000000    0.037771
0.000000    1.000000    0.000000    0.000000    1.000000    0.000000    1.000000    0.800000    0.666667    1.000000    0.8000000.666667    1.000000    0.875000    0.874574    0.848662    0.901802    0.938795    0.903077    0.333332    0.196682
1.098612    1.000000    1.000000    0.000000    1.000000    0.000000    1.000000    0.800000    0.500000    0.000000    0.8000000.500000    0.000000    0.909091    0.780985    0.792052    0.865396    0.863982    0.832962    0.000000    0.069470
0.000000    1.000000    0.000000    0.000000    1.000000    0.000000    1.000000    0.923077    0.909091    0.888889    0.9230770.909091    0.888889    0.943396    0.923562    0.923871    0.949357    0.950790    0.944919    0.142857    0.140054

But how is the same possible on the unix command line? (maybe with sed, awk, perl -c or even python -c or anything else) Imagine it's a large file, so please don't load the whole file it memory, unless there's an economical reason for that.

like image 513
alvas Avatar asked Mar 14 '23 11:03

alvas


2 Answers

With GNU sed:

sed 's/^0.0 //;s/[0-9]\+:\([0-9.]\+\)/\1/g' file

Output:

0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.919033 1.000000 1.000000 1.000000 1.000000 0.000000 0.037771
0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 1.000000 0.800000 0.666667 1.000000 0.800000 0.666667 1.000000 0.875000 0.874574 0.848662 0.901802 0.938795 0.903077 0.333332 0.196682
1.098612 1.000000 1.000000 0.000000 1.000000 0.000000 1.000000 0.800000 0.500000 0.000000 0.800000 0.500000 0.000000 0.909091 0.780985 0.792052 0.865396 0.863982 0.832962 0.000000 0.069470
0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 1.000000 0.923077 0.909091 0.888889 0.923077 0.909091 0.888889 0.943396 0.923562 0.923871 0.949357 0.950790 0.944919 0.142857 0.140054

If you want to edit your file "in place" add sed's option -i.


Two sed scripts separated with ;:

s/^0.0 //: search from start of line (^) for 0.0 followed by a whitespace and replace it by nothing

s/[0-9]\+:\([0-9.]\+\)/\1/g: search for any at least one (+) character from range 0 to 9 followed by : and followed by at least one (+) character from range 0 to 9 or a . and replace it by matching part in round brackets. \1 is the back references to matching part in round brackets. g means global to apply the replacement to all matches to the regexp, not just the first. It is necessary to escape special characters (here: +, (, )) for sed with a \.

Shorter version:

sed 's/^0.0 //;s/[0-9]\+://g' file

See: The Stack Overflow Regular Expressions FAQ

like image 170
Cyrus Avatar answered Mar 20 '23 00:03

Cyrus


awk to the rescue!

$ awk '{gsub("[^ ]*:","");sub("[^ ]* ","")}1' file 

0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.919033 1.000000 1.000000 1.000000 1.000000 0.000000 0.037771
0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 1.000000 0.800000 0.666667 1.000000 0.800000 0.666667 1.000000 0.875000 0.874574 0.848662 0.901802 0.938795 0.903077 0.333332 0.196682
1.098612 1.000000 1.000000 0.000000 1.000000 0.000000 1.000000 0.800000 0.500000 0.000000 0.800000 0.500000 0.000000 0.909091 0.780985 0.792052 0.865396 0.863982 0.832962 0.000000 0.069470
0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 1.000000 0.923077 0.909091 0.888889 0.923077 0.909091 0.888889 0.943396 0.923562 0.923871 0.949357 0.950790 0.944919 0.142857 0.140054
like image 23
karakfa Avatar answered Mar 19 '23 23:03

karakfa