Hi, I have a ton of data in multiple csv files, and I filter out a data set using grep:
user@machine:~/$ cat data.csv | grep -a "63[789]\...;"
637.05;1450.2
637.32;1448.7
637.60;1447.7
637.87;1451.5
638.14;1454.2
638.41;1448.6
638.69;1445.8
638.96;1440.0
639.23;1431.9
639.50;1428.8
639.77;1427.3
I want to find the data set with the highest count (the column right of the ;) and then know the corresponding value (left of the ;). In this case the set I'm looking for would be 638.14;1454.2.
I tried different things and ended up using a combination of bash and Python, which works but isn't very pretty:
import csv
import os

os.system('ls | grep csv > filelist')
files = open("filelist")
files = files.read()
files = files.split("\n")
for filename in files[0:-1]:
    os.system('cat ' + filename + ' | grep -a "63[6789]\...;" > filtered.csv')
    filtered = csv.reader(open('filtered.csv'), delimiter=';')
    # sort numerically on the second column, highest value first
    sortedlist = sorted(filtered, key=lambda row: float(row[1]), reverse=True)
    dataset = sortedlist[0][0] + ';' + sortedlist[0][1] + '\n'
I would love to have a bash-only solution (cut, awk, arrays?!?) but couldn't figure it out. I also don't like the workaround of writing the output of the bash commands to files and then reading those files into Python variables. Can I read them into variables directly, or are there better solutions to this problem? (Probably Perl etc., but I am really interested in a bash solution.)
Thank you very much!!
A quick one-liner would be:
grep -a "63[789]\...;" data.csv | sort -n -r -t ';' -k 2 | head --lines=1
This sorts the filtered rows numerically, in descending order, on the second ;-separated field and then prints the first row, i.e. the one with the highest count. Hope that helps.
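This also covers the "read them into variables directly" part of the question: the output of a pipeline can be captured straight into a shell variable with command substitution, so no temporary files are needed. A minimal sketch (the variable names best, value, and count are just illustrative):

best=$(grep -a "63[789]\...;" data.csv | sort -n -r -t ';' -k 2 | head --lines=1)
value=${best%%;*}    # part left of the ;  -> 638.14
count=${best##*;}    # part right of the ; -> 1454.2
echo "$value has the highest count $count"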
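Since the data is spread over multiple csv files, the filtering and the maximum search can also be done in a single awk pass, with no sort and no temporary files. A sketch, assuming the files of interest match *.csv in the current directory:

awk -F ';' '/63[789]\...;/ && (best == "" || $2 + 0 > max) { max = $2 + 0; best = $0 }
            END { print best }' *.csv

This keeps the line with the largest second field across all files; if you want one result per file, as in the Python loop above, run it once per file instead (for f in *.csv; do ... done).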