
sort | uniq | xargs grep ... where lines contain spaces

I have a comma-delimited file "myfile.csv" where the 5th column is a date/time stamp (mm/dd/yyyy hh:mm). I need to list all the rows that contain duplicate dates (there are lots).

I'm using a bash shell via Cygwin on WinXP.

$ cut -d, -f 5 myfile.csv | sort | uniq -d 

correctly returns a list of the duplicate dates:

01/01/2005 00:22
01/01/2005 00:37
[snip]    
02/29/2009 23:54

But I cannot figure out how to feed this to grep to get all the rows. Obviously, I can't use xargs straight up, since the output contains spaces. I thought I could use uniq -z -d, but for some reason combining those flags causes uniq to (apparently) return nothing.

So, given that

 $ cut -d, -f 5 myfile.csv | sort | uniq -d -z | xargs -0 -I {} grep '{}' myfile.csv

doesn't work... what can I do?

I know that I could do this in Perl or another scripting language, but my stubborn nature insists that I should be able to do it in bash using standard command-line tools like sort, uniq, find, grep, cut, etc.

Teach me, oh bash gurus. How can I get the list of rows I need using typical cli tools?

Sukotto asked Dec 08 '22 08:12


1 Answer

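  1. sort -t, -k5,5 sorts on the 5th comma-separated field and avoids the cut (without -t, sort splits fields on whitespace);
  2. uniq -f 4 ignores the first 4 fields when comparing lines;
  3. adding -D to uniq prints all of the repeated lines (vs. -d, which prints just one per group);
  4. but uniq splits fields on blanks (spaces and tabs), not commas, so convert the commas with tr ',' '\t' before the data reaches uniq.

The problem is that uniq compares everything from field 5 to the end of the line, so rows whose later fields differ will not count as duplicates. Are your dates all the same length? If so, you can add -w to limit the comparison. After skipping fields, the comparison starts at the blank in front of field 5, so count that character too: -w 17 to include the time, or -w 11 for just the date. This also assumes that none of the first four fields is empty or contains spaces, since either would throw off the field count.

So (the final tr restores the commas, assuming the data has no literal tabs):

sort -t, -k5,5 myfile.csv | tr ',' '\t' | uniq -f 4 -D -w 17 | tr '\t' ','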
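
If the uniq field handling gets too fiddly, here is a minimal sketch of two alternatives that stick to standard tools. The function names dup_rows_grep and dup_rows_awk are made up for illustration, and both assume plain CSV with no quoted fields containing commas. The first feeds the duplicate dates to grep as fixed-string patterns through -f -, which sidesteps the xargs quoting problem (this assumes a grep that accepts - for stdin, as GNU grep does). The second counts the column 5 values in one awk pass and prints the matching rows in a second pass.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from the original post).
# Both print every row of the CSV file "$1" whose 5th column occurs more than once.
# Assumption: simple CSV, no quoted fields containing commas.

# Variant 1: list the duplicated dates, then hand them to grep as fixed strings.
# Note: this matches a duplicated date anywhere on the line, not only in column 5.
dup_rows_grep() {
    cut -d, -f5 "$1" | sort | uniq -d | grep -F -f - "$1"
}

# Variant 2: read the file twice with awk.
# First pass counts each column-5 value; second pass prints rows whose count exceeds 1.
dup_rows_awk() {
    awk -F, 'NR == FNR { count[$5]++; next } count[$5] > 1' "$1" "$1"
}
```

Usage would be something like dup_rows_awk myfile.csv. Both variants keep the rows in their original file order, and the awk one compares exactly column 5, so differing later fields do not matter.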
Andrew Barnett answered Dec 28 '22 03:12