I have many (nearly 100) big CSV files with a sellID in the first column. I know that some sellIDs are repeated two or more times across two or more files. Is it possible to find all these duplicate sellIDs with grep (i.e., create a sellID → file_name map)? Or is there another open-source application for this purpose? My OS is CentOS.
Here's a very simple, somewhat crude awk script that accomplishes something close to what you're describing:
#!/usr/bin/awk -f
{
    if ($1 in seenbefore) {
        printf("%s\t%s\n", $1, seenbefore[$1]);
        printf("%s\t%s\n", $1, FILENAME);
    }
    seenbefore[$1] = FILENAME;
}
As you can hopefully surmise, all we're doing is building an associative array of each value found in the first column/field (set FS in the BEGIN special block to change the input field separator, for a trivially naive form of CSV support). Whenever we encounter a duplicate we print the duplicated value, the file we previously saw it in, and the current filename. In any event we then add/update the array entry with the current file's name.
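To make the FS tweak mentioned above concrete, here is the same logic as a one-liner with the separator set to a comma (a naive CSV assumption: no quoted fields containing commas; the `*.csv` glob is also an assumption about your filenames):

```shell
# Same idea as the script above, but splitting fields on commas.
# On a duplicate, print the ID with the previous file, then with the
# current file; always remember the most recent file for each ID.
awk 'BEGIN { FS = "," }
     $1 in seenbefore { print $1 "\t" seenbefore[$1]; print $1 "\t" FILENAME }
     { seenbefore[$1] = FILENAME }' *.csv
```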
With more code you could store and print the line numbers of each occurrence, append filename/line-number tuples to a list, move all the output to an END block where you summarize it in a more concise format, and so on.
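A minimal sketch of that END-block variant, still in awk (assuming comma-separated input with the ID in column 1; the location strings are plain concatenated text, since awk lacks real tuples):

```shell
# Collect every file:line location per ID, then in the END block print
# only the IDs that appeared more than once, with all their locations.
awk -F, '{ locs[$1] = locs[$1] (locs[$1] ? ", " : "") FILENAME ":" FNR
           count[$1]++ }
     END { for (id in count)
               if (count[id] > 1)
                   printf("%s\t%s\n", id, locs[id]) }' *.csv
```

FNR (the per-file line number) is what makes the file:line pairs cheap to build here; the trade-off versus the first script is that nothing is reported until all input has been read.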
For any of that I'd personally shift to Python, where the data types are richer (actual lists and tuples rather than having to concatenate them into strings or build an array of arrays) and where I'd have access to much more power (an actual CSV parser that can handle the various flavors of quoted CSV and alternative delimiters, and where producing sorted results is trivially easy).
However, this should, hopefully, get you on the right track.
Related question: https://serverfault.com/questions/66301/removing-duplicate-lines-from-file-with-grep
You could cat all the files into a single one and then look for dupes as suggested in the link above.
BTW, it is not clear whether you want to keep only the dupes or remove them.
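One way to realize that suggestion as a pipeline (assuming comma-separated files with the sellID in column 1, all matching `*.csv`):

```shell
# Concatenate the first column of every file, sort it, and let uniq -d
# report each value that occurs more than once.
cut -d, -f1 *.csv | sort | uniq -d
```

Note that this tells you *which* sellIDs are duplicated but drops the sellID-to-filename mapping the question asks for; the awk approaches above keep it.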