I need to create a script that will load a CSV (sometimes tagged as .inf) into memory and evaluate the data for a type of duplicate. The CSV itself will always have different information in every field, but the columns will be the same: around 100 columns. In my examples, I'm going to narrow it to 10 columns for readability.
The "type" of duplicate I'm looking for is a bit odd. I first need to find all duplicates in column 2. Then I need to look at that set of duplicates and examine column 8 (in my actual CSV, it will be column 84). Looking at column 8, I only need to output rows that are:
A. Duplicate in column 2
B. Unique in column 8
If there are only 2 duplicates from column 2 and their column 8 values are the same, I don't need to see that. If there are 3 duplicates from column 2 and, of their column 8 values, 2 are the same and 1 is unique, I need to see all 3 FULL rows.
Desired input
m,123veh,john;doe,10/1/2019,ryzen,split,32929,38757ace,turn,left
m,123veh,john;doe,10/1/2019,ryzen,split,32929,495842,turn,left
m,837iec,john;doe,10/1/2019,ryzen,split,32929,12345,turn,left
m,837iec,john;doe,10/1/2019,ryzen,split,32929,12345,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,4978d87,turn,left
This data is constantly going to be different, and even the number of characters in column 8 may vary.
Desired output
m,123veh,john;doe,10/1/2019,ryzen,split,32929,38757ace,turn,left
m,123veh,john;doe,10/1/2019,ryzen,split,32929,495842,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,4978d87,turn,left
You can see from my desired output that I don't need the rows with 837iec: while their column 2 is duplicated, their column 8 values all match each other. For something like 382ork, two of the column 8 values match and one is unique, so I need to see all 3 rows.
I'll be using this on unix systems, and my desired way of using it would be to type "./scriptname filename.csv"; the output can either go to standard output or into a log file if needed.
I haven't been able to find a way to do this, as how I need to compare column 8 is confusing me. Any help would be greatly appreciated.
I found the script below in another thread. It at least gets me the full rows of the duplicates of column 2, though I don't fully understand how it's working.
#!/usr/bin/awk -f
BEGIN {
    # Split on commas so $2 really is column 2 of the csv
    FS = ",";
}
{
    # Remember every full line, grouped by column 2 (GNU awk array of arrays)
    lines[$2][NR] = $0;
}
END {
    for (vehid in lines) {
        if (length(lines[vehid]) > 1) {
            for (lineno in lines[vehid]) {
                # Print duplicate line for decision purposes
                print lines[vehid][lineno];
                # Alternative: print line number and line
                #print lineno, lines[vehid][lineno];
            }
        }
    }
}
The issue I have is that it doesn't take the next column into consideration. I was also worried it wouldn't handle blank columns well: my CSV will have ~100 columns, and ~50 of them might be entirely blank.
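From what I can tell, though, once FS is set to a comma, awk keeps empty fields in place rather than collapsing them, so entirely blank columns shouldn't shift the field numbers. A quick check on a made-up line (not my real data):

printf 'a,,c,,e\n' | awk 'BEGIN{FS=","} {print NF, "["$2"]", "["$4"]"}'
# prints: 5 [] []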
Could you please try the following.
awk '
BEGIN{
  FS=","
}
FNR==NR{
  a[$2]++
  b[$2,$8]++
  c[$2]=(c[$2]?c[$2] ORS:"")$0
  next
}
a[$2]>1 && b[$2,$8]==1{
  print c[$2]
  delete a[$2]
}' <(sort -t',' -k2 Input_file) <(sort -t',' -k2 Input_file)
With your shown example, the output is as follows.
m,123veh,john;doe,10/1/2019,ryzen,split,32929,38757ace,turn,left
m,123veh,john;doe,10/1/2019,ryzen,split,32929,495842,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,4978d87,turn,left
Explanation: adding a detailed explanation of the above code.
awk '                                 ##Start the awk program.
BEGIN{                                ##Start the BEGIN section.
  FS=","                              ##Set FS to a comma.
}                                     ##Close the BEGIN section.
FNR==NR{                              ##This condition is TRUE only while the first copy of Input_file is being read.
  a[$2]++                             ##Array a, indexed by $2, counts how many times each column-2 value occurs.
  b[$2,$8]++                          ##Array b, indexed by $2,$8, counts how many times each column-2/column-8 pair occurs.
  c[$2]=(c[$2]?c[$2] ORS:"")$0        ##Array c, indexed by $2, keeps appending the whole lines that share that column-2 value.
  next                                ##next skips all further statements.
}                                     ##Close the block for the FNR==NR condition.
a[$2]>1 && b[$2,$8]==1{               ##On the second pass: the column-2 value occurs more than once AND this $2,$8 pair occurs exactly once.
  print c[$2]                         ##Print all saved lines for this column-2 value.
  delete a[$2]                        ##Delete a[$2] so the same group is NOT printed again.
}' <(sort -t',' -k2 Input_file) <(sort -t',' -k2 Input_file)  ##Read the Input_file twice, sorted on the 2nd field, so all values come together before operating on them.
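Since the question asks for a ./scriptname filename.csv invocation, the one-liner can be wrapped in a small script. This is just a sketch; note that the <(...) process substitution is a bash feature, so the shebang must be bash rather than plain sh:

#!/bin/bash
# Usage: ./scriptname filename.csv
# The <(...) process substitution below requires bash.
awk '
BEGIN{
  FS=","
}
FNR==NR{
  a[$2]++
  b[$2,$8]++
  c[$2]=(c[$2]?c[$2] ORS:"")$0
  next
}
a[$2]>1 && b[$2,$8]==1{
  print c[$2]
  delete a[$2]
}' <(sort -t',' -k2 "$1") <(sort -t',' -k2 "$1")

Make it executable with chmod +x scriptname. Results go to standard output, so ./scriptname filename.csv > results.log (results.log being just an example name) captures them in a log file instead.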
A Python solution to this problem could be (here I used _id and qty for the 2 captured fields):
import csv
from collections import defaultdict

f = open('f1.txt', 'r')
d = defaultdict(lambda: defaultdict(list))
csv_reader = csv.reader(f)
for row in csv_reader:
    _id = row[1]
    qty = row[7]
    d[_id][qty].append(row)
f.close()

for _id in d:
    for qty in d[_id]:
        # if there is more than one 'qty' for this _id,
        # OR there is only one 'qty' and only one row under it
        # (allows a record with only 1 line)
        if len(d[_id]) > 1 or len(d[_id][qty]) == 1:
            for row in d[_id][qty]:
                print(','.join(row))
Prints:
m,123veh,john;doe,10/1/2019,ryzen,split,32929,38757ace,turn,left
m,123veh,john;doe,10/1/2019,ryzen,split,32929,495842,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,4978d87,turn,left
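Two things worth keeping in mind when pointing this at the real file (the script name below is my own example, not from the answer): Python indexes from 0, so column 2 is row[1] and column 84 would be row[83]; and the len(d[_id][qty]) == 1 branch deliberately lets through a column-2 value that appears only once, which goes slightly beyond the "duplicate in column 2" rule but makes no difference for the sample input. Saved as e.g. dupes.py, it runs the same way on a unix system:

python3 dupes.py                 # results to standard output
python3 dupes.py > results.log   # or capture them in a log file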