 

Finding duplicates, and uniques among the duplicates, in a CSV

I need to create a script that will load a CSV (sometimes tagged as .inf) into memory and evaluate the data for a particular type of duplicate. The CSV itself will always have different information in every field, but the columns will be the same, around ~100 of them. In my examples, I'm going to narrow it to 10 columns for readability.

The "type" of duplicate I'm looking for is a bit odd. I first need to find all duplicates in column 2. Then I need to look at that set of duplicates and examine column 8 (in my actual CSV it will be column 84). Looking at column 8, I need to output only data that is:

A. Duplicate in column 2

B. Unique in column 8

There may be only 2 duplicates from column 2 whose column 8 values are the same; I don't need to see those. If there are 3 duplicates from column 2 and, of their column 8 values, 2 are the same and 1 is unique, I need to see all 3 FULL rows.

Desired input
m,123veh,john;doe,10/1/2019,ryzen,split,32929,38757ace,turn,left
m,123veh,john;doe,10/1/2019,ryzen,split,32929,495842,turn,left
m,837iec,john;doe,10/1/2019,ryzen,split,32929,12345,turn,left
m,837iec,john;doe,10/1/2019,ryzen,split,32929,12345,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,4978d87,turn,left

This data is constantly going to be different, and even the number of characters in column 8 may vary.

Desired output
m,123veh,john;doe,10/1/2019,ryzen,split,32929,38757ace,turn,left
m,123veh,john;doe,10/1/2019,ryzen,split,32929,495842,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,4978d87,turn,left 

You can see from my desired output that I don't need to see the rows with 837iec: while their column 2 is a duplicate, their column 8 values match each other, so I don't need to see them. Whereas for something like 382ork, 2 of the column 8 values match and one is unique, so I need to see all 3.

I'll be using this on Unix systems, and my desired way of using it would be to type "./scriptname filename.csv"; the output can go either to standard output or into a log file if needed.

I haven't been able to find a way to do this, as how I need to compare column 8 is confusing me. Any help would be greatly appreciated.

I found this in another thread; it at least gets me the full rows of the duplicates of column 2, though I don't fully understand how it works.

#!/usr/bin/awk -f
# NOTE: uses arrays of arrays, so this needs GNU awk (gawk).
BEGIN {
    FS = ","                            # fields are separated by commas
}
{
    lines[$2][NR] = $0;                 # group full rows by column 2, keyed by input line number
}
END {
    for (vehid in lines) {
        if (length(lines[vehid]) > 1) {             # only groups where column 2 occurs more than once
            for (lineno in lines[vehid]) {
                # Print duplicate line for decision purposes
                print lines[vehid][lineno];
                # Alternative: print line number and line
                #print lineno, lines[vehid][lineno];
            }
        }
    }
}

The issue I have is that it doesn't take the next column into consideration. It also does not handle blank columns well. My CSV will have ~100 columns, and ~50 of them might be entirely blank.

asked Dec 03 '19 by T_Wrong


2 Answers

Could you please try the following.

awk '
BEGIN{
  FS=","
}
FNR==NR{
  a[$2]++
  b[$2,$8]++
  c[$2]=(c[$2]?c[$2] ORS:"")$0
  next
}
a[$2]>1 && b[$2,$8]==1{
  print c[$2]
  delete a[$2]
}' <(sort -t',' -k2 Input_file) <(sort -t',' -k2 Input_file)

With your shown example, the output is as follows.

m,123veh,john;doe,10/1/2019,ryzen,split,32929,38757ace,turn,left
m,123veh,john;doe,10/1/2019,ryzen,split,32929,495842,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,4978d87,turn,left

Explanation: a detailed explanation of the above code.

awk '                                                     ##Start the awk program.
BEGIN{                                                    ##Start the BEGIN section.
  FS=","                                                  ##Set FS to a comma.
}                                                         ##Close the BEGIN section.
FNR==NR{                                                  ##FNR==NR is TRUE only while the first copy of Input_file is being read.
  a[$2]++                                                 ##Array a, indexed by $2, counts how many times each column-2 value occurs.
  b[$2,$8]++                                              ##Array b, indexed by $2,$8, counts how many times each column-2/column-8 pair occurs.
  c[$2]=(c[$2]?c[$2] ORS:"")$0                            ##Array c, indexed by $2, keeps appending the full lines that share that column-2 value.
  next                                                    ##next skips all further statements for this line.
}                                                         ##Close the block for the FNR==NR condition.
a[$2]>1 && b[$2,$8]==1{                                   ##Second pass: column 2 occurs more than once AND this column-2/column-8 pair occurs exactly once.
  print c[$2]                                             ##Print all saved lines for this column-2 value.
  delete a[$2]                                            ##Delete a[$2] so the same group is not printed again for its other lines.
}' <(sort -t',' -k2 Input_file) <(sort -t',' -k2 Input_file)  ##Feed Input_file twice, sorted on the 2nd field, so all values for a key come together before operating on them.
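Since you want to run it as ./scriptname filename.csv, here is a minimal wrapper sketch, assuming bash is available (for the process substitution) and that the filename is passed as the first argument; scriptname and the variable csv are just illustrative names.

#!/bin/bash
# Minimal wrapper sketch around the awk program above.
# Usage: ./scriptname filename.csv
csv="$1"

awk '
BEGIN{
  FS=","
}
FNR==NR{
  a[$2]++
  b[$2,$8]++
  c[$2]=(c[$2]?c[$2] ORS:"")$0
  next
}
a[$2]>1 && b[$2,$8]==1{
  print c[$2]
  delete a[$2]
}' <(sort -t',' -k2 "$csv") <(sort -t',' -k2 "$csv")

Make it executable with chmod +x scriptname; redirecting standard output (./scriptname filename.csv > output.log) gives the log-file option mentioned in the question.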
answered Oct 02 '22 by RavinderSingh13

A Python solution to this problem could be (here I used _id and qty for the 2 captured fields):

import csv
from collections import defaultdict

d = defaultdict(lambda: defaultdict(list))

# group the full rows first by column 2 (_id), then by column 8 (qty)
with open('f1.txt', 'r', newline='') as f:
    for row in csv.reader(f):
        _id = row[1]
        qty = row[7]
        d[_id][qty].append(row)

for _id in d:
    for qty in d[_id]:
        # print the group when this _id has more than one distinct 'qty'
        # (column 2 duplicated with differing column 8 values);
        # the second test also keeps an _id that appears on only a single line
        if len(d[_id]) > 1 or len(d[_id][qty]) == 1:
            for row in d[_id][qty]:
                print(','.join(row))

Prints:

m,123veh,john;doe,10/1/2019,ryzen,split,32929,38757ace,turn,left
m,123veh,john;doe,10/1/2019,ryzen,split,32929,495842,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,4978d87,turn,left
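
To match the ./scriptname filename.csv invocation from the question, a hedged sketch of the same idea that takes the filename from the command line could look like the following. The column indexes 1 and 7 are the example's column 2 and column 8 (they would become 1 and 83 for the real file), and this version only prints groups whose column 2 is duplicated, so it drops the single-line case kept by the code above.

#!/usr/bin/env python3
# Sketch: group rows by column 2, then by column 8, and print only the groups
# where column 2 is duplicated and the column 8 values are not all identical.
# Usage: ./scriptname filename.csv   (redirect with "> output.log" for a log file)
import csv
import sys
from collections import defaultdict

ID_COL, QTY_COL = 1, 7          # 0-based: column 2 and column 8 (83 for column 84)

d = defaultdict(lambda: defaultdict(list))

with open(sys.argv[1], newline='') as f:
    for row in csv.reader(f):
        d[row[ID_COL]][row[QTY_COL]].append(row)

for _id in d:
    if len(d[_id]) > 1:          # more than one distinct column-8 value for this column-2 value
        for qty in d[_id]:
            for row in d[_id][qty]:
                print(','.join(row))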
answered Oct 02 '22 by Chris Charley