 

Finding duplicates, and uniques among the duplicates, in a CSV

I need to create a script that will load a CSV (sometimes tagged as .inf) into memory and evaluate the data for a particular type of duplicate. The CSV itself will always have different information in every field, but the columns will be the same, around ~100 of them. In my examples, I'm going to narrow it to 10 columns for readability.

The "type" of duplicate I'm looking for is a bit odd. I first need to find all duplicates in column 2. Then I need to look at that set of duplicates and examine column 8 (in my actual CSV it will be column 84). Looking at column 8, I need to output only data that is:

A. Duplicate in column 2

B. Unique in column 8

There may be only 2 duplicates from column 2 whose column 8 values are the same; I don't need to see those. If there are 3 duplicates from column 2 and, of their column 8 values, 2 are the same and 1 is unique, I need to see all 3 FULL rows.

Desired input
m,123veh,john;doe,10/1/2019,ryzen,split,32929,38757ace,turn,left
m,123veh,john;doe,10/1/2019,ryzen,split,32929,495842,turn,left
m,837iec,john;doe,10/1/2019,ryzen,split,32929,12345,turn,left
m,837iec,john;doe,10/1/2019,ryzen,split,32929,12345,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,4978d87,turn,left

This data is constantly going to be different, and even the number of characters in column 8 may vary.

Desired output
m,123veh,john;doe,10/1/2019,ryzen,split,32929,38757ace,turn,left
m,123veh,john;doe,10/1/2019,ryzen,split,32929,495842,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,4978d87,turn,left 

You can see from my desired output that I don't need to see the rows with 837iec: while their column 2 is a duplicate, their column 8 values match each other, so I don't need to see them. Whereas for something like 382ork, 2 of the column 8 values match and one is unique, so I need to see all 3.

I'll be using this on Unix systems, and my desired way of using it would be to type "./scriptname filename.csv"; the output can go either to standard output or into a log file if needed.

I haven't been able to find a way to do this, as how I need to compare column 8 is confusing me. Any help would be greatly appreciated.

I found this in another thread; it at least gets me the full rows of the duplicates of column 2, though I don't fully understand how it works.

#!/usr/bin/awk -f
# NOTE: uses arrays of arrays, so this needs GNU awk (gawk).
BEGIN {
    FS = ","                            # fields are separated by commas
}
{
    lines[$2][NR] = $0;                 # group full rows by column 2, keyed by input line number
}
END {
    for (vehid in lines) {
        if (length(lines[vehid]) > 1) {             # only groups where column 2 occurs more than once
            for (lineno in lines[vehid]) {
                # Print duplicate line for decision purposes
                print lines[vehid][lineno];
                # Alternative: print line number and line
                #print lineno, lines[vehid][lineno];
            }
        }
    }
}

The issue I have is that it doesn't take the next column into consideration. It also does not handle blank columns well. My CSV will have ~100 columns, and ~50 of them might be entirely blank.

asked Dec 03 '19 by T_Wrong


2 Answers

Could you please try the following.

awk '
BEGIN{
  FS=","
}
FNR==NR{
  a[$2]++
  b[$2,$8]++
  c[$2]=(c[$2]?c[$2] ORS:"")$0
  next
}
a[$2]>1 && b[$2,$8]==1{
  print c[$2]
  delete a[$2]
}' <(sort -t',' -k2 Input_file) <(sort -t',' -k2 Input_file)

With your shown example, the output is as follows.

m,123veh,john;doe,10/1/2019,ryzen,split,32929,38757ace,turn,left
m,123veh,john;doe,10/1/2019,ryzen,split,32929,495842,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,4978d87,turn,left

Explanation: a detailed explanation of the above code.

awk '                                                     ##Start the awk program.
BEGIN{                                                    ##Start the BEGIN section.
  FS=","                                                  ##Set FS to a comma.
}                                                         ##Close the BEGIN section.
FNR==NR{                                                  ##FNR==NR is TRUE only while the first copy of Input_file is being read.
  a[$2]++                                                 ##Array a, indexed by $2, counts how many times each column-2 value occurs.
  b[$2,$8]++                                              ##Array b, indexed by $2,$8, counts how many times each column-2/column-8 pair occurs.
  c[$2]=(c[$2]?c[$2] ORS:"")$0                            ##Array c, indexed by $2, keeps appending the full lines that share that column-2 value.
  next                                                    ##next skips all further statements for this line.
}                                                         ##Close the block for the FNR==NR condition.
a[$2]>1 && b[$2,$8]==1{                                   ##Second pass: column 2 occurs more than once AND this column-2/column-8 pair occurs exactly once.
  print c[$2]                                             ##Print all saved lines for this column-2 value.
  delete a[$2]                                            ##Delete a[$2] so the same group is not printed again for its other lines.
}' <(sort -t',' -k2 Input_file) <(sort -t',' -k2 Input_file)  ##Feed Input_file twice, sorted on the 2nd field, so all values for a key come together before operating on them.
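Since you want to run it as ./scriptname filename.csv, here is a minimal wrapper sketch, assuming bash is available (for the process substitution) and that the filename is passed as the first argument; scriptname and the variable csv are just illustrative names.

#!/bin/bash
# Minimal wrapper sketch around the awk program above.
# Usage: ./scriptname filename.csv
csv="$1"

awk '
BEGIN{
  FS=","
}
FNR==NR{
  a[$2]++
  b[$2,$8]++
  c[$2]=(c[$2]?c[$2] ORS:"")$0
  next
}
a[$2]>1 && b[$2,$8]==1{
  print c[$2]
  delete a[$2]
}' <(sort -t',' -k2 "$csv") <(sort -t',' -k2 "$csv")

Make it executable with chmod +x scriptname; redirecting standard output (./scriptname filename.csv > output.log) gives the log-file option mentioned in the question.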
answered Oct 02 '22 by RavinderSingh13

A Python solution to this problem could be (here I used _id and qty for the 2 captured fields):

import csv
from collections import defaultdict

d = defaultdict(lambda: defaultdict(list))

# group the full rows first by column 2 (_id), then by column 8 (qty)
with open('f1.txt', 'r', newline='') as f:
    for row in csv.reader(f):
        _id = row[1]
        qty = row[7]
        d[_id][qty].append(row)

for _id in d:
    for qty in d[_id]:
        # print the group when this _id has more than one distinct 'qty'
        # (column 2 duplicated with differing column 8 values);
        # the second test also keeps an _id that appears on only a single line
        if len(d[_id]) > 1 or len(d[_id][qty]) == 1:
            for row in d[_id][qty]:
                print(','.join(row))

Prints:

m,123veh,john;doe,10/1/2019,ryzen,split,32929,38757ace,turn,left
m,123veh,john;doe,10/1/2019,ryzen,split,32929,495842,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,4978d87,turn,left
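
To match the ./scriptname filename.csv invocation from the question, a hedged sketch of the same idea that takes the filename from the command line could look like the following. The column indexes 1 and 7 are the example's column 2 and column 8 (they would become 1 and 83 for the real file), and this version only prints groups whose column 2 is duplicated, so it drops the single-line case kept by the code above.

#!/usr/bin/env python3
# Sketch: group rows by column 2, then by column 8, and print only the groups
# where column 2 is duplicated and the column 8 values are not all identical.
# Usage: ./scriptname filename.csv   (redirect with "> output.log" for a log file)
import csv
import sys
from collections import defaultdict

ID_COL, QTY_COL = 1, 7          # 0-based: column 2 and column 8 (83 for column 84)

d = defaultdict(lambda: defaultdict(list))

with open(sys.argv[1], newline='') as f:
    for row in csv.reader(f):
        d[row[ID_COL]][row[QTY_COL]].append(row)

for _id in d:
    if len(d[_id]) > 1:          # more than one distinct column-8 value for this column-2 value
        for qty in d[_id]:
            for row in d[_id][qty]:
                print(','.join(row))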
answered Oct 02 '22 by Chris Charley