Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find a columns set for a primary key candidate in CSV file?

I have a CSV file (not normalized, example, real file up to 100 columns):

   ID, CUST_NAME, CLIENT_NAME, PAYMENT_NUM, START_DATE, END_DATE
    1,     CUST1,     CLIENT1,          10, 2018-04-01, 2018-04-02
    2,     CUST1,     CLIENT1,          10, 2018-04-01, 2018-05-30
    3,     CUST1,     CLIENT1,         101, 2018-04-02, 2018-04-03
    4,     CUST2,     CLIENT1,         102, 2018-04-02, 2018-04-03

How can I find ALL possible sets of columns which could be used as Primary key.

Desired output:

  1) ID
  2) PAYMENT_NUM,START_DATE,END_DATE
  3) CUST_NAME, CLIENT_NAME, PAYMENT_NUM,START_DATE,END_DATE

I could do it in Java but may be Python/Pandas already provides a quick solution

like image 353
GML-VS Avatar asked Apr 24 '18 09:04

GML-VS


People also ask

How do you find the primary key in a data frame?

If it's unique key, then length of row axis should equal count of unique values. Based on your full problem statement, you most likely need to use Name and Phone together to create the unique index. Great. In that case it's your call for which column to use as the primary key based on the context of your case.


2 Answers

pandas and itertools will give you what you're looking for.

import pandas
from itertools import chain, combinations

def key_options(items):
    return chain.from_iterable(combinations(items, r) for r in range(1, len(items)+1) )

df = pandas.read_csv('test.csv');

# iterate over all combos of headings, excluding ID for brevity
for candidate in key_options(list(df)[1:]):
    deduped = df.drop_duplicates(candidate)

    if len(deduped.index) == len(df.index):
        print ','.join(candidate)

This gives you the output:

PAYMENT_NUM, END_DATE
CUST_NAME, CLIENT_NAME, END_DATE
CUST_NAME, PAYMENT_NUM, END_DATE
CLIENT_NAME, PAYMENT_NUM, END_DATE
PAYMENT_NUM, START_DATE, END_DATE
CUST_NAME, CLIENT_NAME, PAYMENT_NUM, END_DATE
CUST_NAME, CLIENT_NAME, START_DATE, END_DATE
CUST_NAME, PAYMENT_NUM, START_DATE, END_DATE
CLIENT_NAME, PAYMENT_NUM, START_DATE, END_DATE
CUST_NAME, CLIENT_NAME, PAYMENT_NUM, START_DATE, END_DATE
like image 108
Simon Brahan Avatar answered Sep 25 '22 18:09

Simon Brahan


This is one way via itertools.combinations. It works by, for each set of columns, dropping duplicates and checking if the size of the dataframe changes.

This results in 44 distinct combinations of columns.

from itertools import combinations, chain

full_list = chain.from_iterable(combinations(df, i) for i in range(1, len(df.columns)+1))

n = len(df.index)

res = []
for cols in full_list:
    cols = list(cols)
    if len(df[cols].drop_duplicates().index) == n:
        res.append(cols)

print(len(res))  # 44
like image 27
jpp Avatar answered Sep 22 '22 18:09

jpp