How to find the column sets that are primary key candidates in a CSV file?

Tags: python, algorithm, sql, pandas

I have a CSV file that is not normalized. Below is a small example; the real file has up to 100 columns:

   ID, CUST_NAME, CLIENT_NAME, PAYMENT_NUM, START_DATE, END_DATE
    1,     CUST1,     CLIENT1,          10, 2018-04-01, 2018-04-02
    2,     CUST1,     CLIENT1,          10, 2018-04-01, 2018-05-30
    3,     CUST1,     CLIENT1,         101, 2018-04-02, 2018-04-03
    4,     CUST2,     CLIENT1,         102, 2018-04-02, 2018-04-03

How can I find ALL possible sets of columns that could be used as a primary key?

Desired output:

  1) ID
  2) PAYMENT_NUM, START_DATE, END_DATE
  3) CUST_NAME, CLIENT_NAME, PAYMENT_NUM, START_DATE, END_DATE

I could do it in Java, but maybe Python/pandas already provides a quick solution.


asked Apr 24 '18 09:04

GML-VS

2 Answers

pandas and itertools will give you what you're looking for.

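import pandas
from itertools import chain, combinations

def key_options(items):
    # every non-empty combination of items, from single columns up to all of them
    return chain.from_iterable(combinations(items, r) for r in range(1, len(items) + 1))

# skipinitialspace strips the blank that follows each comma in the sample file
df = pandas.read_csv('test.csv', skipinitialspace=True)

# iterate over all combos of headings, excluding ID for brevity
for candidate in key_options(list(df)[1:]):
    deduped = df.drop_duplicates(subset=list(candidate))

    # if no rows were dropped, the combination is unique for every row
    if len(deduped.index) == len(df.index):
        print(', '.join(candidate))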

This gives you the output:

CUST_NAME, END_DATE
PAYMENT_NUM, END_DATE
CUST_NAME, CLIENT_NAME, END_DATE
CUST_NAME, PAYMENT_NUM, END_DATE
CUST_NAME, START_DATE, END_DATE
CLIENT_NAME, PAYMENT_NUM, END_DATE
PAYMENT_NUM, START_DATE, END_DATE
CUST_NAME, CLIENT_NAME, PAYMENT_NUM, END_DATE
CUST_NAME, CLIENT_NAME, START_DATE, END_DATE
CUST_NAME, PAYMENT_NUM, START_DATE, END_DATE
CLIENT_NAME, PAYMENT_NUM, START_DATE, END_DATE
CUST_NAME, CLIENT_NAME, PAYMENT_NUM, START_DATE, END_DATE
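
If you only want to test one particular column set without a file on disk, something like the following may help. It is a minimal sketch: the helper name is_candidate_key and the inline CSV_TEXT string are my own assumptions (the string just copies the sample rows from the question), and it uses DataFrame.duplicated instead of comparing lengths after drop_duplicates.

```python
import io

import pandas as pd

# Sample rows copied from the question; this inline string stands in for test.csv.
CSV_TEXT = """ID, CUST_NAME, CLIENT_NAME, PAYMENT_NUM, START_DATE, END_DATE
1, CUST1, CLIENT1, 10, 2018-04-01, 2018-04-02
2, CUST1, CLIENT1, 10, 2018-04-01, 2018-05-30
3, CUST1, CLIENT1, 101, 2018-04-02, 2018-04-03
4, CUST2, CLIENT1, 102, 2018-04-02, 2018-04-03
"""


def is_candidate_key(df, columns):
    """Hypothetical helper: True if no two rows share the same values in `columns`."""
    return not df.duplicated(subset=list(columns)).any()


# skipinitialspace strips the blank after each comma, so headers have no leading space
df = pd.read_csv(io.StringIO(CSV_TEXT), skipinitialspace=True)

print(is_candidate_key(df, ["ID"]))                                      # True
print(is_candidate_key(df, ["PAYMENT_NUM", "START_DATE", "END_DATE"]))   # True
print(is_candidate_key(df, ["PAYMENT_NUM", "START_DATE"]))               # False: rows 1 and 2 collide
```

Note that this only says the columns are unique in the data you have; whether they make sense as a primary key is still a modelling decision.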


answered Sep 25 '22 18:09

Simon Brahan

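This is one way via itertools.combinations. For each set of columns, it drops duplicate rows and checks whether the number of rows changes; if it does not, that set identifies every row uniquely.

For the sample data this results in 44 distinct combinations of columns. Every combination that contains ID qualifies, which accounts for 32 of them.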

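from itertools import combinations, chain

# df is the DataFrame loaded from the CSV file in the question

# every non-empty combination of column names, smallest first
full_list = chain.from_iterable(combinations(df, i) for i in range(1, len(df.columns)+1))

n = len(df.index)

res = []
for cols in full_list:
    cols = list(cols)
    # keep the combination if dropping duplicates removes no rows
    if len(df[cols].drop_duplicates().index) == n:
        res.append(cols)

print(len(res))  # 44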
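
Most of those 44 are supersets of a smaller key. If only the minimal sets are of interest, one possible variation is to skip any combination that already contains a known key. This is a rough sketch rather than part of the approach above: the function name minimal_candidate_keys and the max_size parameter are hypothetical, and the DataFrame is built inline from the question's sample rows.

```python
from itertools import combinations

import pandas as pd


def minimal_candidate_keys(df, max_size=None):
    """Yield column sets that are unique per row and contain no smaller unique set."""
    columns = list(df.columns)
    max_size = len(columns) if max_size is None else max_size
    found = []  # minimal keys discovered so far, stored as frozensets
    for size in range(1, max_size + 1):
        for cols in combinations(columns, size):
            candidate = frozenset(cols)
            # A superset of a known key is unique too, but not minimal: skip it.
            if any(key <= candidate for key in found):
                continue
            if not df.duplicated(subset=list(cols)).any():
                found.append(candidate)
                yield cols


# Sample rows from the question, built inline so the sketch runs on its own.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "CUST_NAME": ["CUST1", "CUST1", "CUST1", "CUST2"],
    "CLIENT_NAME": ["CLIENT1"] * 4,
    "PAYMENT_NUM": [10, 10, 101, 102],
    "START_DATE": ["2018-04-01", "2018-04-01", "2018-04-02", "2018-04-02"],
    "END_DATE": ["2018-04-02", "2018-05-30", "2018-04-03", "2018-04-03"],
})

print(list(minimal_candidate_keys(df)))
# [('ID',), ('CUST_NAME', 'END_DATE'), ('PAYMENT_NUM', 'END_DATE')]
```

The loop still walks every combination, so with around 100 columns the full search is not feasible; capping max_size at a small number (say 3 or 4) keeps it tractable at the cost of missing larger keys.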

answered Sep 22 '22 18:09

jpp
