I have a pandas dataframe with 21 columns. I am focusing on a subset of rows that have exactly same column data values except for 6 that are unique to each row. I don't know which column headings these 6 values correspond to a priori.
I tried converting each row to Index objects, and performed set operation on two rows. Ex.
row1 = pd.Index(sample_data[0])
row2 = pd.Index(sample_data[1])
row1 - row2
which returns an Index object containing values unique to row1. Then I can manually deduce which columns have unique values.
How can I programmatically grab the column headings that these values correspond to in the initial dataframe? Or, is there a way to compare two or multiple dataframe rows and extract the 6 different column values of each row, as well as the corresponding headings? Ideally, it would be nice to generate a new dataframe with the unique columns.
In particular, is there a way to do this using set operations?
Thank you.
Here's a quick solution to return only the columns in which the first two rows differ.
In [13]: df = pd.DataFrame(zip(*[range(5), list('abcde'), list('aaaaa'),
... list('bbbbb')]), columns=list('ABCD'))
In [14]: df
Out[14]:
A B C D
0 0 a a b
1 1 b a b
2 2 c a b
3 3 d a b
4 4 e a b
In [15]: df[df.columns[df.iloc[0] != df.iloc[1]]]
Out[15]:
A B
0 0 a
1 1 b
2 2 c
3 3 d
4 4 e
And a solution to find all columns with more than one unique value throughout the entire frame.
In [33]: df[df.columns[df.apply(lambda s: len(s.unique()) > 1)]]
Out[33]:
A B
0 0 a
1 1 b
2 2 c
3 3 d
4 4 e
You don't really need the index, you could just compare two rows and use that to filter the columns with a list comprehension.
df = pd.DataFrame({"col1": np.ones(10), "col2": np.ones(10), "col3": range(2,12)})
row1 = df.irow(0)
row2 = df.irow(1)
unique_columns = row1 != row2
cols = [colname for colname, unique_column in zip(df.columns, bools) if unique_column]
print cols # ['col3']
If you know the standard value for each column, you can convert all the rows to a list of booleans, i.e.:
standard_row = np.ones(3)
columns = df.columns
unique_columns = df.apply(lambda x: x != standard_row, axis=1)
unique_columns.apply(lambda x: [col for col, unique_column in zip(columns, x) if unique_column], axis=1)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With