create Pandas Dataframe with unique index

Tags:

python

pandas

Can I create a dataframe which has a unique index or columns, similar to creating a unique key in MySQL, so that it will return an error if I try to add a duplicate index?

Or is my only option to create an if-statement and check for the value in the dataframe before appending it?

EDIT:

It seems my question was a bit unclear. By unique columns I mean that a column cannot contain non-unique values.

With

df.append(new_row, verify_integrity=True)

we can check for all columns, but how can we check for only one or two columns?

asked Jan 20 '18 by user3605780


2 Answers

OP's follow-up question:

With df.append(new_row, verify_integrity=True), we can check for all columns, but how can we check for only one or two columns?

To check the uniqueness of a single column, say a column named value, one can try

df['value'].duplicated().any()

This checks whether any value in the column is duplicated. If any is, the column is not unique.
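As a minimal sketch (the DataFrame and the value column here are made up for illustration):

```python
import pandas as pd

# Hypothetical DataFrame whose "value" column contains a repeat
df = pd.DataFrame({"value": [1, 2, 2, 3]})

# True if any value in the column appears more than once
has_dupes = df["value"].duplicated().any()
print(has_dupes)  # True, because 2 appears twice
```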


Given two columns, say C1 and C2, to check whether there are duplicated rows, we can still use DataFrame.duplicated.

df[["C1", "C2"]].duplicated()

It checks row-wise uniqueness. You can again use any to check whether any of the returned values is True.
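For example (the column names and data are hypothetical):

```python
import pandas as pd

# Rows 0 and 1 share the same (C1, C2) pair
df = pd.DataFrame({"C1": [1, 1, 2], "C2": ["a", "a", "b"]})

# duplicated() marks the second occurrence of each repeated row
mask = df[["C1", "C2"]].duplicated()
print(mask.tolist())  # [False, True, False]
print(mask.any())     # True
```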


Given two columns, say C1 and C2, to check whether each column contains duplicated values, we can use apply.

df[["C1", "C2"]].apply(lambda x: x.duplicated().any())

This will apply the function to each column.
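A small sketch of the per-column check (again with made-up data):

```python
import pandas as pd

# C1 has a repeated value but C2 does not
df = pd.DataFrame({"C1": [1, 1, 2], "C2": ["a", "b", "c"]})

# Apply the duplicated-check to each column separately
per_col = df[["C1", "C2"]].apply(lambda x: x.duplicated().any())
print(per_col)
# C1     True
# C2    False
# dtype: bool
```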


NOTE

pd.DataFrame([[np.nan, np.nan],
              [ np.nan, np.nan]]).duplicated()

0    False
1     True
dtype: bool

np.nan is also captured by duplicated. If you want to ignore np.nan, you can select the non-NaN part first, for example with dropna.
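A short illustration of both behaviours:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 1.0])

# The second NaN counts as a duplicate of the first
print(s.duplicated().any())           # True

# Dropping NaNs first excludes them from the check
print(s.dropna().duplicated().any())  # False
```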

answered Sep 28 '22 by Tai


You can use df.append(..., verify_integrity=True) to maintain a unique row index:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3,4), columns=list('ABCD'))
dup_row = pd.DataFrame([[10,20,30,40]], columns=list('ABCD'), index=[1])
new_row = pd.DataFrame([[10,20,30,40]], columns=list('ABCD'), index=[9])

This successfully appends a new row (with index 9):

df.append(new_row, verify_integrity=True)
#     A   B   C   D
# 0   0   1   2   3
# 1   4   5   6   7
# 2   8   9  10  11
# 9  10  20  30  40

This raises ValueError because 1 is already in the index:

df.append(dup_row, verify_integrity=True)
# ValueError: Indexes have overlapping values: [1]
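As a side note, DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat accepts the same verify_integrity flag and raises the same error, so the example above can be written as:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list('ABCD'))
dup_row = pd.DataFrame([[10, 20, 30, 40]], columns=list('ABCD'), index=[1])

# verify_integrity=True makes concat reject the overlapping index label 1
try:
    pd.concat([df, dup_row], verify_integrity=True)
except ValueError as e:
    print(e)  # Indexes have overlapping values: [1]
```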

While the above works to ensure a unique row index, I'm not aware of a similar method for ensuring a unique column index. In theory you could transpose the DataFrame, append with verify_integrity=True, and then transpose again, but generally I would not recommend this: when the column dtypes are not all the same, the transposed DataFrame gets columns of object dtype, and conversion to and from object arrays can be bad for performance.

If you need both unique row- and column- Indexes, then perhaps a better alternative is to stack your DataFrame so that all the unique column index levels become row index levels. Then you can use append with verify_integrity=True on the reshaped DataFrame.
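A minimal sketch of that stacking idea, using pd.concat since append is gone in pandas 2.x (the DataFrame and the overlapping key here are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(4).reshape(2, 2), columns=list('AB'))

# stack() turns the column index into a second row-index level,
# giving a Series keyed by unique (row, column) pairs
s = df.stack()

# A new entry whose (row, column) key already exists in s
dup = pd.Series([99], index=pd.MultiIndex.from_tuples([(0, 'A')]))

# verify_integrity=True rejects the overlapping (0, 'A') key
try:
    pd.concat([s, dup], verify_integrity=True)
except ValueError as e:
    print("Rejected:", e)
```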

answered Sep 28 '22 by unutbu