Can I create a DataFrame that has a unique index or unique column values, similar to creating a unique key in MySQL, so that it returns an error if I try to add a duplicate?
Or is my only option to write an if-statement and check for the value in the DataFrame before appending it?
EDIT:
It seems my question was a bit unclear. By unique columns I mean that a column may not contain duplicate values.
With df.append(new_row, verify_integrity=True) we can check for all columns, but how can we check for only one or two columns?
The column you want to index does not need to have unique values.
The unique function in pandas finds the unique values in a Series, i.e. a single column of a DataFrame. It works on a Series of strings, integers, tuples, or mixed elements.
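As a quick illustration with a made-up Series, unique returns the distinct values, and the related is_unique property gives a direct yes/no answer:
import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3])
s.unique()
# array([1, 2, 3])
s.is_unique
# False, since 2 and 3 each appear more than once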
OP's follow-up question:
With df.append(new_row, verify_integrity=True), we can check for all columns, but how can we check for only one or two columns?
To check the uniqueness of a single column, say one named value, one can try
df['value'].duplicated().any()
This checks whether any value in the column is duplicated; if it is, the column is not unique.
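For example, with a throwaway DataFrame (the data here is purely illustrative):
import pandas as pd

df = pd.DataFrame({"value": [10, 20, 20, 30]})
df["value"].duplicated().any()
# True, because 20 appears twice, so the column is not unique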
Given two columns, say C1 and C2, to check whether there are duplicated rows, we can still use DataFrame.duplicated:
df[["C1", "C2"]].duplicated()
This checks row-wise uniqueness. You can again use any to see whether any of the returned values is True.
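A small sketch with made-up data:
import pandas as pd

df = pd.DataFrame({"C1": [1, 1, 2], "C2": ["a", "a", "b"]})
df[["C1", "C2"]].duplicated()
# 0    False
# 1     True
# 2    False
# dtype: bool
df[["C1", "C2"]].duplicated().any()
# True, because row 1 repeats row 0 on both columns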
Given two columns, say C1 and C2, to check whether each column individually contains duplicated values, we can use apply:
df[["C1", "C2"]].apply(lambda x: x.duplicated().any())
This applies the function to each column separately.
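Continuing with invented data, note how the result is per column rather than per row:
import pandas as pd

df = pd.DataFrame({"C1": [1, 1, 2], "C2": ["a", "b", "c"]})
df[["C1", "C2"]].apply(lambda x: x.duplicated().any())
# C1     True
# C2    False
# dtype: bool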
Note that np.nan values are also captured by duplicated:
pd.DataFrame([[np.nan, np.nan],
              [np.nan, np.nan]]).duplicated()
# 0    False
# 1     True
# dtype: bool
If you want to ignore np.nan, first select the non-NaN part, e.g. with dropna.
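A minimal sketch of that, using an invented Series:
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 2.0])
s.duplicated().any()
# True, since the second NaN counts as a duplicate of the first
s.dropna().duplicated().any()
# False, since after dropping NaN the remaining values are unique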
You can use df.append(..., verify_integrity=True) to maintain a unique row index:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12).reshape(3,4), columns=list('ABCD'))
dup_row = pd.DataFrame([[10,20,30,40]], columns=list('ABCD'), index=[1])
new_row = pd.DataFrame([[10,20,30,40]], columns=list('ABCD'), index=[9])
This successfully appends a new row (with index 9):
df.append(new_row, verify_integrity=True)
# A B C D
# 0 0 1 2 3
# 1 4 5 6 7
# 2 8 9 10 11
# 9 10 20 30 40
This raises ValueError because 1 is already in the index:
df.append(dup_row, verify_integrity=True)
# ValueError: Indexes have overlapping values: [1]
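A side note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. pd.concat accepts the same verify_integrity flag, so on newer versions the equivalent check would be something like:
import pandas as pd

# pd.concat performs the same overlap check on the row index
pd.concat([df, dup_row], verify_integrity=True)
# ValueError: Indexes have overlapping values: ...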
While the above works to ensure a unique row index, I'm not aware of a similar method for ensuring a unique column index. In theory you could transpose the DataFrame, append with verify_integrity=True, and then transpose again, but generally I would not recommend this, since transposing can alter dtypes when the column dtypes are not all the same. (In that case the transposed DataFrame gets columns of object dtype, and conversion to and from object arrays can be bad for performance.)
If you need both unique row and column indexes, then perhaps a better alternative is to stack your DataFrame so that all the unique column index levels become row index levels. Then you can use append with verify_integrity=True on the reshaped DataFrame.
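A rough sketch of that idea, written with pd.concat since append is gone in pandas 2.0 (the entry (1, 'B') is made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list('ABCD'))

# stack() moves the column labels into a second row-index level,
# so each (row, column) pair becomes one entry of a single unique index
stacked = df.stack()

# (1, 'B') already exists in the stacked index, so this raises
new = pd.Series([99], index=pd.MultiIndex.from_tuples([(1, 'B')]))
pd.concat([stacked, new], verify_integrity=True)
# ValueError: Indexes have overlapping values: ...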