Can I create a DataFrame that has a unique index or unique column values, similar to creating a unique key in MySQL, so that it returns an error if I try to add a duplicate?
Or is my only option to write an if-statement and check for the value in the DataFrame before appending it?
EDIT:
It seems my question was a bit unclear. By unique columns I mean that a column may not contain duplicate values.
With df.append(new_row, verify_integrity=True) we can check for all columns, but how can we check for only one or two columns?
The column you want to index does not need to have unique values.
The unique function in pandas finds the unique values in a Series, i.e. a single column of a DataFrame. It works on a Series of strings, integers, tuples, or mixed elements.
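As a quick illustration with a made-up Series, unique returns the distinct values, and the related is_unique property gives a direct yes/no answer:
import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3])
s.unique()
# array([1, 2, 3])
s.is_unique
# False, since 2 and 3 each appear more than once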
OP's follow-up question:
With df.append(new_row, verify_integrity=True), we can check for all columns, but how can we check for only one or two columns?
To check the uniqueness of a single column, say one named value, one can try
df['value'].duplicated().any()
This checks whether any value in the column is duplicated; if it is, the column is not unique.
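For example, with a throwaway DataFrame (the data here is purely illustrative):
import pandas as pd

df = pd.DataFrame({"value": [10, 20, 20, 30]})
df["value"].duplicated().any()
# True, because 20 appears twice, so the column is not unique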
Given two columns, say C1 and C2, to check whether there are duplicated rows, we can still use DataFrame.duplicated:
df[["C1", "C2"]].duplicated()
This checks row-wise uniqueness. You can again use any to see whether any of the returned values is True.
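A small sketch with made-up data:
import pandas as pd

df = pd.DataFrame({"C1": [1, 1, 2], "C2": ["a", "a", "b"]})
df[["C1", "C2"]].duplicated()
# 0    False
# 1     True
# 2    False
# dtype: bool
df[["C1", "C2"]].duplicated().any()
# True, because row 1 repeats row 0 on both columns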
Given two columns, say C1 and C2, to check whether each column individually contains duplicated values, we can use apply:
df[["C1", "C2"]].apply(lambda x: x.duplicated().any())
This applies the function to each column separately.
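Continuing with invented data, note how the result is per column rather than per row:
import pandas as pd

df = pd.DataFrame({"C1": [1, 1, 2], "C2": ["a", "b", "c"]})
df[["C1", "C2"]].apply(lambda x: x.duplicated().any())
# C1     True
# C2    False
# dtype: bool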
Note that np.nan values are also captured by duplicated:
pd.DataFrame([[np.nan, np.nan],
              [np.nan, np.nan]]).duplicated()
# 0    False
# 1     True
# dtype: bool
If you want to ignore np.nan, first select the non-NaN part, e.g. with dropna.
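A minimal sketch of that, using an invented Series:
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 2.0])
s.duplicated().any()
# True, since the second NaN counts as a duplicate of the first
s.dropna().duplicated().any()
# False, since after dropping NaN the remaining values are unique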
You can use df.append(..., verify_integrity=True) to maintain a unique row index:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12).reshape(3,4), columns=list('ABCD'))
dup_row = pd.DataFrame([[10,20,30,40]], columns=list('ABCD'), index=[1])
new_row = pd.DataFrame([[10,20,30,40]], columns=list('ABCD'), index=[9])
This successfully appends a new row (with index 9):
df.append(new_row, verify_integrity=True)
# A B C D
# 0 0 1 2 3
# 1 4 5 6 7
# 2 8 9 10 11
# 9 10 20 30 40
This raises ValueError because 1 is already in the index:
df.append(dup_row, verify_integrity=True)
# ValueError: Indexes have overlapping values: [1]
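A side note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. pd.concat accepts the same verify_integrity flag, so on newer versions the equivalent check would be something like:
import pandas as pd

# pd.concat performs the same overlap check on the row index
pd.concat([df, dup_row], verify_integrity=True)
# ValueError: Indexes have overlapping values: ...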
While the above works to ensure a unique row index, I'm not aware of a similar method for ensuring a unique column index. In theory you could transpose the DataFrame, append with verify_integrity=True, and then transpose again, but generally I would not recommend this, since transposing can alter dtypes when the column dtypes are not all the same. (In that case the transposed DataFrame gets columns of object dtype, and conversion to and from object arrays can be bad for performance.)
If you need both unique row and column indexes, then perhaps a better alternative is to stack your DataFrame so that all the unique column index levels become row index levels. Then you can use append with verify_integrity=True on the reshaped DataFrame.
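A rough sketch of that idea, written with pd.concat since append is gone in pandas 2.0 (the entry (1, 'B') is made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list('ABCD'))

# stack() moves the column labels into a second row-index level,
# so each (row, column) pair becomes one entry of a single unique index
stacked = df.stack()

# (1, 'B') already exists in the stacked index, so this raises
new = pd.Series([99], index=pd.MultiIndex.from_tuples([(1, 'B')]))
pd.concat([stacked, new], verify_integrity=True)
# ValueError: Indexes have overlapping values: ...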