
Reindexing after pandas.drop_duplicates

I want to open a file, read it, drop duplicates in two of the file's columns, and then use the deduplicated result for further calculations. To do this I am using pandas.drop_duplicates, which after dropping the duplicates also leaves gaps in the index. For example, after dropping line 1, file1 becomes file2:

file1:
   Var1    Var2    Var3   Var4
0    52     2       3      89
1    65     2       3      43
2    15     1       3      78
3    33     2       4      67

file2:
   Var1    Var2    Var3   Var4
0    52     2       3      89
2    15     1       3      78
3    33     2       4      67

To further use file2 as a dataframe I need to reindex it to 0, 1, 2, ...

Here is the code I am using:

file1 = pd.read_csv("filename.txt",sep='|', header=None, names=['Var1', 'Var2', 'Var3', 'Var4']) 
file2 = file1.drop_duplicates(["Var2", "Var3"])
# create another variable as a new index: ni
file2['ni']= range(0, len(file2)) # this is the line that generates the warning
file2 = file2.set_index('ni')

Although the code runs and produces correct results, the reindexing gives the following warning:

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  file2['ni']= range(0, len(file2))

I did check the link, but I cannot figure out how to change my code. Any ideas on how to fix this?

asked Mar 05 '15 by Brebenel


2 Answers

Pandas has a built-in function for this, which lets you avoid the warning entirely with a simpler approach.

Rather than adding a new column of sequential numbers and then setting the index to that column as you did with:

file2['ni']= range(0, len(file2)) # this is the line that generates the warning
file2 = file2.set_index('ni')

You can instead use:

file2 = file2.reset_index(drop=True)

The default behavior of .reset_index() is to take the current index, insert it as the first column of the dataframe, and then build a new index (presumably because this makes it easy to compare the old vs. new index, which is useful for sanity checks). drop=True means that instead of preserving the old index as a new column, it just gets discarded and replaced with the new index, which is what you want here.
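The difference between the two behaviors can be sketched with a small made-up frame (the column name and values below are illustrative, not from the question):

```python
import pandas as pd

# A frame with a non-sequential index, like the one drop_duplicates leaves behind
df = pd.DataFrame({"Var1": [52, 15, 33]}, index=[0, 2, 3])

# Default: the old index is preserved as a new first column named "index"
with_old = df.reset_index()
print(list(with_old.columns))  # ['index', 'Var1']

# drop=True: the old index is discarded and replaced by 0, 1, 2, ...
clean = df.reset_index(drop=True)
print(list(clean.index))  # [0, 1, 2]
```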

Put together, your new code could look like this:

file1 = pd.read_csv("filename.txt",sep='|', header=None, names=['Var1', 'Var2', 'Var3', 'Var4']) 
file2 = file1.drop_duplicates(["Var2", "Var3"]).reset_index(drop=True)

See this question as well

answered Oct 17 '22 by cjprybol

I think your .drop_duplicates() call is what actually causes the warning: its result may be flagged as a view of the original dataframe, so later assignments to it look like writes to a copy of a slice.

Instead make sure you make a new copy of the dataframe:

file2 = file1.drop_duplicates(["Var2", "Var3"]).copy()
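A minimal sketch of why this silences the warning, using made-up data shaped like the question's (the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Var2": [2, 2, 1], "Var3": [3, 3, 3], "Var4": [89, 43, 78]})

# .copy() makes file2 own its data instead of possibly being a view of df,
# so a later column assignment cannot trigger SettingWithCopyWarning
file2 = df.drop_duplicates(["Var2", "Var3"]).copy()

# Safe to add columns now
file2["ni"] = range(len(file2))
print(file2["ni"].tolist())  # [0, 1]
```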
answered Oct 17 '22 by jorijnsmit