
Reindexing after pandas.drop_duplicates

I want to open a file, read it, drop duplicates in two of the file's columns, and then use the deduplicated result for further calculations. To do this I am using pandas.drop_duplicates, which after dropping the duplicates also leaves gaps in the index. For example, after dropping line 1, file1 becomes file2:

file1:
   Var1    Var2    Var3   Var4
0    52     2       3      89
1    65     2       3      43
2    15     1       3      78
3    33     2       4      67

file2:
   Var1    Var2    Var3   Var4
0    52     2       3      89
2    15     1       3      78
3    33     2       4      67

To further use file2 as a dataframe I need to reindex it to 0, 1, 2, ...

Here is the code I am using:

file1 = pd.read_csv("filename.txt",sep='|', header=None, names=['Var1', 'Var2', 'Var3', 'Var4']) 
file2 = file1.drop_duplicates(["Var2", "Var3"])
# create another variable as a new index: ni
file2['ni']= range(0, len(file2)) # this is the line that generates the warning
file2 = file2.set_index('ni')

Although the code runs and produces correct results, the reindexing gives the following warning:

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  file2['ni']= range(0, len(file2))

I did check the link, but I cannot figure out how to change my code. Any ideas on how to fix this?

asked Mar 05 '15 by Brebenel


2 Answers

Pandas has a built-in function for this, which lets you avoid the warning entirely with a simpler approach.

Rather than adding a new column of sequential numbers and then setting the index to that column as you did with:

file2['ni']= range(0, len(file2)) # this is the line that generates the warning
file2 = file2.set_index('ni')

You can instead use:

file2 = file2.reset_index(drop=True)

The default behavior of .reset_index() is to take the current index, insert it as the first column of the dataframe, and then build a new index (presumably because this makes it easy to compare the old vs. new index, which is useful for sanity checks). drop=True means that instead of preserving the old index as a new column, it just gets discarded and replaced with the new index, which is what you want here.
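The difference between the two behaviors can be sketched with a small made-up frame (the column name and values below are illustrative, not from the question):

```python
import pandas as pd

# A frame with a non-sequential index, like the one drop_duplicates leaves behind
df = pd.DataFrame({"Var1": [52, 15, 33]}, index=[0, 2, 3])

# Default: the old index is preserved as a new first column named "index"
with_old = df.reset_index()
print(list(with_old.columns))  # ['index', 'Var1']

# drop=True: the old index is discarded and replaced by 0, 1, 2, ...
clean = df.reset_index(drop=True)
print(list(clean.index))  # [0, 1, 2]
```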

Put together, your new code could look like this:

file1 = pd.read_csv("filename.txt",sep='|', header=None, names=['Var1', 'Var2', 'Var3', 'Var4']) 
file2 = file1.drop_duplicates(["Var2", "Var3"]).reset_index(drop=True)

See this question as well

answered Oct 17 '22 by cjprybol

I think your .drop_duplicates() call is what actually causes the warning: its result may be flagged as a view of the original dataframe, so later assignments to it look like writes to a copy of a slice.

Instead make sure you make a new copy of the dataframe:

file2 = file1.drop_duplicates(["Var2", "Var3"]).copy()
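A minimal sketch of why this silences the warning, using made-up data shaped like the question's (the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Var2": [2, 2, 1], "Var3": [3, 3, 3], "Var4": [89, 43, 78]})

# .copy() makes file2 own its data instead of possibly being a view of df,
# so a later column assignment cannot trigger SettingWithCopyWarning
file2 = df.drop_duplicates(["Var2", "Var3"]).copy()

# Safe to add columns now
file2["ni"] = range(len(file2))
print(file2["ni"].tolist())  # [0, 1]
```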
answered Oct 17 '22 by jorijnsmit