I want to open a file, read it, drop duplicates in two of the file's columns, and then use the de-duplicated data for some calculations. To do this I am using pandas' drop_duplicates, which keeps the original index labels of the surviving rows, so the index is left with gaps. For example, after dropping row 1, file1 becomes file2:
file1:
   Var1  Var2  Var3  Var4
0    52     2     3    89
1    65     2     3    43
2    15     1     3    78
3    33     2     4    67
file2:
   Var1  Var2  Var3  Var4
0    52     2     3    89
2    15     1     3    78
3    33     2     4    67
To further use file2 as a dataframe I need to reindex it to 0, 1, 2, ...
Here is the code I am using:
import pandas as pd
file1 = pd.read_csv("filename.txt", sep='|', header=None, names=['Var1', 'Var2', 'Var3', 'Var4'])
file2 = file1.drop_duplicates(["Var2", "Var3"])
# create another variable as a new index: ni
file2['ni']= range(0, len(file2)) # this is the line that generates the warning
file2 = file2.set_index('ni')
Although the code runs and produces the correct result, the reindexing step gives the following warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
file2['ni']= range(0, len(file2))
I did check the link but I cannot figure out how to change my code. Any ideas on how to fix this?
Drop duplicates and reset the index: when rows are dropped from a DataFrame, the remaining rows keep their original index labels by default. To renumber the index of the resulting DataFrame, use the ignore_index parameter of DataFrame.drop_duplicates().
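As a minimal sketch of the ignore_index route, the example below rebuilds the question's four rows as an in-memory DataFrame (a stand-in for the real file) and drops duplicates on Var2 and Var3. It assumes pandas 1.0 or later, since older versions do not accept ignore_index.

```python
import pandas as pd

# Stand-in for file1 from the question (values copied from the example table).
file1 = pd.DataFrame(
    {
        "Var1": [52, 65, 15, 33],
        "Var2": [2, 2, 1, 2],
        "Var3": [3, 3, 3, 4],
        "Var4": [89, 43, 78, 67],
    }
)

# ignore_index=True relabels the surviving rows 0, 1, 2, ...
# Assumption: pandas >= 1.0 (the parameter does not exist before that).
file2 = file1.drop_duplicates(subset=["Var2", "Var3"], ignore_index=True)
print(file2)
```

Row 1 duplicates row 0 in Var2 and Var3, so it is removed and the remaining three rows are numbered consecutively.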
Return DataFrame with duplicate rows removed. Considering certain columns is optional. Indexes, including time indexes are ignored.
The pandas Index.drop_duplicates() method removes duplicate values from an Index. Removing duplicate data is a common step in data analysis.
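For contrast with the DataFrame method, here is a small sketch of Index.drop_duplicates() on a made-up Index of labels. Note that it de-duplicates the index labels themselves, not values in the columns, so it is probably not what the question needs.

```python
import pandas as pd

# Hypothetical index with repeated labels.
idx = pd.Index(["a", "b", "a", "c", "b"])

# Default keep="first": keep the first occurrence of each label.
unique_first = idx.drop_duplicates()

# keep="last": keep the last occurrence of each label instead.
unique_last = idx.drop_duplicates(keep="last")

print(list(unique_first))
print(list(unique_last))
```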
The pandas DataFrame.reindex() method lets you change the row index and the column labels. Note: any label in the new index that is not present in the old one gets a row (or column) filled with NaN.
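The NaN behavior matters for this question: reindex() looks rows up by label rather than renumbering them. The sketch below uses a small hypothetical frame whose index has the same gap as file2 (labels 0, 2, 3) to show what happens when asking for labels 0, 1, 2.

```python
import pandas as pd

# Hypothetical de-duplicated frame with a gap in its index (label 1 is missing).
deduped = pd.DataFrame({"Var1": [52, 15, 33]}, index=[0, 2, 3])

# reindex() selects by label: label 1 does not exist, so that row is NaN,
# and the row labelled 3 is dropped because it was not requested.
relabelled = deduped.reindex([0, 1, 2])
print(relabelled)
```

So reindex() aligns data to a new set of labels and does not renumber existing rows; renumbering is what reset_index(drop=True) does.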
Pandas has a built-in method for this task, which lets you avoid the warning by taking an alternative, and simpler, approach.
Rather than adding a new column of sequential numbers and then setting the index to that column as you did with:
file2['ni']= range(0, len(file2)) # this is the line that generates the warning
file2 = file2.set_index('ni')
You can instead use:
file2 = file2.reset_index(drop=True)
The default behavior of .reset_index() is to take the current index, insert that index as the first column of the dataframe, and then build a new index (I assume the logic here is that the default behavior makes it very easy to compare the old vs. new index, very useful for sanity checks). drop=True means that instead of preserving the old index as a new column, just get rid of it and replace it with the new index, which seems like what you want.
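To make the difference concrete, here is a small sketch comparing the two calls on a hypothetical frame with the same gapped index as file2. The frame and variable names are illustrative only.

```python
import pandas as pd

# Hypothetical frame with a gap in the index, like file2 after dropping row 1.
deduped = pd.DataFrame(
    {"Var1": [52, 15, 33], "Var2": [2, 1, 2]},
    index=[0, 2, 3],
)

# Default: the old index is kept as a new first column named "index".
kept = deduped.reset_index()

# drop=True: the old index is discarded entirely.
dropped = deduped.reset_index(drop=True)

print(kept)
print(dropped)
```

Both results have a fresh 0, 1, 2 index; only the first one carries the old labels along as a column.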
All together, your new code could look like this:
file1 = pd.read_csv("filename.txt",sep='|', header=None, names=['Var1', 'Var2', 'Var3', 'Var4'])
file2 = file1.drop_duplicates(["Var2", "Var3"]).reset_index(drop=True)
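If you want to try this without the real file, the following sketch feeds the same call an in-memory string instead. The pipe-separated text is an assumption about what filename.txt looks like, reconstructed from the example table in the question.

```python
import io

import pandas as pd

# Assumed contents of filename.txt: pipe-separated, no header row.
raw = "52|2|3|89\n65|2|3|43\n15|1|3|78\n33|2|4|67\n"

file1 = pd.read_csv(
    io.StringIO(raw),
    sep="|",
    header=None,
    names=["Var1", "Var2", "Var3", "Var4"],
)

# Drop rows that repeat an earlier (Var2, Var3) pair, then renumber from 0.
file2 = file1.drop_duplicates(["Var2", "Var3"]).reset_index(drop=True)
print(file2)
```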
See this question as well
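I think your .drop_duplicates() call is actually causing the warning: pandas treats its result as a possible copy of a slice of file1, so assigning a new column to it triggers the message.
Instead, make sure you work on a new, explicit copy of the dataframe: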
file2 = file1.drop_duplicates(["Var2", "Var3"]).copy()
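A minimal sketch of this approach, keeping the original two reindexing lines from the question and only adding .copy(). The DataFrame literal is a hypothetical stand-in for the file contents.

```python
import pandas as pd

# Stand-in for file1 from the question.
file1 = pd.DataFrame(
    {
        "Var1": [52, 65, 15, 33],
        "Var2": [2, 2, 1, 2],
        "Var3": [3, 3, 3, 4],
        "Var4": [89, 43, 78, 67],
    }
)

# .copy() makes file2 an independent DataFrame, so adding a column to it
# is no longer an assignment on something derived from file1.
file2 = file1.drop_duplicates(["Var2", "Var3"]).copy()

# The question's original reindexing steps, unchanged.
file2["ni"] = range(0, len(file2))
file2 = file2.set_index("ni")
print(file2)
```

file1 is left untouched by the assignment, and file2 ends up indexed 0, 1, 2 under the index name ni.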