I am trying to remove duplicate data from my dataframe (read from a CSV) and write a separate CSV showing the unique answers of each column. The problem is that my code has been running for a day (22 hours, to be exact). I'm open to other suggestions.
My data has about 20,000 rows with headers. I have tried checking the unique list of one column at a time with df[col].unique(), and that does not take long.
import pandas as pd

df = pd.read_csv('Surveydata.csv')
df_uni = df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))
df_uni.to_csv('Surveydata_unique.csv', index=False)
What I expect is a dataframe with the same set of columns but without any duplication within each one. E.g. if df['Rmoisture'] contains some combination of Yes, No, and NaN, the corresponding column of the other dataframe df_uni should contain only those three values.
EDIT: here is an example of the input and output.
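For illustration, a minimal sketch of the intended transformation (made-up column names and values, not the original survey data):

import pandas as pd

# Hypothetical input: each column repeats some answers
df = pd.DataFrame({'Rmoisture': ['Yes', 'No', 'Yes', None, 'No'],
                   'Rating': [1, 2, 1, 3, 2]})

# Expected output: each column reduced to its unique values
df_uni = df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))
print(df_uni)
#   Rmoisture  Rating
# 0       Yes       1
# 1        No       2
# 2      None       3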
Another method:
# Collect the unique values of each column, then align them side by side
new_df = [pd.DataFrame(df[i].unique(), columns=[i]) for i in df.columns]
new_df = pd.concat(new_df, axis=1)
print(new_df)
Mass Length Material Special Mark Special Num Breaking \
0 4.0 5.500000 Wood A 20.0 Yes
1 12.0 2.600000 Steel NaN NaN No
2 1.0 3.500000 Rubber B 5.5 NaN
3 15.0 6.500000 Plastic X 6.6 NaN
4 6.0 12.000000 NaN NaN 5.6 NaN
5 14.0 2.500000 NaN NaN 6.3 NaN
6 2.0 15.000000 NaN NaN NaN NaN
7 8.0 2.000000 NaN NaN NaN NaN
8 7.0 10.000000 NaN NaN NaN NaN
9 9.0 2.200000 NaN NaN NaN NaN
10 11.0 4.333333 NaN NaN NaN NaN
11 13.0 4.666667 NaN NaN NaN NaN
12 NaN 3.750000 NaN NaN NaN NaN
13 NaN 1.666667 NaN NaN NaN NaN
Comment
0 There is no heat
1 NaN
2 Contains moisture
3 Hit the table instead
4 A sign of wind
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
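The trailing NaNs here are not survey answers: pd.concat joins the per-column results on their integer index, so any column with fewer unique values than the longest one is padded with NaN. A minimal sketch of that alignment behavior:

import pandas as pd

# concat with axis=1 aligns on the index; the shorter Series is padded with NaN
s1 = pd.Series([1, 2, 3])
s2 = pd.Series(['a'])
print(pd.concat([s1, s2], axis=1))
#    0    1
# 0  1    a
# 1  2  NaN
# 2  3  NaN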
If the order of values within each column is not important, convert each column to a set to remove duplicates, then back to a Series, and join the pieces together with concat:
df1 = pd.concat({k: pd.Series(list(set(v))) for k, v in df.to_dict('list').items()}, axis=1)
If order is important:
df1 = pd.concat({col: pd.Series(df[col].unique()) for col in df.columns}, axis=1)
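Either way, the result can be written back out the same way as in the question:

df1.to_csv('Surveydata_unique.csv', index=False)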
Performance, values sampled from 1k possibilities, 20 rows x 2k columns:
import numpy as np

np.random.seed(2019)
# 20 rows, 2000 columns
df = pd.DataFrame(np.random.randint(1000, size=(20, 2000))).astype(str)
In [151]: %timeit df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))
1.07 s ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [152]: %timeit pd.concat({k: pd.Series(list(set(v))) for k, v in df.to_dict('list').items()}, axis=1)
323 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [153]: %timeit pd.concat({col: pd.Series(df[col].unique()) for col in df.columns}, axis=1)
430 ms ± 4.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Performance, values sampled from 100 possibilities, 20 rows x 2k columns:
df = pd.DataFrame(np.random.randint(100, size=(20, 2000))).astype(str)
In [155]: %timeit df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))
1.3 s ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [156]: %timeit pd.concat({k: pd.Series(list(set(v))) for k, v in df.to_dict('list').items()}, axis=1)
544 ms ± 3.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [157]: %timeit pd.concat({col: pd.Series(df[col].unique()) for col in df.columns}, axis=1)
654 ms ± 3.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
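So in both settings the set-based variant is the fastest of the three, with the unique()-based dict comprehension close behind; both are roughly 2-3x faster than the apply/drop_duplicates approach from the question.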