I am trying to remove duplicate data from my dataframe (read from a CSV) and write a separate CSV showing the unique answers of each column. The problem is that my code has been running for a day (22 hours, to be exact). I'm open to other suggestions.
My data has about 20,000 rows with headers. I have tried checking the unique list of one column at a time with df[col].unique(), and that does not take long.
import pandas as pd

df = pd.read_csv('Surveydata.csv')
df_uni = df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))
df_uni.to_csv('Surveydata_unique.csv', index=False)
What I expect is a dataframe with the same set of columns but without any duplication within each one. E.g. if df['Rmoisture'] contains some combination of Yes, No, and NaN, the corresponding column of the other dataframe df_uni should contain only those three values.
EDIT: here is an example of the input and output.
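For illustration, a minimal sketch of the intended transformation (made-up column names and values, not the original survey data):

import pandas as pd

# Hypothetical input: each column repeats some answers
df = pd.DataFrame({'Rmoisture': ['Yes', 'No', 'Yes', None, 'No'],
                   'Rating': [1, 2, 1, 3, 2]})

# Expected output: each column reduced to its unique values
df_uni = df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))
print(df_uni)
#   Rmoisture  Rating
# 0       Yes       1
# 1        No       2
# 2      None       3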
Another method:
# Collect the unique values of each column, then align them side by side
new_df = [pd.DataFrame(df[i].unique(), columns=[i]) for i in df.columns]
new_df = pd.concat(new_df, axis=1)
print(new_df)
Mass Length Material Special Mark Special Num Breaking \
0 4.0 5.500000 Wood A 20.0 Yes
1 12.0 2.600000 Steel NaN NaN No
2 1.0 3.500000 Rubber B 5.5 NaN
3 15.0 6.500000 Plastic X 6.6 NaN
4 6.0 12.000000 NaN NaN 5.6 NaN
5 14.0 2.500000 NaN NaN 6.3 NaN
6 2.0 15.000000 NaN NaN NaN NaN
7 8.0 2.000000 NaN NaN NaN NaN
8 7.0 10.000000 NaN NaN NaN NaN
9 9.0 2.200000 NaN NaN NaN NaN
10 11.0 4.333333 NaN NaN NaN NaN
11 13.0 4.666667 NaN NaN NaN NaN
12 NaN 3.750000 NaN NaN NaN NaN
13 NaN 1.666667 NaN NaN NaN NaN
Comment
0 There is no heat
1 NaN
2 Contains moisture
3 Hit the table instead
4 A sign of wind
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
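The trailing NaNs here are not survey answers: pd.concat joins the per-column results on their integer index, so any column with fewer unique values than the longest one is padded with NaN. A minimal sketch of that alignment behavior:

import pandas as pd

# concat with axis=1 aligns on the index; the shorter Series is padded with NaN
s1 = pd.Series([1, 2, 3])
s2 = pd.Series(['a'])
print(pd.concat([s1, s2], axis=1))
#    0    1
# 0  1    a
# 1  2  NaN
# 2  3  NaN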
If the order of values within each column is not important, convert each column to a set to remove duplicates, then back to a Series, and join the pieces together with concat:
df1 = pd.concat({k: pd.Series(list(set(v))) for k, v in df.to_dict('list').items()}, axis=1)
If order is important:
df1 = pd.concat({col: pd.Series(df[col].unique()) for col in df.columns}, axis=1)
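Either way, the result can be written back out the same way as in the question:

df1.to_csv('Surveydata_unique.csv', index=False)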
Performance, values sampled from 1k possibilities, 20 rows x 2k columns:
import numpy as np

np.random.seed(2019)
# 20 rows, 2000 columns
df = pd.DataFrame(np.random.randint(1000, size=(20, 2000))).astype(str)
In [151]: %timeit df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))
1.07 s ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [152]: %timeit pd.concat({k: pd.Series(list(set(v))) for k, v in df.to_dict('list').items()}, axis=1)
323 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [153]: %timeit pd.concat({col: pd.Series(df[col].unique()) for col in df.columns}, axis=1)
430 ms ± 4.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Performance, values sampled from 100 possibilities, 20 rows x 2k columns:
df = pd.DataFrame(np.random.randint(100, size=(20, 2000))).astype(str)
In [155]: %timeit df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))
1.3 s ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [156]: %timeit pd.concat({k: pd.Series(list(set(v))) for k, v in df.to_dict('list').items()}, axis=1)
544 ms ± 3.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [157]: %timeit pd.concat({col: pd.Series(df[col].unique()) for col in df.columns}, axis=1)
654 ms ± 3.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
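So in both settings the set-based variant is the fastest of the three, with the unique()-based dict comprehension close behind; both are roughly 2-3x faster than the apply/drop_duplicates approach from the question.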