
python del vs pandas drop

I know this may be an old debate, but between pandas' drop and Python's del statement, which performs better on a large dataset?

I am learning machine learning with Python 3 and am not sure which one to use. My data is in a pandas DataFrame, but del is built into Python itself.

sagar jain, asked Nov 22 '17 at 02:11



2 Answers

Summarizing a few points about functionality:

  • drop operates on both columns and rows; del operates on columns only.
  • drop can remove multiple items in one call; del removes only one at a time.
  • drop can operate in place or return a copy; del always operates in place.

The documentation at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html has more details on drop's features.
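As a quick illustration of the three points above, here is a minimal sketch on a toy frame. The column names a through d and the variable names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})

# del removes exactly one column, in place.
del df["a"]

# drop can remove several columns at once and returns a new DataFrame by default.
without_bc = df.drop(columns=["b", "c"])

# drop can also remove rows by index label.
without_row0 = df.drop(index=[0])

# With inplace=True, drop modifies df itself and returns None.
result = df.drop(columns=["d"], inplace=True)

print(list(without_bc.columns))   # ['d']
print(list(without_row0.index))   # [1]
print(list(df.columns), result)   # ['b', 'c'] None
```

Note that the two non-inplace drop calls leave df untouched, which is why it still has columns b and c at the end.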

flow2k, answered Oct 21 '22 at 01:10


Using randomly generated data of about 1.6 GB (20,000 × 10,000 float64 values), df.drop appears to be faster than del, especially over multiple columns:

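import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(20000,10000))
t_1 = time.time()
# Note: axis defaults to 0, so this removes the rows labeled 2, 4 and 1000; pass axis=1 to remove columns, as del does.
df.drop(labels=[2,4,1000], inplace=True)
t_2 = time.time()
print(t_2 - t_1)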

0.9118959903717041

Compared to:

df = pd.DataFrame(np.random.rand(20000,10000))
t_3 = time.time()
del df[2]
del df[4]
del df[1000]
t_4 = time.time()
print(t_4 - t_3)

4.052732944488525

@Inder's comparison is not quite the same since it doesn't use inplace=True.
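If you want to rerun the comparison yourself, here is a rough sketch that times both approaches on columns explicitly. The helper names (make_frame, time_drop, time_del) and the much smaller frame size are my own choices so it runs quickly; it is not the setup that produced the numbers above, and timings will vary by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd

COLS = [2, 4, 1000]  # same labels as in the snippets above

def make_frame(rows=2000, cols=2000):
    # Much smaller than the 1.6 GB frame above so the sketch runs quickly.
    return pd.DataFrame(np.random.rand(rows, cols))

def time_drop():
    df = make_frame()
    start = time.perf_counter()
    df.drop(columns=COLS, inplace=True)  # explicitly drop columns, not rows
    return time.perf_counter() - start, df

def time_del():
    df = make_frame()
    start = time.perf_counter()
    for col in COLS:
        del df[col]  # one column per statement
    return time.perf_counter() - start, df

drop_seconds, dropped = time_drop()
del_seconds, deleted = time_del()

# Both approaches should leave the same columns behind.
print(dropped.shape, deleted.shape)
print(f"drop: {drop_seconds:.4f}s  del: {del_seconds:.4f}s")
```

A single run like this is noisy, so repeat it a few times (or use timeit) before drawing conclusions.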

KT12, answered Oct 20 '22 at 23:10