I know this may be an old debate, but which performs better on a large dataset: pandas' <code>DataFrame.drop</code> or Python's <code>del</code> statement? I am learning machine learning with Python 3 and am not sure which one to use. My data is in a pandas DataFrame, whereas <code>del</code> is built into Python itself.


python del vs pandas drop

2 Answers

Summarizing a few points about functionality:

drop operates on both columns and rows; del operates on columns only.
drop can remove multiple items at a time; del removes only one column at a time.
drop can operate in place or return a copy; del is an in-place operation only.

The documentation at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html has more details on drop's features.
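A minimal sketch of the three differences listed above. The frame and its column names (`a`, `b`, `c`) are hypothetical and chosen only for illustration.

```python
import pandas as pd


def make_frame():
    # Small hypothetical frame used only for illustration.
    return pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})


df = make_frame()

# drop can remove several columns at once; by default it returns a new frame
# and leaves df untouched.
without_ab = df.drop(columns=["a", "b"])

# drop can also remove rows, selected by index label.
without_row0 = df.drop(index=[0])

# With inplace=True, drop modifies df itself and returns None.
result = df.drop(columns=["c"], inplace=True)

# del removes exactly one column per statement, always in place.
df2 = make_frame()
del df2["a"]
```

Here `df.drop(columns=...)` and `df.drop(index=...)` are the keyword forms of `drop(labels, axis=1)` and `drop(labels, axis=0)`; spelling the axis out avoids removing rows when columns were intended.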

196

answered Oct 21 '22 01:10

flow2k

Using randomly generated data of about 1.6 GB, it appears that df.drop is faster than del, especially over multiple columns:

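import time
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(20000, 10000))
t_1 = time.time()
# With the default axis=0, labels are row labels, so this drops rows 2, 4 and 1000.
df.drop(labels=[2, 4, 1000], inplace=True)
t_2 = time.time()
print(t_2 - t_1)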

0.9118959903717041

Compared to:

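import time
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(20000, 10000))
t_3 = time.time()
# del removes one column per statement, in place.
del df[2]
del df[4]
del df[1000]
t_4 = time.time()
print(t_4 - t_3)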

4.052732944488525

@Inder's comparison is not quite the same since it doesn't use inplace=True.
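One caveat about the timings above: `drop(labels=...)` defaults to `axis=0` and therefore removes rows, while `del df[...]` removes columns, so the two snippets do not time the same operation. A like-for-like comparison could be sketched as follows. The helper names (`make_frame`, `drop_columns`, `del_columns`, `time_once`, `best_of`) and the reduced frame size are assumptions made for illustration, not part of the original benchmark, and no timings are claimed here since they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

COLS_TO_REMOVE = [2, 4, 1000]


def make_frame(rows=2000, cols=2000):
    # Much smaller than the 1.6 GB frame above so the sketch runs quickly.
    return pd.DataFrame(np.random.rand(rows, cols))


def drop_columns(df):
    # Remove all three columns in a single in-place call.
    df.drop(columns=COLS_TO_REMOVE, inplace=True)
    return df


def del_columns(df):
    # Remove the same columns one at a time with del.
    for col in COLS_TO_REMOVE:
        del df[col]
    return df


def time_once(func):
    # Build a fresh frame outside the timed region, then time only the removal.
    df = make_frame()
    start = timeit.default_timer()
    func(df)
    return timeit.default_timer() - start


def best_of(func, repeats=5):
    # Take the minimum over several runs to reduce noise.
    return min(time_once(func) for _ in range(repeats))


if __name__ == "__main__":
    print("drop:", best_of(drop_columns))
    print("del: ", best_of(del_columns))
```

Building the frame outside the timed region keeps allocation cost out of the measurement, and both functions end with identical frames, so any difference reflects the removal itself.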

answered Oct 20 '22 23:10

KT12


Tags: python, python-3.x, pandas

sagar jain
