
When should I use dt.column vs dt['column'] in pandas?

Tags:

python

pandas

I was doing some calculations and row manipulations and realised that for some tasks, such as mathematical operations, attribute access and bracket access both worked, e.g.

d['c3'] = d.c1 / d.c2
d['c3'] = d['c1'] / d['c2']

I was wondering whether there are instances where one is better than the other, or which one most people use.

Tank asked Jun 28 '17 08:06




1 Answer

You should really just stop accessing columns as attributes and get into the habit of using square brackets []. This avoids errors where a column name contains characters that are illegal in a Python identifier, has embedded spaces, or shares its name with a built-in method, and it avoids ambiguous usage where, for instance, you have a column named index:

In[13]:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(5,4), columns=[' a', 'mean', 'index', '2'])
df.columns.tolist()

Out[13]: [' a', 'mean', 'index', '2']

So if we now try to access column '2' as an attribute:

In[14]:
df.2
  File "<ipython-input-14-0490d6ae2ca0>", line 1
    df.2
       ^
SyntaxError: invalid syntax

It fails because 2 is not a valid Python identifier (identifiers cannot start with a digit), but df['2'] would work.
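A minimal sketch of the bracket form for that column, assuming the same column labels as above (the values are random, so only the type and label are checked):

```python
import numpy as np
import pandas as pd

# Same column labels as in the example above; the values are random.
df = pd.DataFrame(np.random.randn(5, 4), columns=[' a', 'mean', 'index', '2'])

# '2' is not a valid identifier, so only bracket access can reach it directly.
col = df['2']
print(type(col).__name__)  # Series
print(col.name)            # 2
```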

In[15]:

df.a
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-15-b9872a8755ac> in <module>()
----> 1 df.a

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   3079             if name in self._info_axis:
   3080                 return self[name]
-> 3081             return object.__getattribute__(self, name)
   3082 
   3083     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'a'

Because the column is really ' a' with a leading space, df.a fails with an AttributeError (attribute access would also fail if there were spaces anywhere in the column name).
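As a rough illustration, the exact label (including the leading space) works with brackets. Stripping whitespace from the labels with rename is one possible clean-up, shown here as an assumption rather than something you must do:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4), columns=[' a', 'mean', 'index', '2'])

# There is no attribute called 'a' because the label is ' a'.
print(hasattr(df, 'a'))  # False

# Bracket access takes the label exactly as stored.
col = df[' a']

# Optional clean-up: strip surrounding whitespace from every column label.
cleaned = df.rename(columns=str.strip)
print(cleaned.columns.tolist())  # ['a', 'mean', 'index', '2']
```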

In[16]:
df.mean

Out[16]: 
<bound method DataFrame.mean of           a      mean     index         2
0 -0.022122  1.858308  1.823314  0.238105
1 -0.461662  0.482116  1.848322  1.946922
2  0.615889 -0.285043  0.201804 -0.656065
3  0.159351 -1.151883 -1.858024  0.088460
4  1.066735  1.015585  0.586550 -1.898469>

This is more subtle: it looks like it did something, but in fact it just returns the bound method DataFrame.mean rather than the column 'mean'. Here IPython is just pretty printing the method's repr.
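To make the difference concrete, here is a small sketch (same assumed column labels, random values) comparing the method with the column of the same name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4), columns=[' a', 'mean', 'index', '2'])

# Attribute access finds the DataFrame.mean method, not the column.
print(callable(df.mean))  # True

# Bracket access returns the column labelled 'mean'.
mean_column = df['mean']
print(type(mean_column).__name__)  # Series

# Calling the method computes the mean of each column instead.
column_means = df.mean()
print(column_means.index.tolist())  # [' a', 'mean', 'index', '2']
```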

In[17]:
df.index

Out[17]: RangeIndex(start=0, stop=5, step=1)

Above the intention is ambiguous: because index is an attribute of the DataFrame, df.index returns the row index instead of the column 'index'.
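A short sketch of the two results side by side, again assuming the same column labels as above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4), columns=[' a', 'mean', 'index', '2'])

row_labels = df.index   # the DataFrame's row index
column = df['index']    # the column labelled 'index'

print(type(row_labels).__name__)  # RangeIndex
print(type(column).__name__)      # Series
```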

So you should stop accessing columns as attributes and always use square brackets, as this avoids all of the problems above.
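If you want to see which labels in a frame are affected, something like the following could flag them. attribute_safe is a hypothetical helper written for this answer, not part of pandas, and it only covers the cases discussed above (invalid identifiers, Python keywords, and names already defined on DataFrame):

```python
import keyword

import pandas as pd


def attribute_safe(name):
    """Hypothetical helper: True if df.<name> would plausibly reach the column."""
    return (
        isinstance(name, str)
        and name.isidentifier()             # rules out '2' and ' a'
        and not keyword.iskeyword(name)     # rules out e.g. 'class'
        and name not in dir(pd.DataFrame)   # rules out 'mean' and 'index'
    )


labels = ['c1', ' a', 'mean', 'index', '2']
unsafe = [label for label in labels if not attribute_safe(label)]
print(unsafe)  # [' a', 'mean', 'index', '2']
```

Bracket access works for every label in that list, which is why it is the safer default.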

EdChum answered Oct 16 '22 05:10