
Pandas - Fast way of accessing a column of objects' attribute

Let's say I have a custom class in Python that has the attribute val. If I have a pandas DataFrame with a column of these objects, how can I access this attribute and create a new column from its values?

Example data:

df
Out[46]: 
row   custom_object
1     foo1
2     foo2
3     foo3
4     foo4
Name: book, dtype: object

Where the custom objects are of class Foo:

class Foo:
    def __init__(self, val):
        self.val = val

The only way I know to create a new column from the instance attributes is an apply with a lambda, which is slow on large datasets:

df['custom_val'] = df['custom_object'].apply(lambda x: x.val)

Is there a more efficient way?

guy asked Sep 27 '17 19:09





1 Answer

You could use a list comprehension:

df['custom_val'] = [foo.val for foo in df['custom_object']]
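To make that concrete, here is a minimal, self-contained sketch. The three Foo values are made up for illustration and are not from the question's data. It builds the column with a list comprehension and checks that the result matches the apply/lambda version:

```python
import pandas as pd


class Foo:
    """Stand-in for the question's custom class."""

    def __init__(self, val):
        self.val = val


# Small illustrative frame; the values 1.5, 2.5, 3.5 are invented for this sketch.
df = pd.DataFrame({"custom_object": [Foo(1.5), Foo(2.5), Foo(3.5)]})

# List comprehension: iterate over the column and collect each object's attribute.
df["custom_val"] = [foo.val for foo in df["custom_object"]]

# The apply/lambda version from the question, kept only for comparison.
via_apply = df["custom_object"].apply(lambda x: x.val)

print(df["custom_val"].tolist())
print(via_apply.tolist() == df["custom_val"].tolist())
```

Note that assigning a plain list to a column is positional: the list must have the same length as the DataFrame, and no index alignment takes place.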

Timings

import numpy as np
import pandas as pd

# Set up 100k Foo objects.
vals = [np.random.randn() for _ in range(100000)]
foos = [Foo(val) for val in vals]
df = pd.DataFrame(foos, columns=['custom_object'])

# 1) OP's apply method.
%timeit df['custom_object'].apply(lambda x: x.val)
# 10 loops, best of 3: 26.7 ms per loop

# 2) Using a list comprehension instead.
%timeit [foo.val for foo in df['custom_object']]
# 100 loops, best of 3: 11.7 ms per loop

# 3) For reference, the same comprehension over the original list of objects (slightly faster than 2 above).
%timeit [foo.val for foo in foos]
# 100 loops, best of 3: 9.79 ms per loop

# 4) And a comprehension over the original list of raw values.
%timeit [val for val in vals]
# 100 loops, best of 3: 4.91 ms per loop

If you had the original list of values, you could just assign them directly:

# 5) Direct assignment to list of values.
%timeit df['v'] = vals
# 100 loops, best of 3: 5.88 ms per loop
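
The %timeit lines above only work in IPython or Jupyter. If you want to rerun the comparison as a plain script, something like the following sketch should do, using the standard-library timeit module. The helper names with_apply and with_comprehension are hypothetical, and n is set lower than the 100k used above just so it finishes quickly. Your numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


class Foo:
    def __init__(self, val):
        self.val = val


n = 10_000  # assumption: smaller than the 100k above so the script runs fast
vals = np.random.randn(n).tolist()
foos = [Foo(val) for val in vals]
df = pd.DataFrame({"custom_object": foos})


def with_apply():
    # The question's approach: apply with a lambda.
    return df["custom_object"].apply(lambda x: x.val)


def with_comprehension():
    # The answer's approach: a plain list comprehension over the column.
    return [foo.val for foo in df["custom_object"]]


# Both approaches should produce the same values before speed is compared.
assert with_apply().tolist() == with_comprehension()

for label, func in [("apply", with_apply), ("list comprehension", with_comprehension)]:
    # Best of 3 repeats, 20 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(func, number=20, repeat=3)) / 20
    print(f"{label}: {best * 1e3:.2f} ms per call")
```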
Alexander answered Oct 12 '22 22:10
