How to determine the length of lists in a pandas dataframe column

Tags: python, pandas

How can the length of the lists in the column be determined without iteration?

I have a dataframe like this:

                                                    CreationDate
2013-12-22 15:25:02                  [ubuntu, mac-osx, syslinux]
2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]
2013-12-22 15:42:00               [ubuntu, nat, squid, mikrotik]

I am calculating the length of lists in the CreationDate column and making a new Length column like this:

df['Length'] = df.CreationDate.apply(lambda x: len(x))

Which gives me this:

                                                    CreationDate  Length
2013-12-22 15:25:02                  [ubuntu, mac-osx, syslinux]       3
2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]       4
2013-12-22 15:42:00               [ubuntu, nat, squid, mikrotik]       4

Is there a more pythonic way to do this?

asked Dec 27 '16 06:12

Mohammad Yusuf

2 Answers

You can use the str accessor for some list operations as well. In this example,

df['CreationDate'].str.len()

returns the length of each list. See the docs for str.len.

df['Length'] = df['CreationDate'].str.len()
df
Out: 
                                                    CreationDate  Length
2013-12-22 15:25:02                  [ubuntu, mac-osx, syslinux]       3
2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]       4
2013-12-22 15:42:00               [ubuntu, nat, squid, mikrotik]       4
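As a small illustration of the "some list operations" mentioned above, here is a minimal sketch that applies a few str accessor methods to a Series of lists. The Series name tags and its contents are made up for this example; str.len, str[0] and str.join are existing pandas string methods that also work when the elements are lists.

```python
import pandas as pd

# Hypothetical sample data: each element is a list of tag names.
tags = pd.Series([
    ["ubuntu", "mac-osx", "syslinux"],
    ["ubuntu", "nat", "squid", "mikrotik"],
])

# Number of items in each list.
lengths = tags.str.len()

# First item of each list (indexing through the accessor).
first = tags.str[0]

# Join the items of each list into a single string.
joined = tags.str.join(", ")

print(lengths.tolist())
print(first.tolist())
print(joined.tolist())
```

Not every str method is meaningful for lists, so it is worth checking the result on a small sample before relying on it.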

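For these operations, a plain Python list comprehension is generally faster than the pandas methods, although the pandas str methods handle NaN values. Here are the timings:

import random
import string

import pandas as pd

ser = pd.Series([random.sample(string.ascii_letters,
                               random.randint(1, 20)) for _ in range(10**6)])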

%timeit ser.apply(lambda x: len(x))
1 loop, best of 3: 425 ms per loop

%timeit ser.str.len()
1 loop, best of 3: 248 ms per loop

%timeit [len(x) for x in ser]
10 loops, best of 3: 84 ms per loop

%timeit pd.Series([len(x) for x in ser], index=ser.index)
1 loop, best of 3: 236 ms per loop
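The remark about NaN handling can be seen with a small sketch. The Series ser_missing below is hypothetical data with one missing row; the point is only that str.len() leaves the missing row as NaN, while the plain list comprehension calls len(None) and fails.

```python
import pandas as pd

# Hypothetical data: the second row has no list at all.
ser_missing = pd.Series([["ubuntu", "nat"], None, ["squid"]])

# str.len() skips the missing row and returns NaN for it,
# which makes the result a float Series.
lengths = ser_missing.str.len()
print(lengths)

# The list comprehension has no such handling: len(None) raises TypeError.
try:
    plain = [len(x) for x in ser_missing]
except TypeError as exc:
    plain = None
    print("list comprehension failed:", exc)
```

So the faster list comprehension is only a safe replacement when the column is known to contain no missing values.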

answered Oct 12 '22 05:10

ayhan

pandas.Series.map(len) and pandas.Series.apply(len) take about the same time to execute, and both are slightly faster than pandas.Series.str.len().
- pandas.Series.map
- pandas.Series.apply
- pandas.Series.str.len
See also: Difference between map, applymap and apply methods in Pandas

import pandas as pd

data = {'os': [['ubuntu', 'mac-osx', 'syslinux'], ['ubuntu', 'mod-rewrite', 'laconica', 'apache-2.2'], ['ubuntu', 'nat', 'squid', 'mikrotik']]}
index = ['2013-12-22 15:25:02', '2009-12-14 14:29:32', '2013-12-22 15:42:00']

df = pd.DataFrame(data, index)

# create Length column
df['Length'] = df.os.map(len)

# display(df)
                                                              os  Length
2013-12-22 15:25:02                  [ubuntu, mac-osx, syslinux]       3
2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]       4
2013-12-22 15:42:00               [ubuntu, nat, squid, mikrotik]       4

Timings with %timeit:

import pandas as pd
import random
import string

random.seed(365)

ser = pd.Series([random.sample(string.ascii_letters, random.randint(1, 20)) for _ in range(10**6)])

%timeit ser.str.len()
252 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit ser.map(len)
220 ms ± 7.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit ser.apply(len)
222 ms ± 8.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
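The %timeit magic used above only works in IPython or Jupyter. For a plain Python script, a rough equivalent can be sketched with the standard-library timeit module. The dictionary names candidates and timings, the smaller Series size (10**4) and the repeat counts are choices made for this sketch, not part of the original benchmark, so the absolute numbers will differ.

```python
import random
import string
import timeit

import pandas as pd

random.seed(365)

# Smaller than the 10**6 rows used above so the script finishes quickly.
ser = pd.Series(
    [random.sample(string.ascii_letters, random.randint(1, 20)) for _ in range(10**4)]
)

# The approaches compared in the answers, wrapped as zero-argument callables.
candidates = {
    "str.len": lambda: ser.str.len(),
    "map(len)": lambda: ser.map(len),
    "apply(len)": lambda: ser.apply(len),
    "list comprehension": lambda: pd.Series([len(x) for x in ser], index=ser.index),
}

# Best-of-3 timing, 5 calls per run, reported as seconds per call.
calls_per_run = 5
timings = {
    name: min(timeit.repeat(func, number=calls_per_run, repeat=3)) / calls_per_run
    for name, func in candidates.items()
}

for name, seconds in timings.items():
    print(f"{name:>20}: {seconds * 1e3:.2f} ms per call")
```

Results depend on the machine and the pandas version, so the ranking is worth re-measuring locally rather than taken from any single run.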

answered Oct 12 '22 03:10

Trenton McKinney
