
How to determine the length of lists in a pandas dataframe column

Tags:

python

pandas

How can the length of the lists in the column be determined without iteration?

I have a dataframe like this:

                                                    CreationDate
2013-12-22 15:25:02                  [ubuntu, mac-osx, syslinux]
2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]
2013-12-22 15:42:00               [ubuntu, nat, squid, mikrotik]

I am calculating the length of lists in the CreationDate column and making a new Length column like this:

df['Length'] = df.CreationDate.apply(lambda x: len(x))

Which gives me this:

                                                    CreationDate  Length
2013-12-22 15:25:02                  [ubuntu, mac-osx, syslinux]       3
2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]       4
2013-12-22 15:42:00               [ubuntu, nat, squid, mikrotik]       4

Is there a more pythonic way to do this?

Mohammad Yusuf asked Dec 27 '16 06:12




2 Answers

You can use the str accessor for some list operations as well. In this example,

df['CreationDate'].str.len()

returns the length of each list. See the docs for str.len.

df['Length'] = df['CreationDate'].str.len()
df
Out: 
                                                    CreationDate  Length
2013-12-22 15:25:02                  [ubuntu, mac-osx, syslinux]       3
2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]       4
2013-12-22 15:42:00               [ubuntu, nat, squid, mikrotik]       4
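As a rough illustration of what "some list operations" can mean here, the sketch below applies a few str accessor methods to a column of lists. The column name tags and the variable names are assumptions made for this example, not part of the original question.

```python
import pandas as pd

# Illustrative data mirroring the question; the column name "tags" is an assumption.
df = pd.DataFrame(
    {
        "tags": [
            ["ubuntu", "mac-osx", "syslinux"],
            ["ubuntu", "mod-rewrite", "laconica", "apache-2.2"],
            ["ubuntu", "nat", "squid", "mikrotik"],
        ]
    },
    index=pd.to_datetime(
        ["2013-12-22 15:25:02", "2009-12-14 14:29:32", "2013-12-22 15:42:00"]
    ),
)

lengths = df["tags"].str.len()       # number of items in each list
first = df["tags"].str[0]            # first item of each list
joined = df["tags"].str.join(", ")   # each list joined into a single string

print(lengths.tolist())
print(first.tolist())
print(joined.iloc[0])
```

These methods still loop over the elements internally, so they are a convenience rather than true vectorization.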

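For these operations, a plain Python list comprehension is generally faster than the pandas methods. The advantage of str.len() is that it handles NaNs: it returns NaN for a missing value instead of raising an error. Here are timings: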

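import random
import string

import pandas as pd

# one million lists, each holding 1 to 20 random letters
ser = pd.Series([random.sample(string.ascii_letters, 
                               random.randint(1, 20)) for _ in range(10**6)])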

%timeit ser.apply(lambda x: len(x))
1 loop, best of 3: 425 ms per loop

%timeit ser.str.len()
1 loop, best of 3: 248 ms per loop

%timeit [len(x) for x in ser]
10 loops, best of 3: 84 ms per loop

%timeit pd.Series([len(x) for x in ser], index=ser.index)
1 loop, best of 3: 236 ms per loop
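To make the remark about NaNs concrete, here is a minimal sketch on a small made-up Series containing one missing value. The names str_lengths, apply_failed and safe_lengths are illustrative, and the isinstance guard is only one possible workaround, assuming you want NaN wherever the value is not a list.

```python
import numpy as np
import pandas as pd

# Small made-up Series with one missing value between two lists
ser = pd.Series([["a", "b"], np.nan, ["c"]])

# str.len() propagates the missing value, so the result is a float Series
str_lengths = ser.str.len()

# len() on a float NaN raises TypeError, so apply(len) fails on this Series
try:
    ser.apply(len)
    apply_failed = False
except TypeError:
    apply_failed = True

# One possible workaround (an assumption about the desired behaviour):
# measure only actual lists and return NaN for everything else
safe_lengths = ser.apply(lambda x: len(x) if isinstance(x, list) else np.nan)

print(str_lengths.tolist())
print(apply_failed)
print(safe_lengths.tolist())
```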
ayhan answered Oct 12 '22 05:10


  • pandas.Series.map(len) and pandas.Series.apply(len) take about the same time (roughly 220 ms in the benchmark below), and both are slightly faster than pandas.Series.str.len() (roughly 250 ms).

    • pandas.Series.map
    • pandas.Series.apply
    • pandas.Series.str.len
  • Difference between map, applymap and apply methods in Pandas

import pandas as pd

data = {'os': [['ubuntu', 'mac-osx', 'syslinux'], ['ubuntu', 'mod-rewrite', 'laconica', 'apache-2.2'], ['ubuntu', 'nat', 'squid', 'mikrotik']]}
index = ['2013-12-22 15:25:02', '2009-12-14 14:29:32', '2013-12-22 15:42:00']

df = pd.DataFrame(data, index)

# create Length column
df['Length'] = df.os.map(len)

# display(df)
                                                              os  Length
2013-12-22 15:25:02                  [ubuntu, mac-osx, syslinux]       3
2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]       4
2013-12-22 15:42:00               [ubuntu, nat, squid, mikrotik]       4

Timings with %timeit (IPython/Jupyter):

import pandas as pd
import random
import string

random.seed(365)

ser = pd.Series([random.sample(string.ascii_letters, random.randint(1, 20)) for _ in range(10**6)])

%timeit ser.str.len()
252 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit ser.map(len)
220 ms ± 7.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit ser.apply(len)
222 ms ± 8.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
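%timeit is an IPython magic, so the lines above will not run in a plain Python script. A rough equivalent using the standard timeit module might look like the sketch below. The helper name time_length_methods and the smaller default size are assumptions for this example, and the numbers it prints will differ by machine and pandas version.

```python
import random
import string
import timeit

import pandas as pd


def time_length_methods(n=10_000, repeat=3):
    """Return the best-of-`repeat` time in seconds for each approach on n random lists."""
    random.seed(365)
    ser = pd.Series(
        [random.sample(string.ascii_letters, random.randint(1, 20)) for _ in range(n)]
    )
    candidates = {
        "str.len": lambda: ser.str.len(),
        "map(len)": lambda: ser.map(len),
        "apply(len)": lambda: ser.apply(len),
        "list comprehension": lambda: [len(x) for x in ser],
    }
    # number=1 runs each callable once per repetition; keep the fastest repetition
    return {
        name: min(timeit.repeat(fn, number=1, repeat=repeat))
        for name, fn in candidates.items()
    }


timings = time_length_methods()
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.2f} ms")
```

Pass n=10**6 to match the size used in the benchmarks above.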
Trenton McKinney answered Oct 12 '22 03:10