I have a word list like the following:
wordlist = ['p1','p2','p3','p4','p5','p6','p7']
And the DataFrame is like the following:
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'path': ["p1,p2,p3,p4", "p1,p2,p1", "p1,p5,p5,p7", "p1,p2,p3,p3"]})
output:
id path
1 p1,p2,p3,p4
2 p1,p2,p1
3 p1,p5,p5,p7
4 p1,p2,p3,p3
I want to count how many times each word appears in each path to get the following output. Is it possible to get this kind of transformation?
id p1 p2 p3 p4 p5 p6 p7
1 1 1 1 1 0 0 0
2 2 1 0 0 0 0 0
3 1 0 0 0 2 0 1
4 1 1 2 0 0 0 0
I think this would be efficient:
# create Series with dictionaries
>>> from collections import Counter
>>> c = df["path"].str.split(',').apply(Counter)
>>> c
0 {u'p2': 1, u'p3': 1, u'p1': 1, u'p4': 1}
1 {u'p2': 1, u'p1': 2}
2 {u'p1': 1, u'p7': 1, u'p5': 2}
3 {u'p2': 1, u'p3': 2, u'p1': 1}
# create DataFrame
>>> pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})
p1 p2 p3 p4 p5 p6 p7
0 1 1 1 1 0 0 0
1 2 1 0 0 0 0 0
2 1 0 0 0 2 0 1
3 1 1 2 0 0 0 0
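If you also want the id column in front, as in the desired output, one option (a small sketch building on the counts frame above) is to concatenate it back on:
>>> counts = pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})
>>> pd.concat([df["id"], counts], axis=1)
   id  p1  p2  p3  p4  p5  p6  p7
0   1   1   1   1   1   0   0   0
1   2   2   1   0   0   0   0   0
2   3   1   0   0   0   2   0   1
3   4   1   1   2   0   0   0   0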
Another way to do this:
>>> dfN = df["path"].str.split(',').apply(lambda x: pd.Series(Counter(x)))
>>> pd.DataFrame(dfN, columns=wordlist).fillna(0)
p1 p2 p3 p4 p5 p6 p7
0 1 1 1 1 0 0 0
1 2 1 0 0 0 0 0
2 1 0 0 0 2 0 1
3 1 1 2 0 0 0 0
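Note that fillna(0) leaves you with float columns (the missing entries were NaN), so if you want integer counts you can cast back:
>>> pd.DataFrame(dfN, columns=wordlist).fillna(0).astype(int)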
Some rough tests for performance:
>>> from timeit import timeit
>>> dfL = pd.concat([df]*100)
>>> timeit('c = dfL["path"].str.split(",").apply(Counter); d = pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd; from collections import Counter', number=100)
0.7363274283027295
>>> timeit('splitted = dfL["path"].str.split(","); d = pd.DataFrame({name : splitted.apply(lambda x: x.count(name)) for name in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd', number=100)
0.5305424618886718
# now let's make wordlist larger
>>> from string import ascii_lowercase as lowercase, ascii_uppercase as uppercase
>>> wordlist = wordlist + list(lowercase) + list(uppercase)
>>> timeit('c = dfL["path"].str.split(",").apply(Counter); d = pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd; from collections import Counter', number=100)
1.765344003293876
>>> timeit('splitted = dfL["path"].str.split(","); d = pd.DataFrame({name : splitted.apply(lambda x: x.count(name)) for name in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd', number=100)
2.33328927599905
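For reference, the second variant timed above (counting tokens with list.count instead of building dictionaries), written out as ordinary code:
>>> splitted = df["path"].str.split(",")
>>> pd.DataFrame({name: splitted.apply(lambda x: x.count(name)) for name in wordlist})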
After reading this topic I've found that Counter is really slow. You can optimize it a bit by using defaultdict:
>>> from collections import defaultdict
>>> def create_dict(x):
... d = defaultdict(int)
... for c in x:
... d[c] += 1
... return d
>>> c = df["path"].str.split(",").apply(create_dict)
>>> pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})
p1 p2 p3 p4 p5 p6 p7
0 1 1 1 1 0 0 0
1 2 1 0 0 0 0 0
2 1 0 0 0 2 0 1
3 1 1 2 0 0 0 0
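One small caveat: looking up a missing key with x[n] on a defaultdict silently inserts that key with value 0, so the dictionaries in c grow as a side effect. If you'd rather keep them untouched, x.get(n, 0) gives the same result:
>>> pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})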
And some tests:
>>> timeit('c = dfL["path"].str.split(",").apply(create_dict); d = pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})', 'from __main__ import dfL, wordlist, create_dict; import pandas as pd; from collections import defaultdict', number=100)
0.45942801555111146
# now let's make wordlist larger
>>> wordlist = wordlist + list(lowercase) + list(uppercase)
>>> timeit('c = dfL["path"].str.split(",").apply(create_dict); d = pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})', 'from __main__ import dfL, wordlist, create_dict; import pandas as pd; from collections import defaultdict', number=100)
1.5798653213942089