I have a word list like the following:
wordlist = ['p1','p2','p3','p4','p5','p6','p7']
And the DataFrame is like the following:
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'path': ["p1,p2,p3,p4", "p1,p2,p1", "p1,p5,p5,p7", "p1,p2,p3,p3"]})
output:
id path
1 p1,p2,p3,p4
2 p1,p2,p1
3 p1,p5,p5,p7
4 p1,p2,p3,p3
I want to count how many times each word appears in each path to get the following output. Is it possible to get this kind of transformation?
id p1 p2 p3 p4 p5 p6 p7
1 1 1 1 1 0 0 0
2 2 1 0 0 0 0 0
3 1 0 0 0 2 0 1
4 1 1 2 0 0 0 0
I think this would be efficient:
# create Series with dictionaries
>>> from collections import Counter
>>> c = df["path"].str.split(',').apply(Counter)
>>> c
0 {u'p2': 1, u'p3': 1, u'p1': 1, u'p4': 1}
1 {u'p2': 1, u'p1': 2}
2 {u'p1': 1, u'p7': 1, u'p5': 2}
3 {u'p2': 1, u'p3': 2, u'p1': 1}
# create DataFrame
>>> pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})
p1 p2 p3 p4 p5 p6 p7
0 1 1 1 1 0 0 0
1 2 1 0 0 0 0 0
2 1 0 0 0 2 0 1
3 1 1 2 0 0 0 0
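If you also want the id column in front, as in the desired output, one option (a small sketch building on the counts frame above) is to concatenate it back on:
>>> counts = pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})
>>> pd.concat([df["id"], counts], axis=1)
   id  p1  p2  p3  p4  p5  p6  p7
0   1   1   1   1   1   0   0   0
1   2   2   1   0   0   0   0   0
2   3   1   0   0   0   2   0   1
3   4   1   1   2   0   0   0   0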
Another way to do this:
>>> dfN = df["path"].str.split(',').apply(lambda x: pd.Series(Counter(x)))
>>> pd.DataFrame(dfN, columns=wordlist).fillna(0)
p1 p2 p3 p4 p5 p6 p7
0 1 1 1 1 0 0 0
1 2 1 0 0 0 0 0
2 1 0 0 0 2 0 1
3 1 1 2 0 0 0 0
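Note that fillna(0) leaves you with float columns (the missing entries were NaN), so if you want integer counts you can cast back:
>>> pd.DataFrame(dfN, columns=wordlist).fillna(0).astype(int)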
Some rough tests for performance:
>>> from timeit import timeit
>>> dfL = pd.concat([df]*100)
>>> timeit('c = dfL["path"].str.split(",").apply(Counter); d = pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd; from collections import Counter', number=100)
0.7363274283027295
>>> timeit('splitted = dfL["path"].str.split(","); d = pd.DataFrame({name : splitted.apply(lambda x: x.count(name)) for name in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd', number=100)
0.5305424618886718
# now let's make wordlist larger
>>> from string import ascii_lowercase as lowercase, ascii_uppercase as uppercase
>>> wordlist = wordlist + list(lowercase) + list(uppercase)
>>> timeit('c = dfL["path"].str.split(",").apply(Counter); d = pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd; from collections import Counter', number=100)
1.765344003293876
>>> timeit('splitted = dfL["path"].str.split(","); d = pd.DataFrame({name : splitted.apply(lambda x: x.count(name)) for name in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd', number=100)
2.33328927599905
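For reference, the second variant timed above (counting tokens with list.count instead of building dictionaries), written out as ordinary code:
>>> splitted = df["path"].str.split(",")
>>> pd.DataFrame({name: splitted.apply(lambda x: x.count(name)) for name in wordlist})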
After reading this topic I've found that Counter is really slow. You can optimize it a bit by using defaultdict:
>>> from collections import defaultdict
>>> def create_dict(x):
... d = defaultdict(int)
... for c in x:
... d[c] += 1
... return d
>>> c = df["path"].str.split(",").apply(create_dict)
>>> pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})
p1 p2 p3 p4 p5 p6 p7
0 1 1 1 1 0 0 0
1 2 1 0 0 0 0 0
2 1 0 0 0 2 0 1
3 1 1 2 0 0 0 0
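One small caveat: looking up a missing key with x[n] on a defaultdict silently inserts that key with value 0, so the dictionaries in c grow as a side effect. If you'd rather keep them untouched, x.get(n, 0) gives the same result:
>>> pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})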
And some tests:
>>> timeit('c = dfL["path"].str.split(",").apply(create_dict); d = pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})', 'from __main__ import dfL, wordlist, create_dict; import pandas as pd; from collections import defaultdict', number=100)
0.45942801555111146
# now let's make wordlist larger
>>> wordlist = wordlist + list(lowercase) + list(uppercase)
>>> timeit('c = dfL["path"].str.split(",").apply(create_dict); d = pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})', 'from __main__ import dfL, wordlist, create_dict; import pandas as pd; from collections import defaultdict', number=100)
1.5798653213942089