I have a CSV file that goes like this:
index,labels
1,created the tower
2,destroyed the tower
3,created the swimming pool
4,destroyed the swimming pool
Now, suppose I pass a list of the columns I want in place of the labels column (the list doesn't contain every word that appears in labels):
['created','tower','destroyed','swimming pool']
I want to obtain this dataframe:
index,created,destroyed,tower,swimming pool
1,1,0,1,0
2,0,1,1,0
3,1,0,0,1
4,0,1,0,1
I looked into get_dummies, but that didn't help.
import re
import pandas as pd

df = pd.DataFrame({'index': [1, 2, 3, 4],
                   'labels': ['created the tower', 'destroyed the tower',
                              'created the swimming pool', 'destroyed the swimming pool']})
columns = ['created', 'destroyed', 'tower', 'swimming pool']

# One capturing group per keyword, joined with '|'
pat = '|'.join(['({})'.format(re.escape(c)) for c in columns])

# Each match lands on its own row; counting matches per original row gives the indicators
result = df['labels'].str.extractall(pat).groupby(level=0).count()
result.columns = columns
print(result)
yields
   created  destroyed  tower  swimming pool
0        1          0      1              0
1        0          1      1              0
2        1          0      0              1
3        0          1      0              1
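The result above uses the default RangeIndex. If you also want the original index column from df alongside the indicator columns, one small addition (my own, on top of the answer's code) is to join it back:

# join aligns on the RangeIndex shared by df and result; a row whose label
# matched no keyword would be missing from result and show NaN here
out = df[['index']].join(result)
print(out)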
Most of the work is done by str.extractall:
In [808]: df['labels'].str.extractall(r'(created)|(destroyed)|(tower)|(swimming pool)')
Out[808]:
                0          1      2              3
  match
0 0       created        NaN    NaN            NaN
  1           NaN        NaN  tower            NaN
1 0           NaN  destroyed    NaN            NaN
  1           NaN        NaN  tower            NaN
2 0       created        NaN    NaN            NaN
  1           NaN        NaN    NaN  swimming pool
3 0           NaN  destroyed    NaN            NaN
  1           NaN        NaN    NaN  swimming pool
Since each match is placed on its own row, the desired result can be obtained by doing a groupby/count operation where we group by the first level of the index (the original index).
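A small variant (my own, not part of the original answer): if a keyword could occur more than once within a single label, count() would report 2, 3, ... rather than a 0/1 flag. Assuming the same df, pat, and columns as above, you can clip the result to pure indicators:

indicators = (df['labels'].str.extractall(pat)
              .notna()           # True where a group captured something
              .groupby(level=0)  # group by the original row index
              .max()             # any match for this keyword in this row?
              .astype(int))      # booleans -> 0/1
indicators.columns = columns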
Note that the Python re module (at least in the Python 3.4 version shown in the traceback below) has a hard-coded limit on the number of capture groups allowed in a pattern, even though the error message speaks of "named groups":
/usr/lib/python3.4/sre_compile.py in compile(p, flags)
577 if p.pattern.groups > 100:
578 raise AssertionError(
--> 579 "sorry, but this version only supports 100 named groups"
580 )
581
AssertionError: sorry, but this version only supports 100 named groups
This limits the extractall approach used above to a maximum of 100 keywords.
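If you genuinely need more than 100 keywords, one possible workaround (a sketch of my own, not from the original answer; extractall_chunked is a hypothetical helper) is to split the keywords into chunks, run extractall per chunk, and concatenate the pieces:

import re
import pandas as pd

def extractall_chunked(ser, keywords, chunk_size=100):
    pieces = []
    for i in range(0, len(keywords), chunk_size):
        chunk = list(keywords[i:i + chunk_size])
        pat = '|'.join('({})'.format(re.escape(c)) for c in chunk)
        counts = ser.str.extractall(pat).groupby(level=0).count()
        counts.columns = chunk
        # extractall drops rows with no match in this chunk;
        # reindex puts them back with a count of 0
        pieces.append(counts.reindex(ser.index, fill_value=0))
    return pd.concat(pieces, axis=1)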
Here is a benchmark which suggests that cᴏʟᴅsᴘᴇᴇᴅ's solution (at least for a certain range of use cases) may be the fastest:
In [76]: %timeit using_contains(ser, keywords)
10 loops, best of 3: 63.4 ms per loop
In [77]: %timeit using_defchararray(ser, keywords)
10 loops, best of 3: 90.6 ms per loop
In [78]: %timeit using_extractall(ser, keywords)
10 loops, best of 3: 126 ms per loop
Here is the setup I used:
import re
import string
import numpy as np
import pandas as pd
def using_defchararray(ser, keywords):
    """
    https://stackoverflow.com/a/46046558/190597 (piRSquared)
    """
    v = ser.values.astype(str)
    # >>> (np.core.defchararray.find(v[:, None], columns) >= 0)
    # array([[ True, False,  True, False],
    #        [False,  True,  True, False],
    #        [ True, False, False,  True],
    #        [False,  True, False,  True]], dtype=bool)
    result = pd.DataFrame(
        (np.core.defchararray.find(v[:, None], keywords) >= 0).astype(int),
        index=ser.index, columns=keywords)
    return result

def using_extractall(ser, keywords):
    """
    https://stackoverflow.com/a/46046417/190597 (unutbu)
    """
    pat = '|'.join(['({})'.format(re.escape(c)) for c in keywords])
    result = ser.str.extractall(pat).groupby(level=0).count()
    result.columns = keywords
    return result

def using_contains(ser, keywords):
    """
    https://stackoverflow.com/a/46046142/190597 (cᴏʟᴅsᴘᴇᴇᴅ)
    """
    return (pd.concat([ser.str.contains(x) for x in keywords],
                      axis=1, keys=keywords).astype(int))

def make_random_str_array(letters=string.ascii_letters, strlen=10, size=100):
    return (np.random.choice(list(letters), size*strlen)
            .view('|U{}'.format(strlen)))

keywords = make_random_str_array(size=99)
arr = np.random.choice(keywords, size=(1000, 5), replace=True)
ser = pd.Series([' '.join(row) for row in arr])
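As a quick sanity check before timing (my own addition, not part of the original benchmark), you can verify that the three functions agree. Note that using_extractall counts occurrences rather than returning strict 0/1 flags, and it drops rows in which no keyword matched, so compare it as a presence indicator on its own index:

ref = using_contains(ser, list(keywords))
dca = using_defchararray(ser, list(keywords))
ext = using_extractall(ser, list(keywords))

# the contains- and defchararray-based results should be identical 0/1 frames
assert (ref == dca).all().all()
# extractall counts matches, so compare presence (count > 0) on the rows it returns
assert ((ext > 0).astype(int) == ref.loc[ext.index]).all().all()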
Be sure to run the benchmarks on your own machine, with a setup similar to your use case. Results may vary due to numerous factors, such as the size of the Series ser, the number and length of the keywords, your hardware, your OS, the versions of NumPy, Pandas, and Python, and how they were compiled.