
Pandas generate columns from single column of strings

Tags:

python

pandas

I have a csv file which goes like this:

index,labels
1,created the tower
2,destroyed the tower
3,created the swimming pool
4,destroyed the swimming pool

Now, suppose I pass a list of the columns I want in place of the labels column (the list does not contain every word that appears in the labels column):

['created','tower','destroyed','swimming pool']

I want to obtain the dataframe as:

index,created,destroyed,tower,swimming pool
1,1,0,1,0
2,0,1,1,0
3,1,0,0,1
4,0,1,0,1

I looked into get_dummies, but that didn't help.

asked Sep 05 '17 by GKS


1 Answer

import re
import pandas as pd

df = pd.DataFrame({'index': [1, 2, 3, 4],
                   'labels': ['created the tower', 'destroyed the tower',
                              'created the swimming pool',
                              'destroyed the swimming pool']})

columns = ['created', 'destroyed', 'tower', 'swimming pool']
# one capture group per keyword; re.escape makes each keyword match literally
pat = '|'.join(['({})'.format(re.escape(c)) for c in columns])
# extractall puts each match on its own row; counting matches per
# original row (index level 0) gives one column per keyword
result = df['labels'].str.extractall(pat).groupby(level=0).count()
result.columns = columns
print(result)

yields

   created  destroyed  tower  swimming pool
0        1          0      1              0
1        0          1      1              0
2        1          0      0              1
3        0          1      0              1

Most of the work is done by str.extractall:

In [808]: df['labels'].str.extractall(r'(created)|(destroyed)|(tower)|(swimming pool)')
Out[808]: 
               0          1      2              3
  match                                          
0 0      created        NaN    NaN            NaN
  1          NaN        NaN  tower            NaN
1 0          NaN  destroyed    NaN            NaN
  1          NaN        NaN  tower            NaN
2 0      created        NaN    NaN            NaN
  1          NaN        NaN    NaN  swimming pool
3 0          NaN  destroyed    NaN            NaN
  1          NaN        NaN    NaN  swimming pool

Since each match is placed on its own row, the desired result can be obtained by doing a groupby/count operation where we group by the first level of the index (the original index).
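For example, applying the count to the extractall result above (the prompt number is illustrative) recovers the 0/1 matrix, still with positional column names:

In [809]: df['labels'].str.extractall(pat).groupby(level=0).count()
Out[809]: 
   0  1  2  3
0  1  0  1  0
1  0  1  1  0
2  1  0  0  1
3  0  1  0  1

Renaming the columns to the keyword list then yields the desired result.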


Note that in Python versions before 3.7, the re module had a hard-coded limit on the number of groups allowed in a pattern (the error message says "named groups", but it applies to all capture groups):

/usr/lib/python3.4/sre_compile.py in compile(p, flags)
    577     if p.pattern.groups > 100:
    578         raise AssertionError(
--> 579             "sorry, but this version only supports 100 named groups"
    580             )
    581 

AssertionError: sorry, but this version only supports 100 named groups

On those Python versions, this caps the extractall approach used above at a maximum of 100 keywords.
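If you need more than 100 keywords on an affected Python version, one possible workaround (a sketch of my own, not part of the original answer) is to split the keywords into chunks small enough to stay under the group limit and concatenate the partial counts:

import re
import pandas as pd

def extractall_counts(ser, keywords, chunk_size=99):
    """Count keyword occurrences per row, keeping each pattern
    under the 100-group limit of pre-3.7 Python."""
    pieces = []
    for i in range(0, len(keywords), chunk_size):
        chunk = keywords[i:i + chunk_size]
        pat = '|'.join('({})'.format(re.escape(c)) for c in chunk)
        counts = ser.str.extractall(pat).groupby(level=0).count()
        counts.columns = chunk
        pieces.append(counts)
    # extractall drops rows with no matches, so reindex and fill with 0
    return pd.concat(pieces, axis=1).reindex(ser.index).fillna(0).astype(int)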


Here is a benchmark which suggests that cᴏʟᴅsᴘᴇᴇᴅ's solution (at least for a certain range of use cases) may be the fastest:

In [76]: %timeit using_contains(ser, keywords)
10 loops, best of 3: 63.4 ms per loop

In [77]: %timeit using_defchararray(ser, keywords)
10 loops, best of 3: 90.6 ms per loop

In [78]: %timeit using_extractall(ser, keywords)
10 loops, best of 3: 126 ms per loop

Here is the setup I used:

import re
import string
import numpy as np
import pandas as pd

def using_defchararray(ser, keywords):
    """
    https://stackoverflow.com/a/46046558/190597 (piRSquared)
    """
    v = ser.values.astype(str)
    # >>> (np.core.defchararray.find(v[:, None], keywords) >= 0)
    # array([[ True, False,  True, False],
    #        [False,  True,  True, False],
    #        [ True, False, False,  True],
    #        [False,  True, False,  True]], dtype=bool)

    result = pd.DataFrame(
        (np.core.defchararray.find(v[:, None], keywords) >= 0).astype(int),
        index=ser.index, columns=keywords)
    return result

def using_extractall(ser, keywords):
    """
    https://stackoverflow.com/a/46046417/190597 (unutbu)
    """
    pat = '|'.join(['({})'.format(re.escape(c)) for c in keywords])
    result = (ser.str.extractall(pat)).groupby(level=0).count()
    result.columns = keywords
    return result

def using_contains(ser, keywords):
    """
    https://stackoverflow.com/a/46046142/190597 (cᴏʟᴅsᴘᴇᴇᴅ)
    """
    # note: str.contains treats each keyword as a regex by default;
    # pass regex=False (or re.escape the keyword) for literal matching
    return (pd.concat([ser.str.contains(x) for x in keywords], 
                      axis=1, keys=keywords).astype(int))

def make_random_str_array(letters=string.ascii_letters, strlen=10, size=100):
    return (np.random.choice(list(letters), size*strlen)
            .view('|U{}'.format(strlen)))

keywords = make_random_str_array(size=99)
arr = np.random.choice(keywords, size=(1000, 5),replace=True)
ser = pd.Series([' '.join(row) for row in arr])

Be sure to check benchmarks on your own machine, with a setup similar to your use case. Results may vary due to numerous factors, such as the size of the Series (ser), the number and length of the keywords, the hardware, the OS, the versions of NumPy, Pandas, and Python, and how they were compiled.
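As a quick sanity check before timing (my own addition, not in the original benchmark), the three functions can be compared on the same data; extractall counts repeated occurrences and drops unmatched rows, so its output is booleanized and reindexed first:

kw = list(keywords)
a = using_contains(ser, kw)
b = using_defchararray(ser, kw)
# booleanize the counts and restore unmatched rows before comparing
c = (using_extractall(ser, kw).reindex(ser.index).fillna(0) > 0).astype(int)
assert a.equals(b) and a.equals(c)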

answered Sep 19 '22 by unutbu