Given a list of animals, like: <pre class="prettyprint lang-py prettyprint-override"><code>animals = ['cat', 'dog', 'hamster', 'dolphin'] </code></pre> and pandas dataframe, df: <pre class="prettyprint"><code>id animals 1 dog,cat 2 dog 3 cat,dolphin 4 cat,dog 5 hamster,dolphin </code></pre> I want to get a new dataframe showing occurrence of each animal, something like: <pre class="prettyprint"><code>animal ids cat 1,3,4 dog 1,2,4 hamster 5 dolphin 3,5 </code></pre> I know I can run a loop and prepare it, but I have the list of over 80,000 words with dataframe of over 1 million rows, so it would take long to do it using loop. Is there an easier and faster method to get the result using dataframe?

Let us try <code>get_dummies</code> then <code>dot</code> <pre class="prettyprint"><code>df.animals.str.get_dummies(',').T.dot(df.id.astype(str)+',').str[:-1] Out[307]: cat 1,3,4 dog 1,2,4 dolphin 3,5 hamster 5 dtype: object </code></pre> If would considered the list add <code>reindex</code> <pre class="prettyprint"><code>df.animals.str.get_dummies(',').T.dot(df.id.astype(str)+',').str[:-1].reindex(animals) Out[308]: cat 1,3,4 dog 1,2,4 hamster 5 dolphin 3,5 dtype: object </code></pre>

get list of occurrences using pandas

Tags:

performance

python

pandas

dataframe

numpy

Given a list of animals, like:

animals = ['cat', 'dog', 'hamster', 'dolphin']

and pandas dataframe, df:

id    animals
1     dog,cat
2     dog
3     cat,dolphin
4     cat,dog
5     hamster,dolphin

I want to get a new dataframe showing occurrence of each animal, something like:

animal    ids
cat       1,3,4
dog       1,2,4
hamster   5        
dolphin   3,5

I know I can run a loop and prepare it, but I have the list of over 80,000 words with dataframe of over 1 million rows, so it would take long to do it using loop. Is there an easier and faster method to get the result using dataframe?

257

asked Aug 21 '20 20:08

Ahmet Cetin

3 Answers

Let us try get_dummies then dot

df.animals.str.get_dummies(',').T.dot(df.id.astype(str)+',').str[:-1]
Out[307]: 
cat        1,3,4
dog        1,2,4
dolphin      3,5
hamster        5
dtype: object

If would considered the list add reindex

df.animals.str.get_dummies(',').T.dot(df.id.astype(str)+',').str[:-1].reindex(animals)
Out[308]: 
cat        1,3,4
dog        1,2,4
hamster        5
dolphin      3,5
dtype: object

111

answered Oct 11 '22 06:10

BENY

NumPy based one for perf. -

def list_occ(df):
    id_col='id'
    item_col='animals'
    
    sidx = np.argsort(animals)
    s = [i.split(',') for i in df[item_col]]
    d = np.concatenate(s)
    
    p = sidx[np.searchsorted(animals, d, sorter=sidx)]
    C = np.bincount(p, minlength=len(animals))
    
    l = list(map(len,s))
    r = np.repeat(np.arange(len(l)), l)
    v = df[id_col].values[r[np.lexsort((r,p))]]
    
    out = pd.DataFrame({'ids':np.split(v, C[:-1].cumsum())}, index=animals)
    return out

Sample run -

In [41]: df
Out[41]: 
  id          animals
0  1          dog,cat
1  2              dog
2  3      cat,dolphin
3  4          cat,dog
4  5  hamster,dolphin

In [42]: animals
Out[42]: ['cat', 'dog', 'hamster', 'dolphin']

In [43]: list_occ(df)
Out[43]: 
               ids
cat      [1, 3, 4]
dog      [1, 2, 4]
hamster        [5]
dolphin     [3, 5]

Benchmarking

Using the given sample and simply scale up the number of items.

# Setup
N = 100 # scale factor
s = [i.split(',') for i in df['animals']]
df_big = pd.DataFrame({'animals':[[j+str(ID) for j in i] for i in s for ID in range(1,N+1)]})
df_big['id'] = range(1, len(df_big)+1)
animals = np.unique(np.concatenate(df_big.animals)).tolist()
df_big['animals'] = [','.join(i) for i in df_big.animals]
df = df_big

Timings -

# Using given df & scaling it up by replicating elems with progressive IDs
In [9]: N = 100 # scale factor
   ...: s = [i.split(',') for i in df['animals']]
   ...: df_big = pd.DataFrame({'animals':[[j+str(ID) for j in i] for i in s for ID in range(1,N+1)]})
   ...: df_big['id'] = range(1, len(df_big)+1)
   ...: animals = np.unique(np.concatenate(df_big.animals)).tolist()
   ...: df_big['animals'] = [','.join(i) for i in df_big.animals]
   ...: df = df_big

# @BEN_YO's soln-1
In [10]: %timeit df.animals.str.get_dummies(',').T.dot(df.id.astype(str)+',').str[:-1]
163 ms ± 2.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# @BEN_YO's soln-2
In [11]: %timeit df.animals.str.get_dummies(',').T.dot(df.id.astype(str)+',').str[:-1].reindex(animals)
166 ms ± 4.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# @Andy L.'s soln
%timeit (df.astype(str).assign(animals=df.animals.str.split(',')).explode('animals').groupby('animals').id.agg(','.join).reset_index())
13.4 ms ± 74 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [12]: %timeit list_occ(df)
2.81 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

answered Oct 11 '22 04:10

Divakar

Use str.split, explode and agg.join

df_final = (df.astype(str).assign(animals=df.animals.str.split(','))
                          .explode('animals').groupby('animals').id.agg(','.join)
                          .reset_index())

Out[155]:
   animals     id
0      cat  1,3,4
1      dog  1,2,4
2  dolphin    3,5
3  hamster      5