What is the most efficient way to search for substrings in a Pandas DataFrame?

I have a Pandas DataFrame containing 75k rows of text (approx. 350 chars per row). I need to count the occurrences of a list of 45k substrings within that DataFrame.

The expected output is an authors_data dict containing the list of authors and the number of occurrences. The following code assumes I have a dataframe['text'] column and a list of substrings named authors_list.

authors_data = {}
for author in authors_list:
    count = 0
    for i, row in df.iterrows():
        if author in row.text:
            count += 1
    authors_data[author] = count
    print(author, authors_data[author])

I ran some initial tests: 10 authors took about 50 seconds, and at that rate the complete table will take a few days to run. So I'm looking for more time-efficient ways to run the code.

Is df.iterrows() fast enough? Are there any specific libraries that I should look into?

Let me know!

— sovnheim


2 Answers

I tried this and it does what you are looking for. You could test whether it's faster:

for author in authors_list:
    authors_data[author] = df['AUTHORCOL'].map(lambda x: author in x).sum()
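
Here, map applies the substring test to every cell of the column and returns a Boolean Series, and sum counts the True values, i.e. the number of rows containing that author ('AUTHORCOL' stands in for your text column, df['text'] in the question).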
— Olivier Turcotte


#1 Delimited values

If your authors are clearly delimited, e.g. comma-separated in each series element, you can use collections.Counter with itertools.chain:

from collections import Counter
from itertools import chain

res = Counter(chain.from_iterable(df['Authors'].str.split(',').map(set)))

# Counter({'Frank Herbert': 1, 'George Orwell': 2, 'John Steinbeck': 1,
#          'John Williams': 2, 'Philip K Dick': 1, 'Philip Roth': 1,
#          'Ursula K Le Guin': 1})
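
Since Counter is a dict subclass, res can be used directly as the authors_data mapping the question asks for. The map(set) step deduplicates within each row, so an author listed twice in one row is counted once.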

#2 Arbitrary strings

Of course, such structured data isn't always available. If your series elements are strings with arbitrary data and your list of pre-defined authors is small, you can use pd.Series.str.contains.

L = ['George Orwell', 'John Steinbeck', 'Frank Herbert', 'John Williams']

res = {i: df['Authors'].str.contains(i, regex=False).sum() for i in L}

# {'Frank Herbert': 1, 'George Orwell': 2, 'John Steinbeck': 1, 'John Williams': 2}

This works because pd.Series.str.contains returns a series of Boolean values, which you can then sum, since True is treated as 1 in numeric computations in Python / Pandas. We set regex=False for plain substring matching, which also improves performance.
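
As a minimal illustration with made-up data:

import pandas as pd

s = pd.Series(['George Orwell,John Steinbeck', 'Frank Herbert'])
mask = s.str.contains('George Orwell', regex=False)

# 0     True
# 1    False
# dtype: bool

mask.sum()  # 1, because True counts as 1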

Performance

Pandas string-based methods are notoriously slow. You can instead use sum with a generator expression and the in operator for an extra speed-up:

df = pd.concat([df]*100000)

%timeit {i: df['Authors'].str.contains(i, regex=False).sum() for i in L}    # 420 ms
%timeit {i: sum(i in x for x in df['Authors'].values) for i in L}           # 235 ms
%timeit {i: df['Authors'].map(lambda x: i in x).sum() for i in L}           # 424 ms

Notice, for scenario #1, Counter methods are actually more expensive because they require splitting as a preliminary step:

chainer = chain.from_iterable

%timeit Counter(chainer([set(i.split(',')) for i in df['Authors'].values]))  # 650 ms
%timeit Counter(chainer(df['Authors'].str.split(',').map(set)))              # 828 ms

Further improvements

  1. Solutions for scenario #2 are not perfect, since they won't differentiate (for example) between John Williams and John Williamson. You may wish to use a specialist package if this kind of differentiation is important to you.
  2. For both #1 and #2, you may wish to consider the Aho-Corasick algorithm, which matches many substrings against a text in a single pass; a sketch follows this list, although more work may be required to produce a count of elements found within each row.
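
For reference, here is a rough sketch of the Aho-Corasick approach, assuming the third-party pyahocorasick package (pip install pyahocorasick, import name ahocorasick); the sample data and variable names mirror the question. It builds the automaton once over all author names, then scans each row in a single pass, so search time does not grow with the number of substrings. Note it shares the John Williams / John Williamson caveat from point 1.

import ahocorasick
import pandas as pd

df = pd.DataFrame({'text': ['George Orwell wrote 1984',
                            'Dune is by Frank Herbert',
                            'George Orwell also wrote Animal Farm']})
authors_list = ['George Orwell', 'Frank Herbert', 'John Williams']

# Build the automaton once over all author names.
automaton = ahocorasick.Automaton()
for author in authors_list:
    automaton.add_word(author, author)
automaton.make_automaton()

# Count the number of rows containing each author.
authors_data = {author: 0 for author in authors_list}
for text in df['text'].values:
    # iter() yields (end_index, value) for every match in the text;
    # the set comprehension deduplicates so each author counts at
    # most once per row.
    for author in {value for _, value in automaton.iter(text)}:
        authors_data[author] += 1

print(authors_data)
# {'George Orwell': 2, 'Frank Herbert': 1, 'John Williams': 0}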

Setup

df = pd.DataFrame({'Authors': ['Ursula K Le Guin,Philip K Dick,Frank Herbert,Ursula K Le Guin',
                               'John Williams,Philip Roth,John Williams,George Orwell',
                               'George Orwell,John Steinbeck,George Orwell,John Williams']})
— jpp