I have a Pandas DataFrame containing 75k rows of text (approx. 350 chars per row). I need to count the occurrences of a list of 45k substrings within that DataFrame.
Expected output is an authors_data dict containing the list of authors and the number of occurrences. The following code assumes I have a df['text'] column and a list of substrings named authors_list.
authors_data = {}
for author in authors_list:
    count = 0
    for i, row in df.iterrows():
        if author in row.text:
            count += 1
    authors_data[author] = count
    print(author, authors_data[author])
I ran some initial tests: 10 authors took me about 50 seconds, so the complete list would take a few days to run. I'm looking for a more time-efficient way to run the code.
Is df.iterrows() fast enough? Are there any specific libraries that I should look into?
Let me know!
I tried this and it does what you're looking for. You could test it and see if it's faster:
for author in authors_list:
    authors_data[author] = df['AUTHORCOL'].map(lambda x: author in x).sum()
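Wrapped over the whole list, that becomes a dict comprehension (a sketch assuming the same authors_list and AUTHORCOL names as above):

authors_data = {
    author: df['AUTHORCOL'].map(lambda x: author in x).sum()
    for author in authors_list
}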
If your authors are clearly delineated, e.g. comma-separated in each series element, you can use collections.Counter with itertools.chain (mapping each row to a set, so an author is counted at most once per row):
from collections import Counter
from itertools import chain
res = Counter(chain.from_iterable(df['Authors'].str.split(',').map(set)))
# Counter({'Frank Herbert': 1, 'George Orwell': 2, 'John Steinbeck': 1,
# 'John Williams': 2, 'Philip K Dick': 1, 'Philip Roth': 1,
# 'Ursula K Le Guin': 1})
Of course, such structured data isn't always available. If your series elements are strings with arbitrary data and your list of pre-defined authors is small, you can use pd.Series.str.contains.
L = ['George Orwell', 'John Steinbeck', 'Frank Herbert', 'John Williams']
res = {i: df['Authors'].str.contains(i, regex=False).sum() for i in L}
# {'Frank Herbert': 1, 'George Orwell': 2, 'John Steinbeck': 1, 'John Williams': 2}
This works because pd.Series.str.contains returns a series of Boolean values, which you can then sum, since True is considered equivalent to 1 in most numeric computations in Python / Pandas. We turn off regex to improve performance.
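For example, a minimal illustration (the strings here are made up):

import pandas as pd
s = pd.Series(['George Orwell wrote 1984', 'Dune by Frank Herbert'])
mask = s.str.contains('George Orwell', regex=False)  # [True, False]
mask.sum()  # -> 1, since True counts as 1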
Pandas string-based methods are notoriously slow. You can instead use sum with a generator expression and the in operator for an extra speed-up:
df = pd.concat([df]*100000)
%timeit {i: df['Authors'].str.contains(i, regex=False).sum() for i in L} # 420 ms
%timeit {i: sum(i in x for x in df['Authors'].values) for i in L} # 235 ms
%timeit {i: df['Authors'].map(lambda x: i in x).sum() for i in L} # 424 ms
Notice that, for scenario #1, the Counter methods are actually more expensive because they require splitting as a preliminary step:
chainer = chain.from_iterable
%timeit Counter(chainer([set(i.split(',')) for i in df['Authors'].values])) # 650 ms
%timeit Counter(chainer(df['Authors'].str.split(',').map(set))) # 828 ms
Note that neither the in operator nor str.contains with regex=False differentiates between, for example, John Williams and John Williamson. You may wish to use a specialist package if this kind of differentiation is important to you.
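For instance, a minimal sketch using word boundaries with the standard re module (slower than a plain in check, but 'John Williams' will no longer match inside 'John Williamson'; re.escape guards names that contain regex metacharacters):

import re
pattern = re.compile(r'\b' + re.escape('John Williams') + r'\b')
count = sum(bool(pattern.search(x)) for x in df['Authors'].values)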
Setup
df = pd.DataFrame({'Authors': ['Ursula K Le Guin,Philip K Dick,Frank Herbert,Ursula K Le Guin',
                               'John Williams,Philip Roth,John Williams,George Orwell',
                               'George Orwell,John Steinbeck,George Orwell,John Williams']})
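Finally, at the scale of the original question (45k substrings against 75k rows), any per-author scan multiplies quickly, since each author means another full pass over the column. A multi-pattern matcher such as Aho-Corasick scans each row once for all patterns. Below is a sketch assuming the third-party pyahocorasick package (pip install pyahocorasick) and the df['text'] / authors_list names from the question; note that, like in, this matches raw substrings, so the Williams / Williamson caveat above still applies:

import ahocorasick
from collections import Counter

A = ahocorasick.Automaton()
for author in authors_list:
    A.add_word(author, author)  # key and payload are both the author name
A.make_automaton()

authors_data = Counter()
for text in df['text'].values:
    # A.iter yields (end_index, payload) for every match; the set
    # counts each author at most once per row
    found = {author for _end, author in A.iter(text)}
    authors_data.update(found)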