I'm doing some natural language processing, and I have a MultiIndexed DataFrame that looks something like this (except there are actually about 3,000 rows):
                                                                 Title                                            N-grams
Period  Date
2015-01 2015-01-01 22:00:10  SIRF: Simultaneous Image Registration and Fusi...  [@SENTBEGIN paper, paper propose, propose nove...
        2015-01-02 16:54:13  Generic construction of scale-invariantly coar...  [@SENTBEGIN encode, encode temporal, temporal ...
        2015-01-04 00:07:00  Understanding Trajectory Behavior: A Motion Pa...  [@SENTBEGIN mining, mining underlie, underlie ...
        2015-01-04 09:07:45  Hostile Intent Identification by Movement Patt...  [@SENTBEGIN the, the recent, recent year, year...
        2015-01-04 14:35:58  A New Method for Signal and Image Analysis: Th...  [@SENTBEGIN brief, brief review, review provid...
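For anyone who wants to reproduce the structure, here's a trimmed-down version of the frame (titles and n-gram lists abbreviated; the real one has about 3,000 rows):

import pandas as pd
from pandas import DataFrame, Series

index = pd.MultiIndex.from_tuples(
    [("2015-01", pd.Timestamp("2015-01-01 22:00:10")),
     ("2015-01", pd.Timestamp("2015-01-02 16:54:13"))],
    names = ["Period", "Date"])
ngrams = DataFrame(
    {"Title": ["SIRF: Simultaneous Image Registration and Fusi...",
               "Generic construction of scale-invariantly coar..."],
     "N-grams": [["@SENTBEGIN paper", "paper propose"],
                 ["@SENTBEGIN encode", "encode temporal"]]},
    index = index)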
What I want to do is to count how many times each n-gram appears in each month (hence the first index, "Period"). Doing that is rather straightforward, if time-consuming (and because each cell in the "N-grams" column is a list, I'm not sure much could be done to speed it up). I create a new DataFrame to hold the counts, using this code:
# Create the frequencies DataFrame.
period_index = ngrams.index.unique(level = "Period")
freqs = DataFrame(index = period_index)

# Count the n-grams in each period.
for period in period_index:
    for ngrams_list in ngrams.loc[period, "N-grams"]:
        for ngram in ngrams_list:
            if ngram not in freqs.columns:
                freqs[ngram] = 0
            freqs.loc[period, ngram] += 1
The logic is pretty simple: if the n-gram in question has been seen already (that is, there's a column for it in "freqs"), increment its count by 1. If it hasn't been seen, create a new column of 0's for that n-gram, then increment as normal. In the vast majority of cases this works fine, but for a tiny fraction of n-grams I get this error when the loop hits the increment line:
KeyError: u'the label [7 85.40] is not in the [index]'
(Sorry for the lack of a proper stack trace--I'm doing this in a Zeppelin notebook, and Zeppelin doesn't show one.)
A little more debugging showed that, in these cases, the creation of the new column fails silently (that is, it doesn't work, but it doesn't return an exception, either).
It might be worth noting that in an earlier version of the code, I was using "loc" to assign directly to a cell in a newly created column, rather than creating the column first, like this:
if ngram not in freqs.columns:
    freqs.loc[period, ngram] = 1
I changed this because it caused problems by assigning NaN's for that n-gram to all the other periods, but the direct assignment choked on exactly the same n-grams as with the new code.
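To illustrate that NaN problem (made-up periods and a hypothetical n-gram):

freqs = DataFrame(index = ["2015-01", "2015-02", "2015-03"])
freqs.loc["2015-01", "some ngram"] = 1  # setting with enlargement creates the column
print(freqs)
#          some ngram
# 2015-01         1.0
# 2015-02         NaN
# 2015-03         NaN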
By wrapping the increment line in a try/except block, I've discovered that the error is extremely rare: it occurs for about 20 out of a total of over 100,000 n-grams in the corpus (a sketch of that instrumentation follows the list below). Here are some examples:
"7 85.40"
"2014 july"
"2010 3.4"
"and 77"
"1997 and"
"and 2014"
"6 2008"
"879 --"
"-- 894"
"2003 -"
"- 2014"
Most of the 20 include digits, but at least one is entirely letters (two words separated by a space; it's not in the list above because I re-ran the script while typing up this question and didn't get all the way to that point), and plenty of digits-only n-grams cause no problems. Most of the problematic ones involve years, which, on the face of it, might suggest some sort of confusion with the DataFrame's DatetimeIndex (given that a DatetimeIndex accepts partial matches), but that wouldn't explain the non-dates, especially the ones beginning with letters.
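(For reference, this is the partial matching I mean; a quick sketch with made-up dates:)

s = Series([1, 2, 3],
           index = pd.to_datetime(["2015-01-01", "2015-01-15", "2015-02-01"]))
print(s["2015-01"])  # partial string indexing selects both January rows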
Despite the unlikelihood of the DatetimeIndex conflict, I tried a different method of creating each new column (as suggested by an answer to Adding new column to existing DataFrame in Python pandas), using "loc" to avoid any confusion between rows and columns:
freqs.loc[:, ngram] = Series(0, index = freqs.index)
...but that meets with exactly the same fate as my original code that created each new column implicitly by assigning to a non-existent column:
KeyError: u'7 85.40'
Next, I tried the DataFrame.assign method (suggested in the same answer cited above, though I needed to add a workaround suggested by an answer to pandas assign with new column name as string):
kwarg = {ngram: 0}
freqs = freqs.assign(**kwarg)
Alas, that produces exactly the same error.
Does anyone have any insights on why this might be happening? Given the rarity, I suppose I could just ignore the problematic n-grams, but it would be good to understand what's going on.
A nested for loop is neither recommended nor required. You can count the values in each row's list, aggregate the counts with groupby + sum, and join the result back to your original dataframe; alternatively, MultiLabelBinarizer from the sklearn.preprocessing library provides one-hot encoding for a presence-based variant of the same idea.
Here's a demonstration. First, build a small example frame analogous to yours:

import pandas as pd

df = pd.DataFrame([[1,1,['a','a','c']], [1,2,['b','c','d']], [1,3,['a','b','e']],
                   [2,1,['a','e','g']], [2,2,['b','d','d']], [2,3,['e','b','c']]],
                  columns=['L1', 'L2', 'values'])
df = df.set_index(['L1', 'L2'])

Then count the values in each row, sum the counts within each L1 group, and join:

row_counts = df['values'].apply(pd.Series.value_counts).fillna(0).astype(int)

# alternative if the above does not work
row_counts = df['values'].apply(lambda x: pd.Series(x).value_counts(sort=False))\
                         .fillna(0).astype(int)

row_counts_grouped = row_counts.groupby(level='L1').sum()
df = df.join(row_counts_grouped, how='inner')
print(df)
          values  a  b  c  d  e  g
L1 L2
1  1   [a, a, c]  3  2  2  1  1  0
   2   [b, c, d]  3  2  2  1  1  0
   3   [a, b, e]  3  2  2  1  1  0
2  1   [a, e, g]  1  2  1  2  2  1
   2   [b, d, d]  1  2  1  2  2  1
   3   [e, b, c]  1  2  1  2  2  1
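Mapped onto the question's frame, the same idea would look something like this (a sketch, assuming ngrams is indexed by Period/Date with lists in the 'N-grams' column; note that the intermediate frame gets one column per distinct n-gram, so it will be very wide):

row_counts = ngrams['N-grams'].apply(lambda x: pd.Series(x).value_counts())\
                              .fillna(0).astype(int)
freqs = row_counts.groupby(level='Period').sum()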
The approach above counts duplicate values within a row. If you don't want duplicates on a row to be considered, use MultiLabelBinarizer instead, which records only the presence of each value per row:
from sklearn.preprocessing import MultiLabelBinarizer

# Rebuild the example frame (the join above modified df).
df = pd.DataFrame([[1,1,['a','a','c']], [1,2,['b','c','d']], [1,3,['a','b','e']],
                   [2,1,['a','e','g']], [2,2,['b','d','d']], [2,3,['e','b','c']]],
                  columns=['L1', 'L2', 'values'])
df = df.set_index(['L1', 'L2'])

mlb = MultiLabelBinarizer()
onehot = pd.DataFrame(mlb.fit_transform(df['values']),
                      columns=mlb.classes_,
                      index=df.index.get_level_values('L1'))

onehot_grouped = onehot.groupby(level='L1').sum()
df = df.join(onehot_grouped, how='inner')
print(df)
          values  a  b  c  d  e  g
L1 L2
1  1   [a, a, c]  2  2  2  1  1  0
   2   [b, c, d]  2  2  2  1  1  0
   3   [a, b, e]  2  2  2  1  1  0
2  1   [a, e, g]  1  2  1  1  2  1
   2   [b, d, d]  1  2  1  1  2  1
   3   [e, b, c]  1  2  1  1  2  1
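And the MultiLabelBinarizer variant applied to the question's data (same assumptions as the earlier sketch; remember this counts each n-gram at most once per paper):

mlb = MultiLabelBinarizer()
onehot = pd.DataFrame(mlb.fit_transform(ngrams['N-grams']),
                      columns=mlb.classes_,
                      index=ngrams.index.get_level_values('Period'))
freqs = onehot.groupby(level='Period').sum()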