I have the following:
import pandas as pd
import numpy as np
documents = [['Human', 'machine', 'interface'],
['A', 'survey', 'of', 'user'],
['The', 'EPS', 'user'],
['System', 'and', 'human'],
['Relation', 'of', 'user'],
['The', 'generation'],
['The', 'intersection'],
['Graph', 'minors'],
['Graph', 'minors', 'a']]
df = pd.DataFrame({'date': np.array(['2014-05-01', '2014-05-02', '2014-05-10', '2014-05-10', '2014-05-15', '2014-05-15', '2014-05-20', '2014-05-20', '2014-05-20'], dtype=np.datetime64), 'text': documents})
There are only 5 unique days. I would like to group by day to end up with the following:
documents2 = [['Human', 'machine', 'interface'],
['A', 'survey', 'of', 'user'],
['The', 'EPS', 'user', 'System', 'and', 'human'],
['Relation', 'of', 'user', 'The', 'generation'],
['The', 'intersection', 'Graph', 'minors', 'Graph', 'minors', 'a']]
df2 = pd.DataFrame({'date': np.array(['2014-05-01', '2014-05-02', '2014-05-10', '2014-05-15', '2014-05-20'], dtype=np.datetime64), 'text': documents2})
IIUC, you can aggregate
by sum
df.groupby('date').text.sum() # or .agg(sum)
date
2014-05-01 [Human, machine, interface]
2014-05-02 [A, survey, of, user]
2014-05-10 [The, EPS, user, System, and, human]
2014-05-15 [Relation, of, user, The, generation]
2014-05-20 [The, intersection, Graph, minors, Graph, mino...
Name: text, dtype: object
Or flatten your list using list comprehension, which yields same time complexity as chain.from_iterable
but has no dependency on one more external library
df.groupby('date').text.agg(lambda x: [item for z in x for item in z])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With