I have a DataFrame in pandas with a column, df.Strings, that holds strings of text. I would like to get the individual words of those strings on their own rows, with identical values for the other columns. For example, if I have 3 strings (and an unrelated column, Time):
    Strings  Time
0   The dog   4Pm
1  lazy dog   2Pm
2   The fox   1Pm
I want new rows containing the words from each string, but with otherwise identical columns:
Strings --- Words ---Time
"The dog" --- "The" --- 4Pm
"The dog" --- "dog" --- 4Pm
"lazy dog"--- "lazy"--- 2Pm
"lazy dog"--- "dog" --- 2Pm
"The fox" --- "The" --- 1Pm
"The fox" --- "fox" --- 1Pm
I know how to split the words up from the strings:
import re
string_list = '\n'.join(df.Strings.map(str))
word_list = re.findall('[a-z]+', string_list)
But how can I get these into the dataframe while preserving the index & other variables? I'm using Python 2.7 and pandas 0.10.1.
EDIT: I now understand how to expand rows using groupby found in this question:
def f(group):
    row = group.irow(0)
    return DataFrame({'words': re.findall('[a-z]+', row['Strings'])})
df.groupby('class', group_keys=False).apply(f)
I would still like to preserve the other columns. Is this possible?
Here is my code that doesn't use groupby(); I think it's faster.
import pandas as pd
import numpy as np
import itertools
df = pd.DataFrame({
    "strings": ["the dog", "lazy dog", "The fox jump"],
    "value": ["a", "b", "c"]})
w = df.strings.str.split()
c = w.map(len)
idx = np.repeat(c.index, c.values)
#words = np.concatenate(w.values)
words = list(itertools.chain.from_iterable(w.values))
s = pd.Series(words, index=idx)
s.name = "words"
print(df.join(s))
The result:
        strings value words
0       the dog     a   the
0       the dog     a   dog
1      lazy dog     b  lazy
1      lazy dog     b   dog
2  The fox jump     c   The
2  The fox jump     c   fox
2  The fox jump     c  jump
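For readers on a newer pandas than the 0.10.1 mentioned in the question, the same result can probably be had more directly. This is a minimal sketch, assuming pandas 0.25 or later (where DataFrame.explode is available) and reusing the example frame from the answer above; the name `result` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "strings": ["the dog", "lazy dog", "The fox jump"],
    "value": ["a", "b", "c"],
})

# Split each string on whitespace into a list of words, store the lists in a
# new "words" column, then give every list element its own row.
# explode() repeats the index label and the other columns for each word.
result = df.assign(words=df["strings"].str.split()).explode("words")
print(result)
```

Like the join-based version, this keeps the original index labels repeated once per word; calling reset_index(drop=True) on the result would give a fresh 0..n-1 index if that is preferred.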