I am new to Python and I am stuck on a problem. I have a string containing a conversation between two people:
str = " dylankid: *random words* senpai: *random words* dylankid: *random words* senpai: *random words*"
I want to create 2 lists from the string using dylankid and senpai as names :
dylankid = [ ]
senpai = [ ]
Here is where I am struggling: inside the list dylankid I want to place all the words that come after 'dylankid' in the string but before the next 'dylankid' or 'senpai'. The same goes for the senpai list, so it would look something like this:
dylankid = ["random words", "random words", "random words"]
senpai = ["random words", "random words", "random words"]
with dylankid containing all the messages from dylankid, and senpai containing all the messages from senpai.
I have looked into slicing it and using split() and re.compile(), but I can't figure out a way to specify where to start slicing and where to stop.
Hopefully it was clear enough, any help would be appreciated :)
The following code creates a dict whose keys are the people and whose values are lists of their messages:
from collections import defaultdict
import re
# Raw string, so that \s reaches the regex engine unchanged
PATTERN = r'''
\s* # Any amount of space
(dylankid|senpai) # Capture person
:\s # Colon and single space
(.*?) # Capture everything, non-greedy
(?=\sdylankid:|\ssenpai:|$) # Until we find following person or end of string
'''
s = " dylankid: *random words* senpai: *random words* dylankid: *random words* senpai: *random words*"
res = defaultdict(list)
for person, message in re.findall(PATTERN, s, re.VERBOSE):
    res[person].append(message)
print(res['dylankid'])
print(res['senpai'])
It produces the following output:
['*random words*', '*random words*']
['*random words*', '*random words*']
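If the names are not known in advance, the same idea can be sketched with the pattern built from a list. This is only an illustration: the helper name split_by_speaker and its names parameter are assumptions, not part of the answer above, and the compact pattern is meant to match the verbose one with the comments removed.

```python
import re
from collections import defaultdict


def split_by_speaker(text, names):
    """Group messages by speaker; `names` lists the expected speaker names.

    Hypothetical helper, written only to illustrate building the pattern.
    """
    # Escape each name so characters such as '.' are matched literally.
    alternatives = "|".join(re.escape(name) for name in names)
    # Same structure as the verbose pattern: name, colon, space, lazy message,
    # then a lookahead for the next " name:" or the end of the string.
    pattern = re.compile(
        r"\s*({0}):\s(.*?)(?=\s(?:{0}):|$)".format(alternatives)
    )
    messages = defaultdict(list)
    for person, message in pattern.findall(text):
        messages[person].append(message)
    return messages


s = " dylankid: *random words* senpai: *random words* dylankid: *random words* senpai: *random words*"
result = split_by_speaker(s, ["dylankid", "senpai"])
print(result["dylankid"])
print(result["senpai"])
```

Adding a third speaker then only means adding a name to the list rather than editing the pattern by hand.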
You can also use itertools.groupby: split the string into words and group them using the dict's __contains__ method as the key:
s = "dylankid: *random words d* senpai: *random words s* dylankid: *random words d* senpai: *random words s*"
from itertools import groupby
d = {"dylankid:": [], "senpai:":[]}
grps = groupby(s.split(" "), d.__contains__)
for k, v in grps:
    if k:
        d[next(v)].append(" ".join(next(grps)[1]))
print(d)
Output:
{'dylankid:': ['*random words d*', '*random words d*'], 'senpai:': ['*random words s*', '*random words s*']}
Each time we hit a name that is a key in our dict, we fetch that name with next(v), then take the next group of words up to the following name, using str.join to join them back into a single string.
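To see why this works, it may help to print what groupby actually yields. The short sketch below uses a shortened input and a plain set called names (both assumptions made for illustration) and materialises each group as a list:

```python
from itertools import groupby

s = "dylankid: *random words d* senpai: *random words s*"
names = {"dylankid:", "senpai:"}

# Consecutive words with the same key (is it a name or not?) end up in one group.
groups = [
    (is_name, list(tokens))
    for is_name, tokens in groupby(s.split(" "), names.__contains__)
]
for is_name, tokens in groups:
    print(is_name, tokens)
# True ['dylankid:']
# False ['*random', 'words', 'd*']
# True ['senpai:']
# False ['*random', 'words', 's*']
```

The groups alternate between a name (key True) and the words that follow it (key False), which is why the loop can pull the message with one extra next call on the same iterator.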
If the string might end with a name that has no words after it, you can pass empty lists as the default value for the next call:
s = "dylankid: *random words d* senpai: *random words s* dylankid: *random words d* senpai: *random words s* senpai:"
from itertools import groupby
d = {"dylankid:": [], "senpai:":[]}
grps = groupby(s.split(" "), d.__contains__)
for k, v in grps:
    if k:
        d[next(v)].append(" ".join(next(grps,[[], []])[1]))
print(d)
Some timings on larger strings:
In [15]: dy, sn = "dylankid:", " senpai:"
In [16]: t = " foo " * 1000
In [17]: s = "".join([dy + t + sn + t for _ in range(1000)])
In [18]: %%timeit
....: d = {"dylankid:": [], "senpai:": []}
....: grps = groupby(s.split(" "), d.__contains__)
....: for k, v in grps:
....:     if k:
....:         d[next(v)].append(" ".join(next(grps, [[], []])[1]))
....:
1 loop, best of 3: 376 ms per loop
In [19]: %%timeit
....: PATTERN = '''
....: \s* # Any amount of space
....: (dylankid|senpai) # Capture person
....: :\s # Colon and single space
....: (.*?) # Capture everything, non-greedy
....: (?=\sdylankid:|\ssenpai:|$) # Until we find following person or end of string
....: '''
....: res = defaultdict(list)
....: for person, message in re.findall(PATTERN, s, re.VERBOSE):
....:     res[person].append(message)
....:
1 loop, best of 3: 753 ms per loop
Both return the same output:
In [20]: d = {"dylankid:": [], "senpai:": []}
In [21]: grps = groupby(s.split(" "), d.__contains__)
In [22]: for k, v in grps:
....:     if k:
....:         d[next(v)].append(" ".join(next(grps, [[], []])[1]))
....:
In [23]: PATTERN = '''
....: \s* # Any amount of space
....: (dylankid|senpai) # Capture person
....: :\s # Colon and single space
....: (.*?) # Capture everything, non-greedy
....: (?=\sdylankid:|\ssenpai:|$) # Until we find following person or end of string
....: '''
In [24]: res = defaultdict(list)
In [25]: for person, message in re.findall(PATTERN, s, re.VERBOSE):
....:     res[person].append(message)
....:
In [26]: d["dylankid:"] == res["dylankid"]
Out[26]: True
In [27]: d["senpai:"] == res["senpai"]
Out[27]: True
This can be tightened up, but it should be easy to extend to more user names.
from collections import defaultdict
# Input string
all_messages = " dylankid: *random words* senpai: *random words* dylankid: *random words* senpai: *random words*"
# Expected users
users = ['dylankid', 'senpai']
starts = {'{}:'.format(x) for x in users}
D = defaultdict(list)
results = defaultdict(list)
# Read through the words in the input string, collecting the ones that follow a user name
current_user = None
for word in all_messages.split(' '):
    if word in starts:
        current_user = word[:-1]
        D[current_user].append([])
    elif current_user:
        D[current_user][-1].append(word)
# Join the collected words into messages
for user, all_parts in D.items():
    for part in all_parts:
        results[user].append(' '.join(part))
The results are:
defaultdict(
<class 'list'>,
{'senpai': ['*random words*', '*random words*'],
'dylankid': ['*random words*', '*random words*']}
)
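One possible way to tighten this up is to wrap the loop in a function and join each message as soon as the next speaker appears. This is only a sketch: the function name collect_messages, the third user kohai and the sample text are made up for illustration, and it splits on any whitespace with split() rather than on single spaces.

```python
from collections import defaultdict


def collect_messages(text, users):
    """Return a dict mapping each user in `users` to a list of their messages."""
    starts = {"{}:".format(user) for user in users}
    results = defaultdict(list)
    current_user = None
    words = []
    for word in text.split():
        if word in starts:
            # A new speaker begins, so store the message collected so far.
            if current_user is not None:
                results[current_user].append(" ".join(words))
            current_user = word[:-1]
            words = []
        elif current_user is not None:
            words.append(word)
    # Store the final message, which has no following speaker to trigger it.
    if current_user is not None:
        results[current_user].append(" ".join(words))
    return dict(results)


# Hypothetical conversation with a third user to show the extension.
text = "dylankid: hello there senpai: hi dylankid: how are you? kohai: may I join?"
print(collect_messages(text, ["dylankid", "senpai", "kohai"]))
```

Supporting another speaker only requires adding the name to the users list.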