
Slicing a String after certain key words are mentioned into a list

I am new to Python and I am stuck on a problem. I have a string containing a conversation between two people:

str = "  dylankid: *random words* senpai: *random words* dylankid: *random words* senpai: *random words*"

I want to create two lists from the string, named after the speakers dylankid and senpai:

dylankid = [ ]
senpai = [ ]

This is where I am struggling. In the list dylankid I want to place all the words that come after 'dylankid' in the string but before the next 'dylankid' or 'senpai'. The same goes for the senpai list, so the result would look something like this:

dylankid = ["random words", "random words", "random words"]
senpai = ["random words", "random words", "random words"]    

That is, dylankid contains all the messages from dylankid, and senpai contains all the messages from senpai.

I have looked into slicing it and using split() and re.compile(), but I can't figure out a way to specify where to start slicing and where to stop.

Hopefully it was clear enough, any help would be appreciated :)

Dylan Kilkenny asked Apr 10 '16 13:04

3 Answers

The following code creates a dict whose keys are the persons and whose values are lists of their messages:

from collections import defaultdict
import re

# Verbose pattern: whitespace and comments inside it are ignored with re.VERBOSE
PATTERN = r'''
    \s*                         # Any amount of space
    (dylankid|senpai)           # Capture person
    :\s                         # Colon and single space
    (.*?)                       # Capture everything, non-greedy
    (?=\sdylankid:|\ssenpai:|$) # Until we find following person or end of string
'''
s = "  dylankid: *random words* senpai: *random words* dylankid: *random words* senpai: *random words*"
res = defaultdict(list)
for person, message in re.findall(PATTERN, s, re.VERBOSE):
    res[person].append(message)  # Collect each message under its speaker

print(res['dylankid'])
print(res['senpai'])

It produces the following output:

['*random words*', '*random words*']
['*random words*', '*random words*']
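
If the set of speakers is not fixed, the same pattern can be built from a list of names. The following is only a sketch of that idea: the function name split_by_speaker and its parameters are hypothetical, and it assumes every name is followed by a colon and a single space, as in the question.

```python
import re
from collections import defaultdict


def split_by_speaker(text, names):
    """Group messages by speaker; `names` lists the expected speakers (hypothetical helper)."""
    # Escape each name so characters such as '.' are matched literally.
    alternation = "|".join(re.escape(name) for name in names)
    pattern = re.compile(
        r"\s*({names}):\s(.*?)(?=\s(?:{names}):|$)".format(names=alternation)
    )
    result = defaultdict(list)
    for person, message in pattern.findall(text):
        result[person].append(message)
    return result


s = "  dylankid: *random words* senpai: *random words* dylankid: *random words* senpai: *random words*"
res = split_by_speaker(s, ["dylankid", "senpai"])
print(res["dylankid"])
print(res["senpai"])
```

As with the fixed pattern, a message that itself contains a space followed by a speaker name and a colon would be cut at that point.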
niemmi answered Oct 24 '22 12:10


You can use itertools.groupby: split the string into words and group them using the dict's __contains__ method as the key function.

s = "dylankid: *random words d* senpai: *random words s* dylankid: *random words d*  senpai: *random words s*"
from itertools import groupby

d = {"dylankid:": [], "senpai:":[]}

grps = groupby(s.split(" "), d.__contains__)

for k, v in grps:
    if k:
        d[next(v)].append(" ".join(next(grps)[1]))


{'dylankid:': ['*random words d*', '*random words d*'], 'senpai:': ['*random words s*', '*random words s*']}

Each time groupby yields a group whose key is True, that group holds a name from our dict. We fetch the name with next(v), then pull the next group from grps, which holds the words up to the following name, and use str.join to join them back into a single string.
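
To make the grouping step concrete, here is a small sketch that only lists the runs groupby produces. The shortened input string and the variable names are made up for illustration.

```python
from itertools import groupby

names = {"dylankid:", "senpai:"}
words = "dylankid: hello there senpai: hi".split(" ")

# groupby starts a new group whenever the key changes, so runs of name
# tokens (key True) alternate with runs of ordinary words (key False).
runs = [(is_name, list(group)) for is_name, group in groupby(words, names.__contains__)]

for is_name, group in runs:
    print(is_name, group)
```

Because groupby only starts a new group when the key changes, two name tokens in a row would end up in the same group.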

If a name might have no words after it, you can pass a pair of empty lists as the default value for the next call:

s = "dylankid: *random words d* senpai: *random words s* dylankid: *random words d*  senpai: *random words s* senpai:"
from itertools import groupby

d = {"dylankid:": [], "senpai:":[]}
grps = groupby(s.split(" "), d.__contains__)

for k, v in grps:
    if k:
        d[next(v)].append(" ".join(next(grps,[[], []])[1]))
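
As a rough illustration of what the default does, the loop can be wrapped in a function and run on a short string that ends with a bare name. The function name collect and the sample string are hypothetical.

```python
from itertools import groupby


def collect(text, names):
    """Hypothetical wrapper around the groupby loop above."""
    d = {name + ":": [] for name in names}
    grps = groupby(text.split(" "), d.__contains__)
    for is_name, group in grps:
        if is_name:
            # When no group follows the name, the default [[], []] is
            # returned and its second item joins to an empty string.
            d[next(group)].append(" ".join(next(grps, [[], []])[1]))
    return d


d = collect("dylankid: hi senpai: hello senpai:", ["dylankid", "senpai"])
print(d)
```

The trailing "senpai:" is recorded as an empty message instead of raising StopIteration.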

Some timings on larger strings:

In [15]: dy, sn = "dylankid:", " senpai:"

In [16]: t = " foo " * 1000

In [17]: s = "".join([dy + t + sn + t for _ in range(1000)])

In [18]: %%timeit
   ....: d = {"dylankid:": [], "senpai:": []}
   ....: grps = groupby(s.split(" "), d.__contains__)
   ....: for k, v in grps:
   ....:     if k:
   ....:         d[next(v)].append(" ".join(next(grps, [[], []])[1]))
1 loop, best of 3: 376 ms per loop

In [19]: %%timeit
   ....: PATTERN = '''
   ....:     \s*                         # Any amount of space
   ....:     (dylankid|senpai)           # Capture person
   ....:     :\s                         # Colon and single space
   ....:     (.*?)                       # Capture everything, non-greedy
   ....:     (?=\sdylankid:|\ssenpai:|$) # Until we find following person or end of string
   ....: '''
   ....: res = defaultdict(list)
   ....: for person, message in re.findall(PATTERN, s, re.VERBOSE):
   ....:     res[person].append(message)
1 loop, best of 3: 753 ms per loop

Both return the same output:

In [20]: d = {"dylankid:": [], "senpai:": []}

In [21]: grps = groupby(s.split(" "), d.__contains__)

In [22]: for k, v in grps:
   ....:     if k:
   ....:         d[next(v)].append(" ".join(next(grps, [[], []])[1]))

In [23]: PATTERN = '''
   ....:     \s*                         # Any amount of space
   ....:     (dylankid|senpai)           # Capture person
   ....:     :\s                         # Colon and single space
   ....:     (.*?)                       # Capture everything, non-greedy
   ....:     (?=\sdylankid:|\ssenpai:|$) # Until we find following person or end of string
   ....: '''

In [24]: res = defaultdict(list)

In [25]: for person, message in re.findall(PATTERN, s, re.VERBOSE):
   ....:         res[person].append(message)

In [26]: d["dylankid:"] == res["dylankid"]
Out[26]: True

In [27]: d["senpai:"] == res["senpai"]
Out[27]: True
Padraic Cunningham answered Oct 24 '22 12:10

This can be tightened up, but it should be easy to extend to more user names.

from collections import defaultdict

# Input string
all_messages = "  dylankid: *random words* senpai: *random words* dylankid: *random words* senpai: *random words*"

# Expected users
users = ['dylankid', 'senpai']

starts = {'{}:'.format(x) for x in users}
D = defaultdict(list)
results = defaultdict(list)

# Read through the words in the input string, collecting the ones that follow a user name
current_user = None
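for word in all_messages.split(' '):
    if word in starts:
        current_user = word[:-1]
        D[current_user].append([])  # Start a new message for this user
    elif current_user:
        D[current_user][-1].append(word)  # Add the word to the user's latest message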

# Join the collected words into messages
for user, all_parts in D.items():
    for part in all_parts:
        results[user].append(' '.join(part))

The results are:

defaultdict(<class 'list'>,
            {'senpai': ['*random words*', '*random words*'],
             'dylankid': ['*random words*', '*random words*']})
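
To show how this might extend to more user names, here is a sketch that wraps the same word-by-word loop in a function. The name messages_by_user and the three sample users are made up, and it uses split() without an argument, which, unlike split(' ') above, drops empty strings caused by repeated spaces.

```python
from collections import defaultdict


def messages_by_user(text, users):
    """Hypothetical helper: collect each user's messages from `text`."""
    starts = {"{}:".format(user) for user in users}
    parts = defaultdict(list)
    current_user = None
    for word in text.split():
        if word in starts:
            current_user = word[:-1]
            parts[current_user].append([])  # A new message begins
        elif current_user:
            parts[current_user][-1].append(word)
    # Join the collected words of every message back into a string.
    return {user: [" ".join(p) for p in msgs] for user, msgs in parts.items()}


chat = "alice: hi all bob: hello carol: good morning alice: bye"
out = messages_by_user(chat, ["alice", "bob", "carol"])
print(out)
```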
bbayles answered Oct 24 '22 13:10
