I wrote code that takes text tokens as input:
tokens = ["Tap-", "Berlin", "Was-ISt", "das", "-ist", "cool", "oh", "Man", "-Hum", "-Zuh-UH-", "glit"]
The code should find all tokens that contain a hyphen or are connected to a neighbouring token by a hyphen. The output should be:
[["Tap-", "Berlin"], ["Was-ISt"], ["das", "-ist"], ["Man", "-Hum", "-Zuh-UH-", "glit"]]
I wrote the following code, but somehow I'm not getting back the tokens that are connected by hyphens:
def find_hyphens(self):
    tokens_with_hypens =[]
    for i in range(len(self.tokens)):
        hyp_leng = 0
        while self.hypen_between_two_tokens(i + hyp_leng):
            hyp_leng += 1
        if self.has_hypen_in_middle(i) or hyp_leng > 0:
            if hyp_leng == 0:
                tokens_with_hypens.append(self.tokens[i:i + 1])
            else:
                tokens_with_hypens.append(self.tokens[i:i + hyp_leng])
                i += hyp_leng - 1
    return tokens_with_hypens
What am I doing wrong? Is there a more performant solution? Thanks.
I found 3 mistakes in your code:
You are comparing the last 2 characters of tok1 here, rather than the last character of tok1 and the first character of tok2:
 if "-" in joined[len(tok1) - 2: len(tok1)]:
 # instead, do this:
 if "-" in joined[len(tok1) - 1: len(tok1) + 1]:
You are omitting the last matching token here. Increase the end index of your slice by 1:
 tokens_with_hypens.append(self.tokens[i:i + hyp_leng])
 # instead, do this:
 tokens_with_hypens.append(self.tokens[i:i + 1 + hyp_leng])
You cannot manipulate the index of a for i in range loop in Python. The next iteration just retrieves the next value from the range, overwriting your change. Instead, you could use a while loop like this:
 i = 0
 while i < len(self.tokens):
     [...]
     i += 1
With these 3 corrections, your test passes.
Nonetheless, I couldn't resist writing an algorithm from scratch that solves your problem as simply as possible:
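Each of the three points can be checked in isolation. This is a minimal sketch, assuming joined is simply tok1 + tok2 (the helper that builds it is not shown in the question); the sample values are only illustrative.

```python
# Assumption: `joined` is the two tokens concatenated. The question's helper
# is not shown, so this is a stand-in for illustration only.
tok1, tok2 = "das", "-ist"
joined = tok1 + tok2

# 1) Slice boundaries: which two characters are actually inspected?
last_two_of_tok1 = joined[len(tok1) - 2: len(tok1)]      # both inside tok1
boundary_chars = joined[len(tok1) - 1: len(tok1) + 1]    # end of tok1 + start of tok2

# 2) Slice ends are exclusive: hyp_leng joints connect hyp_leng + 1 tokens.
tokens = ["das", "-ist", "cool"]
start, hyp_leng = 0, 1
too_short = tokens[start:start + hyp_leng]
whole_group = tokens[start:start + 1 + hyp_leng]

# 3) Rebinding the loop variable of a for loop does not skip iterations.
visited_for = []
for i in range(5):
    visited_for.append(i)
    i += 2  # overwritten by the next value from range()

# In a while loop the same change persists.
visited_while = []
i = 0
while i < 5:
    visited_while.append(i)
    i += 2  # skip two extra positions
    i += 1

print(last_two_of_tok1, boundary_chars)
print(too_short, whole_group)
print(visited_for, visited_while)
```

The for loop visits every index regardless of the i += 2, while the while loop really does jump ahead.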
def get_hyphen_groups(tokens):
    i_start, i_end = 0, 1
    while i_start < len(tokens):
        while (i_end < len(tokens) and
              (tokens[i_end].startswith("-") ^ tokens[i_end - 1].endswith("-"))):
            i_end += 1
        yield tokens[i_start:i_end]
        i_start, i_end = i_end, i_end + 1
    
    
tokens = ["Tap-", "Berlin", "Was-ISt", "das", "-ist", "cool", "oh", "Man", "-Hum", "-Zuh-UH-", "glit"]
for group in get_hyphen_groups(tokens):
    print("".join(group))
To exclude 1-element groups, as in your expected result, wrap the yield in this if:
if i_end - i_start > 1:
    yield tokens[i_start:i_end]
To include 1-element groups that already contain a hyphen, change that if to, for example:
if i_end - i_start > 1 or "-" in tokens[i_start]:
    yield tokens[i_start:i_end]
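Putting the generator and this last condition together gives the following sketch, which reproduces the expected output from the question for the sample tokens. It is the same algorithm as above, with only the if added around the yield.

```python
def get_hyphen_groups(tokens):
    """Yield runs of hyphen-connected tokens, plus single tokens containing a hyphen."""
    i_start, i_end = 0, 1
    while i_start < len(tokens):
        # Extend the group while exactly one side of the joint carries the hyphen.
        while (i_end < len(tokens) and
               (tokens[i_end].startswith("-") ^ tokens[i_end - 1].endswith("-"))):
            i_end += 1
        # Keep multi-token groups and single tokens that contain a hyphen.
        if i_end - i_start > 1 or "-" in tokens[i_start]:
            yield tokens[i_start:i_end]
        i_start, i_end = i_end, i_end + 1


tokens = ["Tap-", "Berlin", "Was-ISt", "das", "-ist", "cool", "oh",
          "Man", "-Hum", "-Zuh-UH-", "glit"]
groups = list(get_hyphen_groups(tokens))
print(groups)
```

Note that the condition uses ^ (exclusive or), so a token ending in a hyphen followed by one starting with a hyphen, such as "a-" and "-b", is not joined. If you want those joined as well, replacing ^ with or should do it.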