Regex using increasing sequence of numbers Python

Say I have a string:

teststring =  "1.3 Hello how are you 1.4 I am fine, thanks 1.2 Hi There 1.5 Great!" 

I would like to split it into this list:

testlist = ["1.3 Hello how are you", "1.4 I am fine, thanks 1.2 Hi There", "1.5 Great!"]

Basically, I want to split only at numbers that continue the increasing sequence in steps of .1 (e.g. 1.3 to 1.4), so the out-of-sequence 1.2 stays inside its segment.

Is there a way to split this with regex but only capturing increasing sequential numbers? I wrote Python code that iterates through the sequence and builds a custom re.compile() pattern for each number; it works, but it is extremely unwieldy.

Something like this (where parts1_temp is a given list of the x.x numbers in the string):

import re

parts1_temp = ['1.3','1.4','1.2','1.5']
# Start the sequence from the first number in the list
major, minor = parts1_temp[0].split('.')
parts_num = range(int(minor), int(minor) + 30)
parts_search = ['.'.join([major, str(parts_num_el)]) for parts_num_el in parts_num]
# parts_search should be ['1.3','1.4','1.5',...,'1.32']

parts_fin = []
for k in range(len(parts_search)-1):
    # Lazily match from one number up to (not including) the next number in the sequence
    rxtemp = re.compile(r"(?:" + re.escape(parts_search[k]) + r")([\s\S]*?)(?=(?:" + re.escape(parts_search[k+1]) + r"))", re.MULTILINE)
    parts_fin.extend(match.group(0) for match in rxtemp.finditer(teststring))

But man, is it ugly. Is there a way to do this more directly in regex? I imagine this is a feature that someone would have wanted at some point with regex, but I can't find any ideas on how to tackle it (and maybe it is not possible with pure regex).

sfortney asked Feb 16 '18 22:02


3 Answers

Doing this with a regex only seems overly complex. What about this processing:

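import re

teststring = "1.3 Hello how are you 1.4 I am fine, thanks 1.2 Hi There 1.5 Great!"
res = []
expected = None
# Tokenize into numbers (with optional decimals) and runs of non-digits
for s in re.findall(r'\d+(?:\.\d+)?|\D+', teststring):
    if s[0].isdigit() and expected is None:
        # First number: it starts the sequence and fixes the format and step
        expected = s
        # Format string with as many decimals as the first number has
        fmt = '{0:.' + str(max(0, len(s) - (s+'.').find('.') - 1)) + 'f}'
        # Step is 1 in the last decimal place, e.g. '1.3' -> 0.1
        inc = float(re.sub(r'\d', '0', s)[0:-1] + '1')
    if s == expected:
        # Next number in the sequence: start a new item
        res.append(s)
        expected = fmt.format(float(s) + inc)
    elif expected:
        # Anything else is appended to the current item
        res[-1] = res[-1] + s

print(res)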

This also works if the numbers happen to have 2 decimals or more, or none.
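
To check that claim, here is a minimal sketch that wraps the same logic in a function and runs it on inputs with two decimals and with no decimals. The function name split_on_sequence and the sample strings are made up for illustration, and the step is computed as a power of ten rather than with re.sub, which should be equivalent for plain decimal numbers.

```python
import re

def split_on_sequence(text):
    """Split text at numbers that continue the sequence started by the first number."""
    res = []
    expected = None
    for s in re.findall(r'\d+(?:\.\d+)?|\D+', text):
        if s[0].isdigit() and expected is None:
            expected = s
            # Number of digits after the decimal point (0 for integers)
            decimals = len(s) - s.find('.') - 1 if '.' in s else 0
            fmt = '{0:.' + str(decimals) + 'f}'
            # Step of 1 in the last decimal place, e.g. 0.01 for '1.01'
            inc = 10 ** -decimals
        if s == expected:
            res.append(s)
            expected = fmt.format(float(s) + inc)
        elif expected:
            res[-1] += s
    return res

print(split_on_sequence("1.01 a 1.02 b 1.05 c 1.03 d"))
print(split_on_sequence("3 a 4 b 6 c 5 d"))
```

Like the code above, this keeps the whitespace that precedes each split point, so apply strip() to the items if that matters.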

trincot answered Oct 08 '22 18:10


You can also rewrite the string so that a marker is placed in front of each number that is part of the increasing sequence. Then, you can split at that marker:

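import re
teststring = "1.3 Hello how are you 1.4 I am fine, thanks 1.2 Hi There 1.5 Great!"
numbers = re.findall(r'[\.\d]+', teststring)
# Prefix '*' to every number that is not smaller than the number before it
# (the numbers are compared as strings here)
marked = [numbers[0]] + [numbers[i] if numbers[i] < numbers[i-1] else '*'+numbers[i] for i in range(1, len(numbers))]
# Put the (possibly marked) numbers back in place, then split at the marker
final_string = re.sub(r'[\.\d]+', '{}', teststring).format(*marked).split(' *')
print(final_string)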

Output:

['1.3 Hello how are you', '1.4 I am fine, thanks 1.2 Hi There', '1.5 Great!']
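
One caveat: the comparison above is between strings, so '10.1' sorts before '9.5'. A hedged sketch of the same marker idea with a numeric comparison follows. The function name split_with_marker and the NUL marker are my own choices, not part of the answer, and a re.sub callback replaces str.format so that braces or stray periods in the text are not a problem.

```python
import re

def split_with_marker(text, marker="\x00"):
    """Mark each number that is not smaller than the number before it, then split."""
    prev = None

    def mark(m):
        nonlocal prev
        cur = float(m.group())
        # The first number is never marked; later ones are marked unless they decrease
        out = m.group() if prev is None or cur < prev else marker + m.group()
        prev = cur
        return out

    marked = re.sub(r"\d+(?:\.\d+)?", mark, text)
    return [part.strip() for part in marked.split(marker)]

print(split_with_marker("1.3 Hello how are you 1.4 I am fine, thanks 1.2 Hi There 1.5 Great!"))
print(split_with_marker("9.5 a 10.1 b"))
```

As in the original, each number is compared with the number immediately before it, not with the last number that was split on.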

Ajax1234 answered Oct 08 '22 17:10


This method uses finditer to find every match of \d+\.\d+, then tests whether each match is numerically greater than the last number that was accepted. If it is, the start index of the match is appended to the indices list.

The last line uses a list comprehension (taken from another answer) to split the string at those indices.

Original Method

This method ensures the previous match is smaller than the current one. It does not check that the numbers are sequential; it only compares their size. So if a string has the numbers 1.1, 1.2, 1.4, it would split on each occurrence, since each number is larger than the last.

import re

indices = []
string =  "1.3 Hello how are you 1.4 I am fine, thanks 1.2 Hi There 1.5 Great!"
regex = re.compile(r"\d+\.\d+")
lastFloat = 0

for m in regex.finditer(string):
    x = float(m.group())
    if lastFloat < x:
        lastFloat = x
        indices.append(m.start(0))

print([string[i:j] for i,j in zip(indices, indices[1:]+[None])])

Outputs: ['1.3 Hello how are you ', '1.4 I am fine, thanks 1.2 Hi There ', '1.5 Great!']


Edit

Sequential Method

This method is very similar to the original. However, in the case of 1.1, 1.2, 1.4, it wouldn't split on 1.4, since 1.4 doesn't follow 1.2 sequentially given the .1 step.

The method below only differs in the if statement, so this logic is fairly customizable to whatever your needs may be.

import re

indices = []
string =  "1.3 Hello how are you 1.4 I am fine, thanks 1.2 Hi There 1.5 Great!"
regex = re.compile(r"\d+\.\d+")
lastFloat = 0

for m in regex.finditer(string):
    x = float(m.group())
    if (lastFloat == 0) or (x == round(lastFloat + .1, 1)):
        lastFloat = x
        indices.append(m.start(0))

print([string[i:j] for i,j in zip(indices, indices[1:]+[None])])
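
As one example of that customization, here is a minimal sketch of the sequential method using decimal.Decimal, which avoids the round() call and the lastFloat == 0 sentinel. The function name split_sequential and the step parameter are hypothetical and not part of the code above.

```python
import re
from decimal import Decimal

def split_sequential(text, step=Decimal("0.1")):
    """Split text at each number that equals the last accepted number plus step."""
    indices = []
    last = None  # no number accepted yet
    for m in re.finditer(r"\d+\.\d+", text):
        x = Decimal(m.group())
        # Decimal arithmetic is exact here, so no rounding is needed
        if last is None or x == last + step:
            last = x
            indices.append(m.start())
    return [text[i:j] for i, j in zip(indices, indices[1:] + [None])]

print(split_sequential("1.3 Hello how are you 1.4 I am fine, thanks 1.2 Hi There 1.5 Great!"))
print(split_sequential("1.1 a 1.2 b 1.4 c"))
```

As with the two methods above, any text before the first number is dropped and the pieces keep their trailing spaces.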

ctwheels answered Oct 08 '22 17:10