 

Simplify regular expression for time literals (like "10h50m")

I am writing lexer rules for a custom description language using pyLR1, which should include time literals such as:

10h30m     # meaning 10 hours + 30 minutes
5m30s      # meaning 5 minutes + 30 seconds
10h20m15s  # meaning 10 hours + 20 minutes + 15 seconds
15.6s      # meaning 15.6 seconds

The order of specification for hour, minute and second parts shall be fixed to h, m, s. To specify this in detail, I want the following valid combinations hms, hm, h, ms, m and s (with numbers between the different segments of course). As a bonus the regex should check for decimal (i.e. non-natural) numbers in the segments and only allow these in the segment with least significance.

So I have for all but the last group a number match like:

([0-9]+)

And for the last group even:

([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)  # to allow for .5 and 0.5 and 5.0 and 5

Going through all the combinations of h, m and s, a cute little Python script gives me the following regex:

(([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)h|([0-9]+)h([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)m|([0-9]+)h([0-9]+)m([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s|([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)m|([0-9]+)m([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s|([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s) 
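The original generator script is not shown. A minimal sketch of how such a script could look, assuming the valid combinations are exactly the contiguous runs of "hms" (the names `INT`, `DEC` and `build_time_regex` are hypothetical):

```python
import re

INT = r"([0-9]+)"                                # whole numbers only
DEC = r"([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)"      # .5, 0.5, 5.0 and 5
UNITS = "hms"

def build_time_regex():
    """Join one alternative per contiguous run of units: h, hm, hms, m, ms, s."""
    alternatives = []
    for start in range(len(UNITS)):
        for end in range(start + 1, len(UNITS) + 1):
            run = UNITS[start:end]
            # Every segment but the last takes an integer; the last may be decimal.
            parts = [INT + unit for unit in run[:-1]] + [DEC + run[-1]]
            alternatives.append("".join(parts))
    return "(" + "|".join(alternatives) + ")"

TIME_RE = re.compile(build_time_regex())
```

Used with `TIME_RE.fullmatch(text)`, this accepts the four example literals above and rejects inputs such as 10h30s or 10.5h30m.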

Obviously, this is a bit of a horror of an expression. Is there any way to simplify it? The answer must work with Python's re module, and I will also accept answers which do not work with pyLR1 if that is due to its restricted subset of regular expressions.

Jonas Schäfer asked Jul 02 '12 11:07


3 Answers

You can factorise your regular expression. Using the notation h, m, s to denote each of the subregexes, the most basic version is:

h|hm|hms|ms|m|s

which is what you have currently. You can break this into:

(h|hm|hms)|(ms|m)|s

and then pulling out h from the first expression and m from the second we get (using (x|) == x?):

h(m|ms)?|ms?|s

Continuing on we get to

h(ms?)?|ms?|s

which is probably simpler (and probably the simplest).
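As a quick sanity check, the flat and factored forms can be compared with Python's re module. This is only a sketch: it assumes plain integer segments, and the sample strings are made up for illustration.

```python
import re

NUM = r"[0-9]+"
h, m, s = NUM + "h", NUM + "m", NUM + "s"

# Flat alternation, one branch per valid combination.
flat = "|".join([h, h + m, h + m + s, m + s, m, s])
# Factored form h(ms?)?|ms?|s with the fragments substituted in.
factored = f"{h}({m}({s})?)?|{m}({s})?|{s}"

samples = ["10h", "10h30m", "10h20m15s", "5m30s", "5m", "15s",
           "10h30s", "30m10h", "10", ""]
for text in samples:
    # Both forms should agree on every sample, valid or not.
    assert bool(re.fullmatch(flat, text)) == bool(re.fullmatch(factored, text))
```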


Adding in the regex d to denote decimals (as in \.[0-9]+), this could be written as

h(d|m(d|sd?)?)?|m(d|sd?)?|sd?

(i.e. at each stage optionally have either decimals, or a continuation to the next of h m or s.)

This would result in something like (for just hours and minutes):

[0-9]+((\.[0-9]+)?h|h[0-9]+(\.[0-9]+)?m)|[0-9]+(\.[0-9]+)?m

Looking at this, it might not be possible to get it into a form amenable to pyLR1, so parsing with decimals allowed in every spot and then doing a secondary check might be the best way to do this.
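A rough sketch of that two-stage idea in plain Python follows. It is not pyLR1 code: the non-capturing groups may not be available there, and `parse_time`, `LOOSE_TIME` and `SEGMENT` are hypothetical names. The loose pattern allows a decimal in every segment, and the secondary check rejects decimals anywhere but the last one.

```python
import re

NUM = r"(?:[0-9]*\.[0-9]+|[0-9]+(?:\.[0-9]*)?)"
# Stage 1: the h(ms?)?|ms?|s shape, with decimals allowed in every segment.
LOOSE_TIME = re.compile(f"{NUM}h(?:{NUM}m(?:{NUM}s)?)?|{NUM}m(?:{NUM}s)?|{NUM}s")
SEGMENT = re.compile(f"({NUM})([hms])")
SECONDS_PER_UNIT = {"h": 3600, "m": 60, "s": 1}

def parse_time(text):
    """Return the literal's value in seconds, or raise ValueError."""
    if not LOOSE_TIME.fullmatch(text):
        raise ValueError(f"not a time literal: {text!r}")
    segments = SEGMENT.findall(text)
    # Stage 2: only the least significant segment may carry a decimal point.
    for number, _unit in segments[:-1]:
        if "." in number:
            raise ValueError(f"decimal only allowed in the last segment: {text!r}")
    return sum(float(number) * SECONDS_PER_UNIT[unit] for number, unit in segments)
```

Doing the check after the match also allows a more specific error message than a plain lexer failure would give.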

huon answered Oct 05 '22 06:10


the representation below should be understandable. I don't know the exact regex syntax you're using, so you have to "translate" it to the valid syntax yourself.

your hours

 [0-9]{1,2}h

your minutes

[0-9]{1,2}m

your seconds

[0-9]{1,2}(\.[0-9]{1,3})?s

you want all those in order, and able to omit any of those (wrap with ?)

([0-9]{1,2}h)?([0-9]{1,2}m)?([0-9]{1,2}(\.[0-9]{1,3})?s)?

this, however, also matches things like 10h30s.
that is, the combinations it accepts are hms, hm, hs, h, ms, m and s.
in other words, the minutes can be omitted while hours and seconds are both present.

the other problem is that the empty string is matched too, as all three ? make that valid. so you have to work around this somehow. hmm
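Both problems are easy to reproduce with Python's re module. The lookahead in the second pattern is just one possible workaround for the empty string (an assumption, not part of the answer), and it does nothing about the hs case:

```python
import re

OPTIONAL = re.compile(
    r"([0-9]{1,2}h)?([0-9]{1,2}m)?([0-9]{1,2}(\.[0-9]{1,3})?s)?"
)

# The intended literal matches, but so do the two problem cases.
assert OPTIONAL.fullmatch("10h30m")
assert OPTIONAL.fullmatch("10h30s")   # minutes skipped
assert OPTIONAL.fullmatch("")         # nothing at all

# Requiring a digit up front rejects the empty string only.
NONEMPTY = re.compile(r"(?=[0-9])" + OPTIONAL.pattern)
assert not NONEMPTY.fullmatch("")
assert NONEMPTY.fullmatch("10h30s")   # still accepted
```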


looking at @dbaupp h(ms?)?|ms?|s you can take the above and match:

h: [0-9]{1,2}h
m: [0-9]{1,2}m
s: [0-9]{1,2}(\.[0-9]{1,3})?s

so you get to:

h(ms?)?: [0-9]{1,2}h([0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?)?
  ms?  :             [0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?
   s   :                         [0-9]{1,2}(\.[0-9]{1,3})?s

all those OR'd together give you a big but easy to break down regex:

[0-9]{1,2}h([0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?)?|[0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?|[0-9]{1,2}(\.[0-9]{1,3})?s

which gets rid of both the empty string problem and the match of hs.


looking at @Donal Fellows' comment on @dbaupp's answer, I'll also do (h?m)?s|h?m|h

(h?m)?s: (([0-9]{1,2}h)?[0-9]{1,2}m)?[0-9]{1,2}(\.[0-9]{1,3})?s
 h?m   :  ([0-9]{1,2}h)?[0-9]{1,2}m
 h     :   [0-9]{1,2}h

and merged together, you end up with something smaller than the above:

(([0-9]{1,2}h)?[0-9]{1,2}m)?[0-9]{1,2}(\.[0-9]{1,3})?s|([0-9]{1,2}h)?[0-9]{1,2}m|[0-9]{1,2}h

now we have to find a way to match the .xx decimal representation
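One possible way, sketched under two assumptions: it reuses the number pattern from the question for the last segment, and it drops the {1,2} and {1,3} length limits. The names `INT`, `DEC` and `TIME` are hypothetical.

```python
import re

INT = r"[0-9]+"
DEC = r"(?:[0-9]*\.[0-9]+|[0-9]+(?:\.[0-9]*)?)"   # .5, 0.5, 5.0 and 5

# (h?m)?s | h?m | h, with the decimal allowed only in the last segment.
TIME = re.compile(
    rf"(?:(?:{INT}h)?{INT}m)?{DEC}s"
    rf"|(?:{INT}h)?{DEC}m"
    rf"|{DEC}h"
)

assert TIME.fullmatch(".5s")
assert TIME.fullmatch("10h.5m")
assert TIME.fullmatch("10h20m15.25s")
assert not TIME.fullmatch("10.5h30m")   # decimal in a non-final segment
assert not TIME.fullmatch("10h30s")     # minutes skipped
```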

c00kiemon5ter answered Oct 05 '22 06:10


Here is a short Python expression that works:

(\d+h)?(\d+m)?(\d*\.\d+|\d+(\.\d*)?)(?(2)s|(?(1)m|[hms]))

Inspired by Cameron Martin's answer based on conditionals.

Explained:

(\d+h)?                 # optional int "h" (capture 1)
(\d+m)?                 # optional int "m" (capture 2)
(\d*\.\d+|\d+(\.\d*)?)  # int or decimal 
(?(2)                   # if "m" (capture 2) was matched:
  s                       # "s"
| (?(1)                 # else if "h" (capture 1) was matched:
  m                       # "m"
|                       # else (nothing matched):
  [hms]))                 # any of the "h", "m" or "s"
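
The commented form can be compiled nearly as-is with re.VERBOSE. A small sketch (the name `TIME` and the sample strings are only for illustration), using fullmatch so that a valid prefix of an invalid literal is not accepted:

```python
import re

TIME = re.compile(r"""
    (\d+h)?                  # optional int "h" (capture 1)
    (\d+m)?                  # optional int "m" (capture 2)
    (\d*\.\d+|\d+(\.\d*)?)   # int or decimal
    (?(2)s|(?(1)m|[hms]))    # "s" after "m", "m" after "h", else any unit
""", re.VERBOSE)

for text in ["10h30m", "5m30s", "10h20m15s", "15.6s", "10h"]:
    assert TIME.fullmatch(text), text

for text in ["10h30s", "10.5h30m", "30m10h", ""]:
    assert not TIME.fullmatch(text), text
```

Note that in Python 3, \d on a str pattern also matches non-ASCII digits unless re.ASCII is passed. Use [0-9] if that matters.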

Qtax answered Oct 05 '22 07:10