Simplify regular expression for time literals (like "10h50m")

Question

I am writing lexer rules for a custom description language using pyLR1. The language shall include time literals such as:

10h30m     # meaning 10 hours + 30 minutes
5m30s      # meaning 5 minutes + 30 seconds
10h20m15s  # meaning 10 hours + 20 minutes + 15 seconds
15.6s      # meaning 15.6 seconds

The order of the hour, minute and second parts shall be fixed to h, m, s. In detail, I want the following valid combinations: hms, hm, h, ms, m and s (with a number in front of each unit letter, of course). As a bonus, the regex should check for decimal (i.e. non-natural) numbers in the segments and only allow them in the least significant segment.

So I have for all but the last group a number match like:

([0-9]+)

And for the last group even:

([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)  # to allow for .5 and 0.5 and 5.0 and 5
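As a quick sanity check of the last-group pattern above, the following sketch runs it against the forms named in the comment, plus a few edge cases. The helper name `is_last_number` is made up for illustration.

```python
import re

# The "last group" number pattern from the question, used unchanged.
LAST_NUMBER = re.compile(r"([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)")


def is_last_number(text):
    """Return True if the whole string is accepted by the pattern."""
    return LAST_NUMBER.fullmatch(text) is not None


if __name__ == "__main__":
    # ".5", "0.5", "5.0" and "5" are the forms listed in the question;
    # "5.", "." and "" are extra edge cases.
    for sample in [".5", "0.5", "5.0", "5", "5.", ".", ""]:
        print(repr(sample), is_last_number(sample))
```

Note that the pattern also accepts a trailing dot such as 5., while a lone dot and the empty string are rejected.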

Going through all the combinations of h, m and s, a cute little Python script gives me the following regex:

(([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)h|([0-9]+)h([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)m|([0-9]+)h([0-9]+)m([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s|([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)m|([0-9]+)m([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s|([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s)

Obviously, this is a bit of a horror of an expression. Is there any way to simplify it? The answer must work with Python's re module; I will also accept answers that do not work with pyLR1 if that is due to its restricted subset of regular expressions.
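The generator script itself is not shown here. A minimal sketch of how such a script might look, assuming it simply enumerates the contiguous slices of "hms" and puts the decimal-capable pattern on the last segment only (all names below are made up for illustration):

```python
import re

INT = r"([0-9]+)"                              # all but the last segment
DEC = r"([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)"    # last segment only

UNITS = "hms"
# Contiguous slices of "hms" are exactly the allowed unit sequences:
# h, hm, hms, m, ms, s (no "hs", because that would skip the minutes).
UNIT_COMBOS = [
    UNITS[i:j]
    for i in range(len(UNITS))
    for j in range(i + 1, len(UNITS) + 1)
]


def build_pattern():
    """Join one alternative per unit sequence into a single regex string."""
    alternatives = []
    for combo in UNIT_COMBOS:
        parts = [INT + unit for unit in combo[:-1]]  # integer-only segments
        parts.append(DEC + combo[-1])                # decimals allowed here
        alternatives.append("".join(parts))
    return "(" + "|".join(alternatives) + ")"


PATTERN = re.compile(build_pattern())

if __name__ == "__main__":
    print(build_pattern())
```

With the alternatives in this order, the printed string should have the same shape as the long expression in the question.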

huon · Accepted Answer

You can factorise your regular expression. Using h, m, s to denote each of the subregexes, the most basic version is:

h|hm|hms|ms|m|s

which is what you have currently. You can break this into:

(h|hm|hms)|(ms|m)|s

and then pulling out h from the first expression and m from the second we get (using (x|) == x?):

h(m|ms)?|ms?|s

Continuing on we get to

h(ms?)?|ms?|s

which is probably simpler (and probably the simplest).
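To double-check that each factoring step accepts the same set of unit sequences, here is a small brute-force sketch. It treats h, m and s as literal letters (standing in for the real subregexes) and compares the four forms on every word over those letters up to length 4; the helper name `accepted` is made up.

```python
import re
from itertools import product

# The four forms from the steps above, with h, m, s as plain letters.
FORMS = [
    r"h|hm|hms|ms|m|s",
    r"(h|hm|hms)|(ms|m)|s",
    r"h(m|ms)?|ms?|s",
    r"h(ms?)?|ms?|s",
]


def accepted(pattern, max_len=4):
    """Return every word over {h, m, s} up to max_len that fully matches."""
    compiled = re.compile(pattern)
    words = (
        "".join(letters)
        for length in range(max_len + 1)
        for letters in product("hms", repeat=length)
    )
    return {word for word in words if compiled.fullmatch(word)}


if __name__ == "__main__":
    for form in FORMS:
        print(form, sorted(accepted(form)))
```

Every form should print the same six words, which is what makes the rewrite safe.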

Adding in the regex d to denote decimals (as in \.[0-9]+), this could be written as

h(d|m(d|sd?)?)?|m(d|sd?)?|sd?

(i.e. at each stage, optionally have either decimals or a continuation to the next of h, m or s.)

This would result in something like (for just hours and minutes):

[0-9]+((\.[0-9]+)?h|h[0-9]+(\.[0-9]+)?m)|[0-9]+(\.[0-9]+)?m
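Extending the same idea to seconds gives a pattern like the one sketched below. This is one possible expansion, not necessarily the shortest; the names `N`, `D`, `MS_TAIL` and `HMS_TAIL` are made up, and `D` follows the \.[0-9]+ definition of decimals used above, so forms like .5s are not accepted.

```python
import re

N = r"[0-9]+"          # integer part of a segment
D = r"(\.[0-9]+)?"     # optional decimals, as in \.[0-9]+

# What may follow the digits of a minutes segment:
# either (decimals and) "m" and stop, or "m" plus a seconds segment.
MS_TAIL = rf"({D}m|m{N}{D}s)"

# What may follow the digits of an hours segment:
# either (decimals and) "h" and stop, or "h" plus a minutes segment.
HMS_TAIL = rf"({D}h|h{N}{MS_TAIL})"

TIME = re.compile(rf"{N}{HMS_TAIL}|{N}{MS_TAIL}|{N}{D}s")

if __name__ == "__main__":
    for sample in ["10h", "10.5h", "10h30.5m", "10h30m15.5s", "10.5h30m", "10h15s"]:
        print(sample, TIME.fullmatch(sample) is not None)
```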

Looking at this, it might not be possible to get it into a form amenable to pyLR1, so parsing with decimals allowed in every spot and then doing a secondary check might be the best way to do this.
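The lenient-match-then-check idea could look roughly like this in plain Python. It is only a sketch: how pyLR1 hooks a check into a lexer action is not covered here, and `parse_time`, `LENIENT` and `SEGMENT` are hypothetical names. Returning the total in seconds is also just an assumption about what the caller wants.

```python
import re

# Number with decimals allowed everywhere (same forms as in the question).
NUM = r"(?:[0-9]*\.[0-9]+|[0-9]+(?:\.[0-9]*)?)"
H, M, S = NUM + "h", NUM + "m", NUM + "s"

# Step 1: lenient token shape, h(ms?)?|ms?|s with decimals in every segment.
LENIENT = re.compile(rf"{H}(?:{M}(?:{S})?)?|{M}(?:{S})?|{S}")
# Used in step 2 to split an accepted token into (number, unit) pairs.
SEGMENT = re.compile(rf"({NUM})([hms])")

SECONDS_PER_UNIT = {"h": 3600, "m": 60, "s": 1}


def parse_time(text):
    """Return the literal's value in seconds, or raise ValueError."""
    if LENIENT.fullmatch(text) is None:
        raise ValueError(f"not a time literal: {text!r}")
    segments = SEGMENT.findall(text)
    # Step 2: the secondary check. Only the last segment may be a decimal.
    for number, unit in segments[:-1]:
        if "." in number:
            raise ValueError(f"decimal only allowed in last segment: {text!r}")
    return sum(float(number) * SECONDS_PER_UNIT[unit] for number, unit in segments)


if __name__ == "__main__":
    print(parse_time("10h20m15s"))
```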

c00kiemon5ter · Answer

the representation below should be understandable. I don't know the exact regex syntax you're using, so you'll have to "translate" it to the valid syntax yourself.

your hours

[0-9]{1,2}h

your minutes

[0-9]{1,2}m

your seconds

[0-9]{1,2}(\.[0-9]{1,3})?s

you want all of those in order, with the ability to omit any of them (wrap each with ?)

([0-9]{1,2}h)?([0-9]{1,2}m)?([0-9]{1,2}(\.[0-9]{1,3})?s)?

this however matches things like: 10h30s
that is, the valid combinations become hms, hm, hs, h, ms, m and s
or in other words, minutes can be omitted while still having hours and seconds.

the other problem is that the empty string also matches, since all three ? make that valid. so you have to work around this somehow.
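Both problems are easy to reproduce with Python's re module. The sketch below also shows one possible workaround for the empty string, a lookahead that demands a leading digit. That lookahead is an assumption added here, not part of the answer, and it likely falls outside pyLR1's regex subset. It does nothing about the hs case.

```python
import re

# The all-optional pattern from above, unchanged.
ALL_OPTIONAL = re.compile(
    r"([0-9]{1,2}h)?([0-9]{1,2}m)?([0-9]{1,2}(\.[0-9]{1,3})?s)?"
)

# Assumed workaround: require at least one digit up front so that the
# empty string no longer matches. "10h30s" is still accepted.
NON_EMPTY = re.compile(r"(?=[0-9])" + ALL_OPTIONAL.pattern)

if __name__ == "__main__":
    for sample in ["", "10h30s", "10h30m", "30m10h"]:
        print(
            repr(sample),
            ALL_OPTIONAL.fullmatch(sample) is not None,
            NON_EMPTY.fullmatch(sample) is not None,
        )
```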

looking at @dbaupp's h(ms?)?|ms?|s, you can take the above and substitute:

h: [0-9]{1,2}h
m: [0-9]{1,2}m
s: [0-9]{1,2}(\.[0-9]{1,3})?s

so you get to:

h(ms?)?: [0-9]{1,2}h([0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?)?
  ms?  :             [0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?
   s   :                         [0-9]{1,2}(\.[0-9]{1,3})?s

all those OR'd together give you a big but easy to break down regex:

[0-9]{1,2}h([0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?)?|[0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?|[0-9]{1,2}(\.[0-9]{1,3})?s

which gets rid of both the empty string problem and the match of hs.
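The same regex can be assembled from the three pieces with balanced parentheses, which keeps it readable and makes it easy to test. A sketch using Python f-strings (the names `H`, `M`, `S` and `NESTED` are made up):

```python
import re

H = r"[0-9]{1,2}h"
M = r"[0-9]{1,2}m"
S = r"[0-9]{1,2}(\.[0-9]{1,3})?s"

# h(ms?)?|ms?|s with each letter replaced by its subregex.
NESTED = re.compile(rf"{H}({M}({S})?)?|{M}({S})?|{S}")

if __name__ == "__main__":
    for sample in ["10h30m", "10h20m15s", "15.6s", "", "10h30s", "100h"]:
        print(repr(sample), NESTED.fullmatch(sample) is not None)
```

Note that the {1,2} quantifiers reject three-digit segments such as 100h, and decimals are only possible on seconds.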

looking at @Donal Fellows' comment on @dbaupp's answer, I'll also do (h?m)?S|h?M|H

(h?m)?s: (([0-9]{1,2}h)?[0-9]{1,2}m)?[0-9]{1,2}(\.[0-9]{1,3})?s
 h?m   :  ([0-9]{1,2}h)?[0-9]{1,2}m
 h     :   [0-9]{1,2}h

and merged together, you end up with something smaller than the above:

(([0-9]{1,2}h)?[0-9]{1,2}m)?[0-9]{1,2}(\.[0-9]{1,3})?s|([0-9]{1,2}h)?[0-9]{1,2}m|[0-9]{1,2}h
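To gain some confidence that this shorter form accepts the same strings as the h(ms?)?|ms?|s form, the two can be compared on generated candidates. The sketch below concatenates up to three sample tokens in every order; the token list and helper names are made up, and agreement on these samples is evidence rather than a proof.

```python
import re
from itertools import product

H = r"[0-9]{1,2}h"
M = r"[0-9]{1,2}m"
S = r"[0-9]{1,2}(\.[0-9]{1,3})?s"

NESTED = re.compile(rf"{H}({M}({S})?)?|{M}({S})?|{S}")   # h(ms?)?|ms?|s
SUFFIX = re.compile(rf"(({H})?{M})?{S}|({H})?{M}|{H}")   # (h?m)?s|h?m|h

TOKENS = ["10h", "30m", "15s", "1.5s"]


def candidates(max_tokens=3):
    """Yield every concatenation of 0..max_tokens sample tokens."""
    for count in range(max_tokens + 1):
        for combo in product(TOKENS, repeat=count):
            yield "".join(combo)


def accepted(pattern):
    """Return the set of candidate strings the pattern fully matches."""
    return {text for text in candidates() if pattern.fullmatch(text)}


if __name__ == "__main__":
    print(accepted(NESTED) == accepted(SUFFIX))
    print(sorted(accepted(SUFFIX)))
```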

now we have to find a way to match the .xx decimal representation

Qtax · Answer

Here is a short Python expression that works:

(\d+h)?(\d+m)?(\d*\.\d+|\d+(\.\d*)?)(?(2)s|(?(1)m|[hms]))

Inspired by Cameron Martin's answer based on conditionals.

Explained:

(\d+h)?                 # optional int "h" (capture 1)
(\d+m)?                 # optional int "m" (capture 2)
(\d*\.\d+|\d+(\.\d*)?)  # int or decimal 
(?(2)                   # if "m" (capture 2) was matched:
  s                       # "s"
| (?(1)                 # else if "h" (capture 1) was matched:
  m                       # "m"
|                       # else (nothing matched):
  [hms]))                 # any of the "h", "m" or "s"
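The commented layout above can be compiled directly with re.VERBOSE, which ignores the whitespace and the # comments. A small sketch with a few checks (the name `TIME_LITERAL` is made up; the conditional (?(n)yes|no) syntax is specific to Python's re and a few other engines):

```python
import re

TIME_LITERAL = re.compile(
    r"""
    (\d+h)?                 # optional integer hours   (group 1)
    (\d+m)?                 # optional integer minutes (group 2)
    (\d*\.\d+|\d+(\.\d*)?)  # final number, integer or decimal
    (?(2)                   # if group 2 (minutes) matched:
      s                     #   the final unit must be s
    | (?(1)                 # else if group 1 (hours) matched:
      m                     #   the final unit must be m
    |                       # else:
      [hms]))               #   any unit is allowed
    """,
    re.VERBOSE,
)

if __name__ == "__main__":
    for sample in ["10h30m", "5m30s", "10h20m15s", "15.6s", "10h30s", "10.5h30m"]:
        print(sample, TIME_LITERAL.fullmatch(sample) is not None)
```

A literal such as 10h on its own still matches: the engine backtracks out of group 1 and treats 10 as the final number with unit h.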

Tags: regex, time, parsing, lexical-analysis

Jonas Schäfer
