How can I parse free-text time intervals in Python, ranging from years to seconds?

Tags: python, time

I would like to parse free-text time intervals like the following, using Python:

  • 1 second
  • 2 minutes
  • 3 hours
  • 4 days
  • 5 weeks
  • 6 months
  • 7 years

Is there a painless way to do this, ideally by simply calling a library function?

I have tried:

  • dateutil.parser.parse(), which understands seconds through hours but not days or more.
  • mx.DateTime.DateTimeDeltaFrom(), which understands through days but fails on weeks or higher, and silently (e.g., it might create an interval of length 0, or parse "2 months" as 2 minutes).

Reid asked Mar 19 '12 18:03


3 Answers

This can be done with an external parsing library, or the built-in re module.

External parsing library

We can write a parser. It doesn't make a huge difference which parser is used. I searched for "python parser" and chose lark because it appeared near the top of the search results.

First, I defined the units as a mapping. This is where more units could be added, if "centuries" or "microseconds" are needed.

Note: For very small or very large numbers, keep in mind timedelta.resolution (one microsecond) and the timedelta.min/timedelta.max limits.

units = {
    "second": timedelta(seconds=1),
    "minute": timedelta(minutes=1),
    "hour":   timedelta(hours=1),
    "day":    timedelta(days=1),
    "week":   timedelta(weeks=1),
    "month":  timedelta(days=30),
    "year":   timedelta(days=365),
}
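
As a rough illustration of the resolution note, the following standard-library sketch shows what happens when a unit is scaled by very small or very large floats. The variable names (fraction, tiny) are made up for this example and are not part of the answer's code.

```python
from datetime import timedelta

# timedelta stores whole microseconds; anything finer is rounded away.
print(timedelta.resolution)            # 0:00:00.000001

# Scaling a unit by a float works, which is what the parser relies on.
fraction = timedelta(seconds=1) * 0.05
print(fraction)                        # 0:00:00.050000

# A value far below one microsecond collapses to a zero-length interval.
tiny = timedelta(seconds=1) * 1e-9
print(tiny == timedelta(0))            # True

# A value beyond timedelta.max raises OverflowError instead of wrapping.
try:
    timedelta(days=365) * 1e9
except OverflowError as error:
    print("too large:", error)
```

So an input such as "0.000000001 seconds" would parse without an error but yield an empty interval, which may or may not be acceptable for a given use case.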

Next, the grammar is defined using lark's variant of EBNF. WS is imported from lark's common grammar and should match any run of whitespace:

time_interval_grammar = r"""
%import common.WS
%import common.NUMBER

?interval: time+
time: value unit _separator?
value: NUMBER -> number
unit: SECOND
    | MINUTE
    | HOUR
    | DAY
    | WEEK
    | MONTH
    | YEAR
_separator: (WS | ",")+

SECOND: /s\w*/i
MINUTE: /mi\w*/i
HOUR:   /h\w*/i
DAY:    /d\w*/i
WEEK:   /w\w*/i
MONTH:  /mo\w*/i
YEAR:   /y\w*/i

%ignore WS
%ignore ","
"""

The grammar should allow arbitrary time intervals to be chained together, with or without commas as separators.

Each time interval's unit can be given as the shortest unique prefix:

second -> s
minute -> mi
hour   -> h
day    -> d
week   -> w
month  -> mo
year   -> y
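
To make the prefix rule concrete, here is a small standard-library sketch of the same idea, independent of lark. PREFIXES and classify_unit are hypothetical names used only for illustration; the logic assumes, as the grammar's \w* suffix implies, that anything following the prefix is accepted without further checks.

```python
from typing import Optional

# Shortest unique prefix for each unit, mirroring the table above.
PREFIXES = {
    "s": "second",
    "mi": "minute",
    "h": "hour",
    "d": "day",
    "w": "week",
    "mo": "month",
    "y": "year",
}


def classify_unit(word: str) -> Optional[str]:
    """Return the unit name whose prefix starts `word`, or None."""
    lowered = word.lower()
    for prefix, name in PREFIXES.items():
        if lowered.startswith(prefix):
            return name
    return None


print(classify_unit("secs"))    # second
print(classify_unit("miasdf"))  # minute
print(classify_unit("m"))       # None
```

A bare "m" is rejected because it is ambiguous between minute and month, while "miasdf" is accepted because only the prefix is checked.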

The following examples, including the ones from the original question, serve as the targets we want to parse:

1 second
2 minutes
3 hours
4 days
5 weeks
6 months
7 years

1 month, 7 years, 2 days, 30 hours, 0.05 seconds
0.0003 years, 100000 seconds
3y 4mo 9min 6d
1mo,3d 1.3e2 hours, 0.04yrs 2mi444

Lastly, I followed one of the lark tutorials and used a transformer:

class IntervalToTimedelta(Transformer):
    def interval(tree: List[timedelta]) -> timedelta:
        "sums all timedeltas"
        return reduce(add, tree, timedelta(seconds=0))

    def time(tree: List[Union[float, timedelta]]) -> timedelta:
        "returns a timedelta representing the value multiplied by its unit"
        return mul(*tree)

    def unit(tokens: List[Token]) -> timedelta:
        """
        converts a unit into a timedelta that represents 1 of the unit type
        """
        return units[tokens[0].type.lower()]

    def number(tokens: List[Token]) -> float:
        "returns the value as a python type"
        return float(tokens[0].value)

The grammar is interpreted by lark.Lark. Because the grammar is compatible with lark's LALR(1) parser, that parser is selected for speed. It also lets the transformer be applied directly during parsing, so no intermediate parse tree has to be built, which saves memory:

time_interval_parser = Lark(
    grammar=time_interval_grammar,
    start="interval",
    parser="lalr",
    transformer=IntervalToTimedelta,
)

This produces a mostly working parser. The complete answer.py file is this:

"""
Example parsing date and time interval with lark
"""
from datetime import timedelta
from functools import reduce
from operator import add, mul
from typing import List, Union

from lark import Lark, Token, Transformer

__all__ = [
    "examples",
    "IntervalToTimedelta",
    "parse",
]

examples = list(
    filter(
        None,
        """
1 second
2 minutes
3 hours
4 days
5 weeks
6 months
7 years

1 month, 0.05 weeks
0.003y, 100000secs
3y 4mo 9min 6d
1mo,3d 1.3e2 hours,
0.04yrs 2miasdf
""".splitlines(),
    )
)

units = {
    "second": timedelta(seconds=1),
    "minute": timedelta(minutes=1),
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
    "month": timedelta(days=30),
    "year": timedelta(days=365),
}

time_interval_grammar = r"""
%import common.WS
%import common.NUMBER

?interval: time+
time: value unit _separator?
value: NUMBER -> number
unit: SECOND
    | MINUTE
    | HOUR
    | DAY
    | WEEK
    | MONTH
    | YEAR
_separator: (WS | ",")+

SECOND: /s\w*/i
MINUTE: /mi\w*/i
HOUR:   /h\w*/i
DAY:    /d\w*/i
WEEK:   /w\w*/i
MONTH:  /mo\w*/i
YEAR:   /y\w*/i

%ignore WS
%ignore ","
"""


class IntervalToTimedelta(Transformer):
    def interval(tree: List[timedelta]) -> timedelta:
        "sums all timedeltas"
        return reduce(add, tree, timedelta(seconds=0))

    def time(tree: List[Union[float, timedelta]]) -> timedelta:
        "returns a timedelta representing the value multiplied by its unit"
        return mul(*tree)

    def unit(tokens: List[Token]) -> timedelta:
        """
        converts a unit into a timedelta that represents 1 of the unit type
        """
        return units[tokens[0].type.lower()]

    def number(tokens: List[Token]) -> float:
        "returns the value as a python type"
        return float(tokens[0].value)


time_interval_parser = Lark(
    grammar=time_interval_grammar,
    start="interval",
    parser="lalr",
    transformer=IntervalToTimedelta,
)

parse = time_interval_parser.parse


if __name__ == "__main__":
    parsed_examples = [(example, parse(example)) for example in examples]
    longest_example = max(map(lambda tup: len(tup[0]), parsed_examples))
    longest_formatted = max(map(lambda tup: len(f"{tup[1]!s}"), parsed_examples))
    longest_parsed = max(map(lambda tup: len(f"<{tup[1]!r}>"), parsed_examples))
    for example, parsed_example in parsed_examples:
        print(
            f"{example: <{longest_example}s} -> "
            f"{parsed_example!s: <{longest_formatted}s} "
            f"{'<' + repr(parsed_example) + '>': >{longest_parsed}s}"
        )

Running the file parses each of the examples:

$ python .\answer.py
1 second            -> 0:00:01                         <datetime.timedelta(seconds=1)>
2 minutes           -> 0:02:00                       <datetime.timedelta(seconds=120)>
3 hours             -> 3:00:00                     <datetime.timedelta(seconds=10800)>
4 days              -> 4 days, 0:00:00                    <datetime.timedelta(days=4)>
5 weeks             -> 35 days, 0:00:00                  <datetime.timedelta(days=35)>
6 months            -> 180 days, 0:00:00                <datetime.timedelta(days=180)>
7 years             -> 2555 days, 0:00:00              <datetime.timedelta(days=2555)>
1 month, 0.05 weeks -> 30 days, 8:24:00   <datetime.timedelta(days=30, seconds=30240)>
0.003y, 100000secs  -> 2 days, 6:03:28     <datetime.timedelta(days=2, seconds=21808)>
3y 4mo 9min 6d      -> 1221 days, 0:09:00 <datetime.timedelta(days=1221, seconds=540)>
1mo,3d 1.3e2 hours, -> 38 days, 10:00:00  <datetime.timedelta(days=38, seconds=36000)>
0.04yrs 2miasdf     -> 14 days, 14:26:00  <datetime.timedelta(days=14, seconds=51960)>

This works fine, and the performance is adequate:

$ python -m timeit -s "from answer import parse, examples" "for example in examples:" " parse(example)"
500 loops, best of 5: 415 usec per loop
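
The figure reported by python -m timeit is the fastest of five repetitions divided by the loop count. A minimal sketch of the same measurement from inside Python follows; best_usec_per_loop and workload are hypothetical names, and workload is only a stand-in for the real "for example in examples: parse(example)" loop.

```python
import timeit


def best_usec_per_loop(func, loops=500, repeat=5):
    """Fastest of `repeat` runs, divided by `loops`, in microseconds."""
    timings = timeit.repeat(func, number=loops, repeat=repeat)
    return min(timings) / loops * 1e6


def workload():
    # Stand-in for parsing every example once.
    return sum(float(text) for text in ("1", "2.5", "1.3e2"))


print(f"{best_usec_per_loop(workload):.2f} usec per loop")
```

The absolute numbers depend on the machine, so only the relative differences between implementations are meaningful.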

Potential improvements

Currently, this does not have any error handling, though this is by omission: lark does raise errors, so the parse() function could catch any that can be handled gracefully.

Some other downsides to this particular implementation:

  • Doesn't type check with mypy --strict
  • It requires the use of a 3rd-party library
  • The grammar could better shape the resulting parse tree

Built-in Regular Expressions

Alternatively, instead of using a library for parsing, regular expressions can be used with the built-in re module.

This has a few disadvantages:

  • Regular expressions are challenging to make flexible
  • Complex regular expressions are difficult to read
  • Regular expressions generally take longer for a human to interpret

It can be faster, though, and should only need the standard library included in CPython.

Using the previous example as a starting point, this is one way regular expressions could be swapped in:

"""
Example parsing date and time interval with re
"""
import re
from datetime import timedelta
from functools import reduce
from operator import add, mul
from typing import List, Tuple

__all__ = [
    "examples",
    "parse",
]

examples = list(
    filter(
        None,
        """
1 second
2 minutes
3 hours
4 days
5 weeks
6 months
7 years

1 month, 0.05 weeks
0.003y, 100000secs
3y 4mo 9min 6d
1mo,3d 1.3e2 hours,
0.04yrs 2miasdf
""".splitlines(),
    )
)


comma = ","
ws = r"\s"
separator = fr"[{ws}{comma}]+"


def unit_name(string: str) -> re.Pattern:
    # Case-insensitive, to match the flag used by the full interval pattern.
    return re.compile(fr"{string}\w*", flags=re.IGNORECASE)


second = unit_name("s")
minute = unit_name("mi")
hour = unit_name("h")
day = unit_name("d")
week = unit_name("w")
month = unit_name("mo")
year = unit_name("y")
units = {
    second: timedelta(seconds=1),
    minute: timedelta(minutes=1),
    hour: timedelta(hours=1),
    day: timedelta(days=1),
    week: timedelta(weeks=1),
    month: timedelta(days=30),
    year: timedelta(days=365),
}
# Flags are not carried over by .pattern, so each compiled regex sets its own.
unit = re.compile(
    "("
    + "|".join(
        regex.pattern for regex in [second, minute, hour, day, week, month, year]
    )
    + ")",
    flags=re.IGNORECASE,
)
digit = r"\d"
integer = fr"({digit}+)"
decimal = fr"({integer}\.({integer})?|\.{integer})"
signed_integer = fr"([+-]?{integer})"
exponent = fr"([eE]{signed_integer})"
float_ = fr"({integer}{exponent}|{decimal}({exponent})?)"
number = re.compile(fr"({float_}|{integer})")
# Without IGNORECASE here, "3 Hours" would pass the interval check below
# but contribute nothing to the result.
time = re.compile(
    fr"(?P<number>{number.pattern}){ws}*(?P<unit>{unit.pattern})",
    flags=re.IGNORECASE,
)
interval = re.compile(fr"({time.pattern}({separator})*)+", flags=re.IGNORECASE)


def normalize_unit(text: str) -> timedelta:
    "maps units to their respective timedelta"
    if not unit.match(text):
        raise ValueError(f"Not a unit: {text}")

    for unit_re in units:
        if unit_re.match(text):
            return units[unit_re]

    raise ValueError(f"No matching unit found: {text}")


def parse(text: str) -> timedelta:
    if not interval.match(text):
        raise ValueError(f"Parser Error: {text}")

    parsed_pairs: List[Tuple[float, timedelta]] = list()
    for match in time.finditer(text):
        parsed_number = float(match["number"])
        parsed_unit = normalize_unit(match["unit"])
        parsed_pairs.append((parsed_number, parsed_unit))

    timedeltas = [mul(*pair) for pair in parsed_pairs]

    return reduce(add, timedeltas, timedelta(seconds=0))


if __name__ == "__main__":
    parsed_examples = [(example, parse(example)) for example in examples]
    longest_example = max(map(lambda tup: len(tup[0]), parsed_examples))
    longest_formatted = max(map(lambda tup: len(f"{tup[1]!s}"), parsed_examples))
    longest_parsed = max(map(lambda tup: len(f"<{tup[1]!r}>"), parsed_examples))
    for example, parsed_example in parsed_examples:
        print(
            f"{example: <{longest_example}s} -> "
            f"{parsed_example!s: <{longest_formatted}s} "
            f"{'<' + repr(parsed_example) + '>': >{longest_parsed}s}"
        )

The number pattern mimics the NUMBER definition in lark's built-in common grammar.
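
As a quick sanity check of that number syntax, here is a condensed, hedged sketch of an equivalent pattern written with re.VERBOSE. NUMBER and is_number are illustrative names, not part of the answer's module. Like the pattern above, it is unsigned.

```python
import re

NUMBER = re.compile(
    r"""
    (?: \d+ \. \d* | \. \d+ ) (?: [eE] [+-]? \d+ )?   # decimal, optional exponent
    | \d+ [eE] [+-]? \d+                              # integer with exponent
    | \d+                                             # plain integer
    """,
    re.VERBOSE,
)


def is_number(text: str) -> bool:
    """True if the whole string is a number in the syntax above."""
    return NUMBER.fullmatch(text) is not None


for candidate in ["7", "5.", ".5", "1.3e2", "1e-3", "1e", "e5", "-3"]:
    print(f"{candidate!r:>8} -> {is_number(candidate)}")
```

Every string accepted here can be passed to float(), which is what the parse() function relies on.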

The performance for this is better:

$ python -m timeit -s "from answer_re import parse, examples" "for example in examples:" " parse(example)"
2000 loops, best of 5: 109 usec per loop

But it's less readable, and making changes to maintain it will require more work.
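
One thing to watch for in the re version: re.match only anchors at the start of the string, so trailing text that is not a valid interval is silently ignored. A hedged sketch of a stricter variant is shown below. It validates with fullmatch and compiles both patterns with IGNORECASE. The names UNITS, TERM, WHOLE, unit_length and parse_strict are hypothetical, and the day-based month and year lengths are kept from the answer.

```python
import re
from datetime import timedelta

# (prefix, length of one unit); no prefix is a prefix of another.
UNITS = [
    ("s", timedelta(seconds=1)),
    ("mi", timedelta(minutes=1)),
    ("h", timedelta(hours=1)),
    ("d", timedelta(days=1)),
    ("w", timedelta(weeks=1)),
    ("mo", timedelta(days=30)),
    ("y", timedelta(days=365)),
]
NUMBER = r"(?:\d+\.\d*|\.\d+|\d+)(?:[eE][+-]?\d+)?"
UNIT = r"(?:s|mi|h|d|w|mo|y)\w*"
TERM = re.compile(rf"(?P<number>{NUMBER})\s*(?P<unit>{UNIT})", re.IGNORECASE)
WHOLE = re.compile(rf"\s*(?:{NUMBER}\s*{UNIT}[\s,]*)+", re.IGNORECASE)


def unit_length(word: str) -> timedelta:
    """Map a unit word to the timedelta for one of that unit."""
    lowered = word.lower()
    for prefix, length in UNITS:
        if lowered.startswith(prefix):
            return length
    raise ValueError(f"Not a unit: {word}")


def parse_strict(text: str) -> timedelta:
    """Parse an interval, rejecting input with unparsable leftovers."""
    if not WHOLE.fullmatch(text):
        raise ValueError(f"Parser Error: {text}")
    total = timedelta(0)
    for match in TERM.finditer(text):
        total += float(match["number"]) * unit_length(match["unit"])
    return total


print(parse_strict("1 month, 0.05 weeks"))  # 30 days, 8:24:00
print(parse_strict("3 Hours"))              # 3:00:00
try:
    parse_strict("1 second garbage")
except ValueError as error:
    print(error)                            # Parser Error: 1 second garbage
```

Whether trailing text should be an error is a design choice; raising is the safer default when the goal is to avoid silently wrong intervals.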

Notes

As-is, both examples behave in a way that doesn't quite match up with how humans expect time intervals to work:

>>> from answer_re import parse
>>> from datetime import datetime
>>> datetime(2000, 1, 1) + parse("9 years")
datetime.datetime(2008, 12, 29, 0, 0)
>>> str(_)
'2008-12-29 00:00:00'

Compare this to what most people would expect it to be:

9 years from 2000-01-01 results in 2009-01-01
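
The three-day gap comes from leap years: a fixed 365-day year ignores the leap days in 2000, 2004 and 2008. A small standard-library sketch of the difference is below. The variable names are illustrative, and the replace() call is only a stand-in for calendar-aware arithmetic (it would raise ValueError for a 29 February start date when the target year is not a leap year).

```python
from datetime import datetime, timedelta

start = datetime(2000, 1, 1)

# Fixed-length years, as in the units mapping above.
fixed = start + 9 * timedelta(days=365)

# Calendar years: same month and day, nine years later.
calendar = start.replace(year=start.year + 9)

print(fixed)                    # 2008-12-29 00:00:00
print(calendar)                 # 2009-01-01 00:00:00
print((calendar - fixed).days)  # 3
```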

This Stack Overflow question provides a few solutions, one of which uses dateutil. Both of the examples above can be adapted by changing the units mapping to use the appropriate relativedelta objects (from dateutil.relativedelta).

This is what the first example would look like:

...

units = {
    "second": relativedelta(seconds=1),
    "minute": relativedelta(minutes=1),
    "hour": relativedelta(hours=1),
    "day": relativedelta(days=1),
    "week": relativedelta(weeks=1),
    "month": relativedelta(months=1),
    "year": relativedelta(years=1),
}

...

This returns what's expected:

>>> from answer_with_dateutil import parse
>>> from datetime import datetime
>>> datetime(2000, 1, 1) + parse("9 years")
datetime.datetime(2009, 1, 1, 0, 0)
>>> str(_)
'2009-01-01 00:00:00'

Also, the use of f-strings and type annotations restricts this to Python 3.6 and up, though this can be changed to use str.format instead for Python 3.5+.

Conclusion

With the currently accepted answer in the running, this is the performance for the more normal examples given in the original question:

Note: the commands below use PowerShell's backtick for line continuation; for sh, replace ` with \ in the following commands

$ python -m timeit -s "from answer import examples;examples = examples[:7]" `
    -s "from parsedatetime import Calendar; from datetime import datetime" `
    -s "parse = Calendar().parseDT; now = datetime.now()" `
    "for example in examples:" " parse(example)[0] - now"
1000 loops, best of 5: 232 usec per loop

$ python -m timeit -s "from answer_re import examples;examples = examples[:7]" `
    -s "from answer import parse" `
    "for example in examples:" " parse(example)"
2000 loops, best of 5: 157 usec per loop

$ python -m timeit -s "from answer_re import examples;examples = examples[:7]" `
    -s "from answer_re import parse" `
    "for example in examples:" " parse(example)"
10000 loops, best of 5: 39.5 usec per loop

The performance differences are largely negligible for a large variety of use cases.

Currently, the easiest option is the one given in the currently accepted answer:

Unless very custom parsing is needed, use parsedatetime.

Original answer

This is not a solution, because dateutil can parse points in time but not intervals. The original claim is kept here for reference:

dateutil now supports all of the originally requested intervals:

from dateutil.parser import parse

examples = """
August 3rd, 2019
2019-08-03
2019, 3rd aug, 2:45 pm
"""

formatted_examples = [
    (example, f"{(p := parse(example))} <{p!r}>")
    for example in filter(None, examples.splitlines())
]
longest_example = max(map(lambda tup: len(tup[0]), formatted_examples))
longest_parsed = max(map(lambda tup: len(tup[1]), formatted_examples))

for example, parsed_example in formatted_examples:
    print(f"{example: <{longest_example}s} -> {parsed_example: >{longest_parsed}s}")

On PyPI, the package is called python-dateutil.

Matthew Willcockson answered Oct 14 '22 23:10


How about the pytimeparse library?

It returns the time as a number of seconds:

>>> from pytimeparse.timeparse import timeparse
>>> timeparse('33m')
1980
>>> timeparse('2h33m')
9180
>>> timeparse('4:17')
257
>>> timeparse('5hr34m56s')
20096
>>> timeparse('1.2 minutes')
72

The source seems to be here: https://github.com/wroberts/pytimeparse
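
Since the question asks for intervals, the returned seconds would typically be wrapped in a timedelta. The sketch below uses only the standard library and copies the second counts from the session above instead of calling pytimeparse, so it makes no assumptions about that library's API beyond what is shown there.

```python
from datetime import timedelta

# Second counts copied from the pytimeparse session above.
seconds_by_input = {"33m": 1980, "2h33m": 9180, "5hr34m56s": 20096}

for text, seconds in seconds_by_input.items():
    print(f"{text:<10} -> {timedelta(seconds=seconds)}")
```

This prints 0:33:00, 2:33:00 and 5:34:56 for the three inputs.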

Pavel T answered Oct 14 '22 22:10


This one is new to me, but based on some googling: have you tried whoosh?

Edit: There's also parsedatetime:

#!/usr/bin/env python
from datetime import datetime
import parsedatetime as pdt # $ pip install parsedatetime

cal = pdt.Calendar()
for time_str in ['1 second', '2 minutes','3 hours','5 weeks','6 months','7 years']:
    diff = cal.parseDT(time_str, sourceTime=datetime.min)[0] - datetime.min
    print("{time_str:<10} -> {diff!s:>20} <{diff!r}>".format(**vars()))

Output

1 second   ->              0:00:01 <datetime.timedelta(0, 1)>
2 minutes  ->              0:02:00 <datetime.timedelta(0, 120)>
3 hours    ->              3:00:00 <datetime.timedelta(0, 10800)>
5 weeks    ->     35 days, 0:00:00 <datetime.timedelta(35)>
6 months   ->    181 days, 0:00:00 <datetime.timedelta(181)>
7 years    ->   2556 days, 0:00:00 <datetime.timedelta(2556)>
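
Note that "6 months" comes out as 181 days and "7 years" as 2556 days, not 180 and 2555. Assuming parsedatetime resolves each phrase to a calendar date relative to sourceTime (which is what the output suggests), the result depends on the anchor date. The standard-library sketch below reproduces those two numbers for the datetime.min anchor used above; the variable names are illustrative.

```python
from datetime import datetime

origin = datetime.min  # 0001-01-01 00:00:00, the sourceTime used above

# Six calendar months after the origin: January through June of year 1.
six_months = datetime(1, 7, 1) - origin

# Seven calendar years after the origin: includes the leap day of year 4.
seven_years = datetime(8, 1, 1) - origin

print(six_months.days)   # 181
print(seven_years.days)  # 2556
```

With a different sourceTime the same phrases can therefore produce slightly different lengths.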

Bobby W answered Oct 14 '22 23:10