
Python dateutil parser, ignore non-date part of string

I am using dateutil to parse picture filenames and sort them according to date. Since not all my pictures have metadata, I am having dateutil guess the date from the filename.

Most of my pictures are named in this format: 2007-09-10_0001.jpg, 2007-09-10_0002.jpg, etc.

import os
from dateutil import parser as dateParser  # assumed import; the snippet did not show it

# `file` is the name of the picture currently being processed
fileName = os.path.splitext(file)[0]
print("Guessing date from ", fileName)
try:
    dateString = dateParser.parse(file, fuzzy=True)
    print("Guessed date", dateString)
    year = dateString.year
    month = dateString.month
    day = dateString.day
except ValueError:
    print("Unable to determine date of ", file)

The output I am getting is this:

('Guessing date from ', '2007-09-10_00005')
('Unable to determine date of ', '2007-09-10_00005.jpg')

Now, I could strip everything after the underscore, but I wanted a more robust solution if possible, in case I have pictures in another format. I thought fuzzy would find any date in the string and match on that, but apparently it does not work that way.

Is there an easy way to get the parser to find anything that looks like a date and stop after that? If not, what is the easiest way to force the parser to ignore everything after the underscore, or to define multiple date formats with sections to ignore?

Thanks!

deranjer asked Jun 09 '13 16:06


2 Answers

You can try to "reduce" the string as long as you can't decode it:

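from dateutil import parser

def reduce_string(string):
    # Drop the trailing run of digits, then the run of non-digits before it.
    # The i >= 0 checks stop the index from running off the front of the string.
    i = len(string) - 1
    while i >= 0 and '0' <= string[i] <= '9':
        i -= 1
    while i >= 0 and not ('0' <= string[i] <= '9'):
        i -= 1
    return string[:i + 1]

def find_date(string):
    # Keep shortening the string until the parser accepts it or nothing is left.
    while string:
        try:
            dateString = parser.parse(string, fuzzy=True)
            return (dateString.year, dateString.month, dateString.day)
        except (ValueError, TypeError):
            # Some dateutil releases raise TypeError instead of ValueError here.
            pass

        string = reduce_string(string)

    return None

date = find_date('2007-09-10_00005')
if date:
    print(date)
else:
    print("can't decode")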

The idea is to remove the end of the string (any trailing digits, then any non-digits before them) until the parser can decode what is left as a valid date.
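To see which strings such a trimming loop would hand to the parser, here is a small stdlib-only sketch. It uses its own bounds-checked copy of the trimming step, and the helper name `candidates` is made up for illustration; it is not part of dateutil.

```python
def reduce_string(string):
    """Drop the trailing run of digits, then the run of non-digits before it."""
    i = len(string) - 1
    while i >= 0 and string[i].isdigit():
        i -= 1
    while i >= 0 and not string[i].isdigit():
        i -= 1
    return string[:i + 1]


def candidates(string):
    """Return every string the trimming loop would try, in order (hypothetical helper)."""
    seen = []
    while string:
        seen.append(string)
        string = reduce_string(string)
    return seen


print(candidates('2007-09-10_00005'))
# ['2007-09-10_00005', '2007-09-10', '2007-09', '2007']
```

One thing to keep in mind: the shorter candidates such as '2007-09' are also parseable, and dateutil fills any missing fields from its `default` argument (today's date if none is given). So if the full date were invalid, the loop could still return a date whose day or month was never in the filename.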

Guillaume answered Sep 28 '22 18:09


Commenting from the future here to add some more insight into this problem.

While dateutil's fuzzy search is pretty good at picking up dates in normal natural language, it fails on strings like the one above, which carry a lot of numeric and symbol noise. With more recent versions of dateutil, however, when running:

>>> from dateutil.parser import parse
>>> parse('2007-09-10_00005.jpg', fuzzy=True)

parse fails with TypeError: 'NoneType' object is not iterable, which isn't very idiomatic.

An alternative is simply to search for the known date format with a regex. Of course, this varies by use case, but the OP mentioned that most of his filenames start with a date in the format YYYY-MM-DD, which makes them ideal for a regex search:

from dateutil.parser import parse
import re

# Raw string so the \d escapes reach the regex engine unchanged
date_pattern = re.compile(r'\d{4}-\d{2}-\d{2}')

def extract_date(filename):
    # search() finds the date anywhere in the name; match() would only look at the start
    match = date_pattern.search(filename)
    if match:
        return parse(match.group(0))
    return None

extract_date('2007-09-10_00005.jpg')  # datetime.datetime(2007, 9, 10, 0, 0)
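
The question also asked about handling more than one filename format. The same regex idea could be extended along these lines; this is only a sketch using the standard library's `datetime.strptime` instead of dateutil. The second pattern (an eight-digit `YYYYMMDD` run), the `DATE_FORMATS` table, and the name `extract_date_multi` are assumptions made for illustration, not formats the OP listed.

```python
import re
from datetime import datetime

# (pattern, strptime format) pairs, tried in order. Both entries are illustrative.
# The lookarounds keep a pattern from matching inside a longer run of digits.
DATE_FORMATS = [
    (re.compile(r'(?<!\d)\d{4}-\d{2}-\d{2}(?!\d)'), '%Y-%m-%d'),  # 2007-09-10_00005.jpg
    (re.compile(r'(?<!\d)\d{8}(?!\d)'), '%Y%m%d'),                # IMG_20070910_0001.jpg
]


def extract_date_multi(filename):
    """Try each pattern's first match; return the first one that is a real date, else None."""
    for pattern, fmt in DATE_FORMATS:
        match = pattern.search(filename)
        if match is None:
            continue
        try:
            return datetime.strptime(match.group(0), fmt)
        except ValueError:
            # The digits had the right shape but are not a valid date (e.g. month 13).
            continue
    return None


print(extract_date_multi('2007-09-10_00005.jpg'))
print(extract_date_multi('IMG_20070910_0001.jpg'))
```

Anything after the matched date (the `_00005.jpg` part) is ignored, and a filename with no recognizable date returns None, so the caller can fall back to another strategy.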

jayelm answered Sep 28 '22 18:09