How to improve performance of text parsing function?

I am trying to parse the vintage year from the titles of wines. I seem to be getting around 50% accuracy with the code below and would like to improve on that if possible. Does anybody know what I can do to improve accuracy?

Example titles and their parsed year being returned:

Quinta dos Avidagos 2011 Avidagos Red (Douro) -> 0 incorrect
Rainstorm 2013 Pinot Gris (Willamette Valley) -> 2011 incorrect
Louis M. Martini 2012 Cabernet Sauvignon -> 2012 correct
Mirassou 2012 Chardonnay (Central Coast) -> 2012 correct

Code I am implementing:

from dateutil.parser import parse
from datetime import datetime, timezone

df = "my pandas dataframe with wine titles"
dt = datetime.now()
dt.replace(tzinfo=timezone.utc)

year_parse = []
for i in range(len(df['title'])):
    try:
        ans = parse(df.title[i], fuzzy=True).year
        year_parse.append(int(ans))
    except:
        ans = 0
        year_parse.append(int(ans))

Very grateful for any suggestions!

asked Dec 08 '25 06:12 by plunderbuss


1 Answer

You can use a regex for this, assuming every wine title follows the same pattern, with the vintage as the first four-digit number in the title.

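import re

exp = re.compile(r'\d{4}')
year_parse = []
for name in df['title']:
    years = exp.findall(name)
    # Fall back to 0 when a title has no four-digit number (same fallback as the question)
    year_parse.append(int(years[0]) if years else 0)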

year_parse now holds the year parsed from each title.
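A bare `\d{4}` will also match any other four-digit number in a title, so a slightly stricter pattern may help. The following is only a sketch, assuming vintages fall between 1900 and 2099 and appear as a standalone number; the helper name `extract_vintage` and the last sample title are made up for illustration, while the first four titles come from the question.

```python
import re

# Assumption: a vintage is a standalone four-digit year from 1900 to 2099.
VINTAGE_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_vintage(title):
    """Return the first plausible vintage year in the title, or None."""
    match = VINTAGE_RE.search(title)
    return int(match.group()) if match else None


titles = [
    "Quinta dos Avidagos 2011 Avidagos Red (Douro)",
    "Rainstorm 2013 Pinot Gris (Willamette Valley)",
    "Louis M. Martini 2012 Cabernet Sauvignon",
    "Mirassou 2012 Chardonnay (Central Coast)",
    "Nonvintage Sparkling Brut",  # hypothetical title with no year
]

year_parse = [extract_vintage(t) for t in titles]
print(year_parse)  # [2011, 2013, 2012, 2012, None]
```

Returning `None` rather than 0 keeps missing vintages distinguishable from real years. With a pandas DataFrame, the same function could be applied with `df['title'].apply(extract_vintage)`.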

answered Dec 09 '25 20:12 by sahasrara62


