Help: Extracting data tuples from text... Regex or Machine learning?

I would really appreciate your thoughts on the best approach to the following problem. To give an idea, I am using a car classified-listing example, which is similar in nature to my actual data.

Problem: Extract a data tuple from the given text.

Here are some characteristics of the data.

  1. The vocabulary (words) in the text is limited to a specific domain. Let's assume 100-200 words at the most.

  2. The text to be parsed is a headline, like the car ad data shown below. Each record corresponds to one tuple (row).

  3. In some cases, some of the attributes may be missing. For example, in raw data row #5 below, the year is missing.

  4. Some words go together (bigrams), such as "low miles".

  5. Historical data available = 10,000 records

  6. Incoming New Data volume = 1000-1500 records / week

The expected output should be in the form (Year, Make, Model, Feature). So the output should look like this:

1 -> (2009, Ford, Fusion, SE)
2 -> (1997, Ford, Taurus, Wagon)
3 -> (2000, Mitsubishi, Mirage, DE)
4 -> (2007, Ford, Expedition, EL Limited)
5 -> ( , Honda, Accord, EX)
....
....

Raw Headline Data:


1 -> 2009 Ford Fusion SE - $7000
2 -> 1997 Ford Taurus Wagon - $800 (san jose east)
3 -> '00 Mitsubishi Mirage DE - $2499 (saratoga) pic
4 -> 2007 Ford Expedition EL Limited - $7800 (x)
5 -> Honda Accord ex low miles - $2800 (dublin / pleasanton / livermore) pic
6 -> 2004 HONDA ODASSEY LX 68K MILES - $10800 (danville / san ramon)
7 -> 93 LINCOLN MARK - $2000 (oakland east) pic
8 -> #######2006 LEXUS GS 430 BLACK ON BLACK 114KMI ####### - $19700 (san rafael) pic
9 -> 2004 Audi A4 1.8T FWD - $8900 (Sacramento) pic
10 -> #######2003 GMC C2500 HD EX-CAB 6.0 V8 EFI WHITE 4X4 ####### - $10575 (san rafael) pic
11 -> 1990 Toyota Corolla RUNS GOOD! GAS SAVER! 5SPEED CLEAN! REG 2011 O.B.O - $1600 (hayward / castro valley) pic img
12 -> HONDA ACCORD EX 2000 - $4900 (dublin / pleasanton / livermore) pic
13 -> 2009 Chevy Silverado LT Crew Cab - $23900 (dublin / pleasanton / livermore) pic
14 -> 2010 Acura TSX - V6 - TECH - $29900 (dublin / pleasanton / livermore) pic
15 -> 2003 Nissan Altima - $1830 (SF) pic


Possible choices:

  1. A machine learning Text Classifier (Naive Bayes etc)
  2. Regex

What I am trying to figure out is whether regex is too complicated for the job and whether a text classifier is overkill.

If the choice is to go with a text classifier, which one would you consider the easiest to implement?

Thanks in advance for your kind help.

Cyber Student asked Jun 12 '11 18:06


2 Answers

This is a well-studied problem called information extraction. It is not straightforward to do what you want, and it is not as simple as you make it sound (i.e. machine learning is not overkill). There are several techniques, so you should read an overview of the research area.

carlosdc answered Oct 28 '22 09:10


Check this IE library for writing extraction rules. I think it will work best for your problem. There is also an example of how to create fast dictionary matching.
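
To make the dictionary-matching idea concrete, here is a minimal Python sketch. It is not the API of the library mentioned above, just a plain illustration of rules plus lookups. The dictionaries `MAKES`, `MODELS` and `FEATURES`, the helper names, and the two-digit-year pivot (00-30 read as 20xx) are all assumptions made for the example; in practice the dictionaries would probably be built from the historical records.

```python
import re

# Hypothetical hand-built dictionaries (lowercase token -> canonical form).
MAKES = {"ford": "Ford", "honda": "Honda", "mitsubishi": "Mitsubishi"}
MODELS = {
    "Ford": {"fusion": "Fusion", "taurus": "Taurus", "expedition": "Expedition"},
    # Known misspellings can be mapped to the canonical model name.
    "Honda": {"accord": "Accord", "odyssey": "Odyssey", "odassey": "Odyssey"},
    "Mitsubishi": {"mirage": "Mirage"},
}
FEATURES = {"se": "SE", "de": "DE", "ex": "EX", "lx": "LX",
            "el": "EL", "limited": "Limited", "wagon": "Wagon"}

YEAR_TOKEN = re.compile(r"^'?(\d{4}|\d{2})$")  # 2009, 93, '00
PRICE_SPLIT = re.compile(r"\s-\s*\$")          # the " - $7000 (...)" tail


def parse_year(token):
    """Return a 4-digit year for a year-like token, else None."""
    m = YEAR_TOKEN.match(token)
    if not m:
        return None
    digits = m.group(1)
    if len(digits) == 4:
        year = int(digits)
        return year if 1900 <= year <= 2099 else None
    two = int(digits)
    # Assumed pivot: 00-30 -> 2000s, 31-99 -> 1900s.
    return 2000 + two if two <= 30 else 1900 + two


def extract(headline):
    """Return (year, make, model, feature) for one headline."""
    # Drop the price and location so "$7000" is never read as a year.
    title = PRICE_SPLIT.split(headline, maxsplit=1)[0]
    tokens = [t.strip("#!,.-") for t in title.split()]
    tokens = [t for t in tokens if t]

    year = make = model = None
    features = []
    for tok in tokens:
        low = tok.lower()
        if year is None and parse_year(tok) is not None:
            year = parse_year(tok)          # first year-like token wins
        elif make is None and low in MAKES:
            make = MAKES[low]
        elif make is not None and model is None and low in MODELS.get(make, {}):
            model = MODELS[make][low]       # model must belong to the make
        elif model is not None and low in FEATURES:
            features.append(FEATURES[low])  # features only after the model
    return (year, make, model, " ".join(features))
```

With these dictionaries, `extract("Honda Accord ex low miles - $2800 (dublin / pleasanton / livermore) pic")` returns `(None, "Honda", "Accord", "EX")`, and `extract("'00 Mitsubishi Mirage DE - $2499 (saratoga) pic")` returns `(2000, "Mitsubishi", "Mirage", "DE")`. Headlines with makes or trims missing from the dictionaries fall through with `None` or an empty feature, which is where a learned model could take over.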

yura answered Oct 28 '22 10:10