
Python - A way to learn and detect text patterns?

Problem:

I am given a long list of position titles for jobs in the IT industry (support or development), and I need to automatically categorize them by the general type of job they represent. For example, IT-support analyst, help desk analyst, etc. could all belong to the group IT-Support.

Current Approach:

Currently, I am manually building regex patterns to accomplish this, and they change whenever I encounter a new title that should be included in a group. For example, I originally used the pattern:

"(HELP|SERVICE) DESK"

to match IT-Support type jobs, and this eventually became:

"(HELP|SUPPORT|SERVICE) (DESK|ANALYST)"

which was even more inclusive.
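To make the difference between the two patterns concrete, here is a small sketch using Python's re module. The sample titles are made up for illustration, and the case-insensitive flag is an assumption (the patterns above are written in upper case).

```python
import re

# The two patterns quoted above. IGNORECASE is an assumption so that
# "Help Desk" and "HELP DESK" are treated the same.
NARROW = re.compile(r"(HELP|SERVICE) DESK", re.IGNORECASE)
BROAD = re.compile(r"(HELP|SUPPORT|SERVICE) (DESK|ANALYST)", re.IGNORECASE)

# Hypothetical sample titles, for illustration only.
titles = ["Help Desk Analyst", "IT Support Analyst", "Jr. Java Developer"]


def matching(pattern, items):
    """Return the items in which the pattern occurs anywhere in the string."""
    return [item for item in items if pattern.search(item)]


narrow_hits = matching(NARROW, titles)
broad_hits = matching(BROAD, titles)

print(narrow_hits)
print(broad_hits)
```

The broader pattern also catches "IT Support Analyst", which the first one misses; every such widening currently has to be done by hand.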

Question:

I feel like there should be a fairly intuitive way to build these regex patterns automatically with some sort of algorithm, but I have no idea how this might work. I've read about NLP briefly in the past, but it's extremely alien to me. Any suggestions on how I might implement such an algorithm, with or without NLP?

EDIT:

I'm considering using a decision tree, but it has some limitations which prevent it from working (in this situation) "out-of-the-box"; for example, if I have built the following tree:

(Service)->(Desk)->(Support) OR ->(Analyst) ...where Support and Analyst are both children of Desk

Say I get the string "Level-1 Service Desk Analyst"... This should be categorized using the decision tree above, but it will not inherently match the tree (since there is no root node named "Level" or "Level-1").

I believe I am heading in the right direction now, but I need additional logic. For example, if I am given the following hypothetical strings:

  1. IT Service Desk Analyst
  2. Level-1 Help Desk Analyst
  3. Computer Service Desk Support

I would like my algorithm to create something like below:

(Service OR Help)->(Desk)->(Analyst OR Support) ...where Service and Help are both root nodes, and both Analyst and Support are children of Desk

Basically, I need the following: I would like this matching algorithm to be able to reduce the strings it is presented with to a minimal number of sub-strings which effectively match all of the strings in a given category (preferably using a decision tree).
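One rough way to picture the desired reduction is sketched below. It is only a heuristic under stated assumptions, not a general solution and not a real decision tree: it assumes every title in a category shares at least one token, picks that token as an anchor, and collects the words found directly before and after it. The function name build_pattern is hypothetical.

```python
import re


def build_pattern(titles):
    """Build a regex from the words adjacent to a token shared by every title."""
    token_lists = [title.lower().split() for title in titles]

    # Tokens that occur in every title; the first one (in the order of the
    # first title) becomes the anchor.
    common = set(token_lists[0]).intersection(*token_lists[1:])
    if not common:
        raise ValueError("no token is shared by every title")
    anchor = next(tok for tok in token_lists[0] if tok in common)

    before, after = [], []
    before_optional = after_optional = False
    for tokens in token_lists:
        i = tokens.index(anchor)
        if i == 0:
            before_optional = True      # nothing precedes the anchor here
        elif tokens[i - 1] not in before:
            before.append(tokens[i - 1])
        if i == len(tokens) - 1:
            after_optional = True       # nothing follows the anchor here
        elif tokens[i + 1] not in after:
            after.append(tokens[i + 1])

    def group(words, optional, template):
        """Render an alternation group, optional if some title lacks it."""
        if not words:
            return ""
        alternation = "(" + "|".join(re.escape(w) for w in words) + ")"
        piece = template.format(alternation)
        return "(?:" + piece + ")?" if optional else piece

    text = (group(before, before_optional, "{} ")
            + re.escape(anchor)
            + group(after, after_optional, " {}"))
    return re.compile(text, re.IGNORECASE)


sample = [
    "IT Service Desk Analyst",
    "Level-1 Help Desk Analyst",
    "Computer Service Desk Support",
]
pattern = build_pattern(sample)
print(pattern.pattern)
```

For the three hypothetical strings above, this yields a pattern with Service and Help before Desk, and Analyst and Support after it. Leading words such as "IT" or "Level-1" are ignored because the pattern is searched for anywhere in the title rather than matched from the start.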

If I am not being clear enough, just let me know!

araisbec asked Feb 01 '14 16:02




1 Answer

Well, setting a bounty allowed me to learn a lot of new material surrounding this topic, but ultimately I am answering my own question.

I have decided to go with the Pattern module for Python and use its Naive Bayes classifier.

As the user manually classifies positions, a csv file is generated one line at a time:

"Help Desk Analyst", "Help Desk" "Service Desk", "Help Desk", "Jr. Java Developer", "Java Development" ...etc.

My algorithm looks like this (taken from http://www.clips.ua.ac.be/pages/pattern-vector#classification):

>>> from pattern.vector import Document, NB
>>> from pattern.db import csv
>>>
>>> nb = NB()
>>> # Train on every (text, label) row of the CSV file.
>>> for review, rating in csv('reviews.csv'):
>>>     v = Document(review, type=int(rating), stopwords=True)
>>>     nb.train(v)
>>>
>>> # Show the known classes, then classify a new document.
>>> print(nb.classes)
>>> print(nb.classify(Document('A good movie!')))

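...where review and rating correspond to position_text and position_group respectively (the group is a text label such as "Help Desk", so the int() conversion from the example does not apply to it). Classifier data is saved from one search (and execution of the program) to the next.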

Each time the user searches, the algorithm is run (with all previous classifications being taken into account), and the program classifies the positions that are returned with its best guesses. Obviously, the more positions are grouped, the more accurate these guesses become.
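For readers without Pattern installed, the idea behind the classifier can be sketched with the standard library alone. This is not Pattern's implementation or API; it is a minimal multinomial Naive Bayes with add-one smoothing, written for Python 3, and the class name TitleClassifier is hypothetical. The training rows are the three example rows from the CSV above.

```python
import math
from collections import Counter, defaultdict


class TitleClassifier:
    """Minimal multinomial Naive Bayes over whitespace-separated words."""

    def __init__(self):
        self.group_counts = Counter()            # number of titles per group
        self.word_counts = defaultdict(Counter)  # word frequencies per group
        self.vocabulary = set()                  # every word seen in training

    def train(self, title, group):
        words = title.lower().split()
        self.group_counts[group] += 1
        self.word_counts[group].update(words)
        self.vocabulary.update(words)

    def classify(self, title):
        """Return the most probable group, or None if nothing was trained."""
        # Words never seen during training carry no information; skip them.
        words = [w for w in title.lower().split() if w in self.vocabulary]
        total_titles = sum(self.group_counts.values())
        best_group, best_score = None, float("-inf")
        for group, count in self.group_counts.items():
            # Log prior plus smoothed log likelihood of each known word.
            score = math.log(count / total_titles)
            denominator = (sum(self.word_counts[group].values())
                           + len(self.vocabulary))
            for word in words:
                score += math.log(
                    (self.word_counts[group][word] + 1) / denominator)
            if score > best_score:
                best_group, best_score = group, score
        return best_group


classifier = TitleClassifier()
for text, group in [
    ("Help Desk Analyst", "Help Desk"),
    ("Service Desk", "Help Desk"),
    ("Jr. Java Developer", "Java Development"),
]:
    classifier.train(text, group)

guess = classifier.classify("Level-1 Service Desk Analyst")
print(guess)
```

Because unknown words such as "Level-1" are simply skipped, a title with an unfamiliar prefix is still classified from the words the classifier has already seen.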

The next step that I will implement to make this more robust will be to upload user classification data to a central server, which all instances of this software can download from automatically. This way, every user (who willingly contributes data to the project) will contribute to training this software's classification system, and over time, it will become very robust.

araisbec answered Oct 20 '22 00:10
