
How to classify/categorize strings according to regular expression rules in Python

I am writing an ETL script in Python that reads data from CSV files, validates and sanitizes it, categorizes each row according to some rules, and finally loads it into a PostgreSQL database.

The data looks like this (simplified):

ColA, ColB, Timestamp, Timestamp, Journaltext, AmountA, AmountB

Each row is a financial transaction. What I want to do is to categorize or classify transactions based on some rules. The rules are basically regular expressions that match the text in Journaltext column.

So what I want to do is something like this:

transactions = []
for row in rows:
    t = Transaction(category=classify(row.journaltext))
    transactions.append(t)

I am not sure how to write the classify() function efficiently.

This is how the rules for classification work:

  • There are a number of categories (more can and will be added later)
  • Each category has a set of substrings or regular expressions. If the Journaltext of a transaction matches one of these expressions or contains one of these substrings, the transaction belongs to that category.
  • A transaction can only be in one category.
  • If a category, FOO, has substrings 'foo' and 'Foo', and another category, BAR, has the substring 'football', then a transaction with Journaltext='food' must be put in category FOO, because it only matches FOO, but a transaction with Journaltext='footballs' must be placed in category BAR. I think this means that I have to put a priority or similar on each category.
  • If a transaction does not match any of the expressions, it is either None in category or will be put in a placeholder category called "UNKNOWN" or similar. This does not matter much.

OK. So how do I represent these categories and their corresponding rules in Python?

I would really appreciate your input. Even if you cannot provide a full solution. Just anything to hint me in the right direction will be great. Thanks.

ervingsb asked Mar 08 '12 19:03


2 Answers

What about this solution in Python (the regex strings are placeholders):

import re

def classify(journaltext):
    # Categories in descending priority. This is a placeholder: you have to give the full list here.
    prio_list = ["FOO", "BAR", "UPS"]
    # dictionary:
    # - key is the name of the category, must match the name in the above prio_list
    # - value is the regex that identifies the category
    matchers = {
        "FOO": "the regex for FOO",
        "BAR": "the regex for BAR",
        "UPS": "the regex for UPS",
        # ... one entry for every category in prio_list
    }
    for category in prio_list:
        # re.search looks anywhere in the text; re.match would only match at the start
        if re.search(matchers[category], journaltext):
            return category
    return "UNKNOWN"  # or you can "return None"

Features:

  • this has a prio_list, which holds all the categories in descending order of priority.
  • it tries to match in the order of the list.
  • It is matched against a regex from the matchers dictionary. So the category names can be arbitrary.
  • the function returns the name of the category
  • if nothing matches, then you get your placeholder category name.
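
As a concrete illustration of the list above, here is a minimal sketch using the FOO and BAR categories from the question. The patterns and the RULES name are assumptions made for this example, not part of the original answer. BAR is listed first so that 'footballs' is not captured by the more general FOO pattern.

```python
import re

# Hypothetical rules, highest priority first. Patterns are compiled once
# so they are not re-parsed for every row.
RULES = [
    ("BAR", re.compile(r"football")),
    ("FOO", re.compile(r"foo|Foo")),
]

def classify(journaltext):
    """Return the first category whose pattern occurs in journaltext."""
    for category, pattern in RULES:
        if pattern.search(journaltext):
            return category
    return "UNKNOWN"

print(classify("food"))       # FOO
print(classify("footballs"))  # BAR
print(classify("groceries"))  # UNKNOWN
```

Keeping the rules in one ordered list of pairs avoids having to keep a separate priority list and dictionary in sync; whether that trade-off suits your data is a judgment call.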

You can even read the prioritized category list and the regexes from a configuration file, but this is left as an exercise to the reader...
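
One possible way to approach that exercise is sketched below, assuming the rules live in a JSON file whose entries are listed in priority order. The file layout, the CONFIG_TEXT sample, and the load_rules helper are all hypothetical; the sample text stands in for the contents of a real file.

```python
import json
import re

# Hypothetical file content; in practice this would be read from disk.
CONFIG_TEXT = """
[
  {"category": "BAR", "patterns": ["football"]},
  {"category": "FOO", "patterns": ["foo", "Foo"]}
]
"""

def load_rules(config_text):
    """Parse JSON into (category, compiled regex) pairs; file order is the priority."""
    rules = []
    for entry in json.loads(config_text):
        # Combine a category's alternatives into a single regex.
        combined = "|".join("(?:%s)" % p for p in entry["patterns"])
        rules.append((entry["category"], re.compile(combined)))
    return rules

def classify(journaltext, rules, default="UNKNOWN"):
    """Return the first category whose regex occurs in journaltext."""
    for category, pattern in rules:
        if pattern.search(journaltext):
            return category
    return default

rules = load_rules(CONFIG_TEXT)
print(classify("footballs", rules))  # BAR
```

A JSON array is used here rather than an object because array order is guaranteed, which keeps the priority explicit in the file itself.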

Jörg Beyer answered Sep 29 '22 01:09


Without any kind of extra fluff:

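# Order matters: the first matching category wins, so list the more
# specific substrings ('football') before the more general ones ('foo').
categories = [
  ('cat2', ['football']),
  ('cat1', ['foo']),
  ('cat3', ['abc', 'aba', 'bca'])
]

def classify(text):
  for category, matches in categories:
    if any(match in text for match in matches):
      return category
  return None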

In Python you can use the in operator to test whether one string is a substring of another. You could add a check such as isinstance(match, str) to tell whether a rule is a simple string or a compiled regular expression object. How advanced it becomes is up to you.
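
The isinstance idea could look roughly like the following sketch, which lets a category mix plain substrings and compiled regular expressions. The matches helper and the example rules are assumptions for illustration only.

```python
import re

# Hypothetical rules mixing plain substrings and compiled regexes.
# The first matching category wins, so order encodes priority.
categories = [
    ("cat2", ["football"]),
    ("cat1", ["foo", re.compile(r"^Foo\b")]),
]

def matches(rule, text):
    """Substring test for str rules, regex search for compiled patterns."""
    if isinstance(rule, str):
        return rule in text
    return rule.search(text) is not None

def classify(text):
    for category, rules in categories:
        if any(matches(rule, text) for rule in rules):
            return category
    return None

print(classify("footballs"))  # cat2
print(classify("Foo bar"))    # cat1
print(classify("xyz"))        # None
```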

g.d.d.c answered Sep 29 '22 01:09