I am writing an ETL script in Python that reads data from CSV files, validates and sanitizes it, categorizes each row according to some rules, and finally loads it into a PostgreSQL database.
The data looks like this (simplified):
ColA, ColB, Timestamp, Timestamp, Journaltext, AmountA, AmountB
Each row is a financial transaction. I want to categorize transactions based on some rules. The rules are basically regular expressions that match the text in the Journaltext column.
So what I want to do is something like this:
    transactions = []
    for row in rows:
        t = Transaction(category=classify(row.journaltext))
        transactions.append(t)
I am not sure how to write the classify() function efficiently.
This is how the rules for classification work:
OK. So how do I represent these categories and their corresponding rules in Python?
I would really appreciate your input, even if you cannot provide a full solution. Anything that points me in the right direction would be great. Thanks.
What about this solution in pseudo-Python:
    import re

    def classify(journaltext):
        # "..." is a placeholder: you have to give the full list here.
        prio_list = ["FOO", "BAR", "UPS", ...]
        # dictionary:
        # - key is the name of the category, must match the name in the above prio_list
        # - value is the regex that identifies the category
        matchers = {"FOO": "the regex for FOO", "BAR": "the regex for BAR", "UPS": "...", ...}
        for category in prio_list:
            # re.match only matches at the start of the string; use re.search to match anywhere
            if re.match(matchers[category], journaltext):
                return category
        return "UNKNOWN"  # or you can "return None"
Features:
You can even read the prioritized category list and the regexes from a configuration file, but this is left as an exercise for the reader...
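For what it's worth, here is a minimal sketch of the configuration-file idea. The JSON layout, the names RULES_JSON and load_rules, and the example patterns are all my own assumptions for illustration, not part of the answer above. The rules are kept as an ordered list so that priority is simply the order in the file, and the sketch uses search() so that a pattern can match anywhere in the text.

```python
import json
import re

# Hypothetical config contents: an ordered list, highest priority first.
# In a real script this text would come from a file, e.g. via open(path).read().
# The category names and patterns are placeholders, not real rules.
RULES_JSON = """
[
  {"category": "FOO", "pattern": "foo"},
  {"category": "BAR", "pattern": "bar[0-9]+"}
]
"""


def load_rules(config_text):
    """Parse the JSON text into a list of (category, compiled regex) pairs."""
    return [
        (rule["category"], re.compile(rule["pattern"]))
        for rule in json.loads(config_text)
    ]


def classify(journaltext, rules):
    """Return the first category whose regex is found in the text, else None."""
    for category, regex in rules:
        if regex.search(journaltext):
            return category
    return None


# Compile once, then reuse the same rules for every row.
rules = load_rules(RULES_JSON)
```

Compiling the patterns once up front means the per-row work is just the search calls, which should help if you have many rows.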
Without any kind of extra fluff:
    categories = [
        ('cat1', ['foo']),
        ('cat2', ['football']),
        ('cat3', ['abc', 'aba', 'bca'])
    ]

    def classify(text):
        for category, matches in categories:
            if any(match in text for match in matches):
                return category
        return None
In Python you can use the in operator to test whether one string is a substring of another. You could add a check such as isinstance(match, str) to tell whether a rule is a plain string or a compiled regular expression object. How advanced it becomes is up to you.
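A rough sketch of that isinstance idea, in case it helps. The helper name matches_rule and the example rules are made up for illustration. Plain strings are tested with in, and anything else is assumed to be a compiled pattern and tested with search().

```python
import re

# Placeholder categories; list order defines priority (first match wins).
# A rule is either a plain substring or a compiled regular expression.
categories = [
    ('cat1', [re.compile(r'\bfoo\b')]),  # "foo" as a whole word only
    ('cat2', ['football']),
    ('cat3', ['abc', 'aba', 'bca']),
]


def matches_rule(rule, text):
    """Substring test for plain strings, regex search for compiled patterns."""
    if isinstance(rule, str):
        return rule in text
    return rule.search(text) is not None


def classify(text):
    """Return the first category with at least one matching rule, else None."""
    for category, rules in categories:
        if any(matches_rule(rule, text) for rule in rules):
            return category
    return None
```

With the word-boundary regex in cat1, a text containing "football" falls through to cat2 instead of being caught by the plain 'foo' substring.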