I'm trying to work out how to use a machine learning library to help me find out what the correct weighting for each parameter should be in order to make a good decision.
In more detail:
Context: trying to implement a date of publication extractor for html files. This is for news sites, so I don't have a generic date format that I can use. I'm using the parser in dateutil in python, which does a pretty good job. I end up with a list of possible publication dates (all the dates in the html file).
From a set of parameters, such as nearby tags, words close to the date substring, etc., I sort the list according to likelihood of being the publication date. The weighting for each parameter is currently just an educated guess; a rough sketch of what I do now is below.
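(The feature names and weights in this sketch are invented placeholders, not my real ones; it's only meant to show the shape of the current hand-tuned scoring.)

# Hypothetical sketch of the current hand-weighted scoring; names and weights are placeholders.
WEIGHTS = {
    "near_pub_keyword": 3.0,    # e.g. "published"/"posted" appears near the date string
    "inside_meta_tag": 2.0,     # the date was found inside a <meta> tag
    "chars_from_top": -0.001,   # penalize dates that appear far down the page
}

def score(features):
    # features maps each parameter name to a numeric value for one candidate date
    return sum(WEIGHTS[name] * value for name, value in features.items())

# candidates: list of (parsed_date, features) pairs; sort best-first by score
candidates = [("2011-09-04", {"near_pub_keyword": 1, "inside_meta_tag": 0, "chars_from_top": 1200}),
              ("2010-05-17", {"near_pub_keyword": 0, "inside_meta_tag": 1, "chars_from_top": 4500})]
candidates.sort(key=lambda c: score(c[1]), reverse=True)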
I would like to implement a machine learning algorithm that, after a training period (in which the actual publication date is provided), determines what the weighting for each parameter should be.
I've been reading the documentation of different machine learning libraries in python (pyML, scikit-learn, pybrain), but I haven't found anything useful. I've also read this, and the closest example there is determining whether a mushroom is edible or not.
Note: I'm working in python.
I would very much appreciate your help.
Given your problem description, the characteristics of your data, and your ML background and personal preferences, I would recommend Orange.
Orange is a mature, free and open source project with a large selection of ML algorithms and excellent documentation and training materials. Most users probably use the GUI supplied with Orange, but the framework is scriptable with Python.
Using this framework will therefore let you quickly experiment with a variety of classifiers because (i) they are all in one place; and (ii) each is accessed through a common configuration GUI. All of the ML techniques within the Orange framework can be run in "demo" mode on one or more sample data sets supplied with the Orange install. The documentation supplied with the Orange install is excellent. In addition, the home page includes links to numerous tutorials that cover probably every ML technique included in the framework.
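If you prefer scripting to the GUI, a minimal training session looks roughly like the sketch below. This is only a sketch: Orange's scripting API has changed across versions, so the calls shown follow the Orange 3 style and should be checked against the current docs.

# Sketch of scripting Orange (Orange 3 style; exact API may differ by version).
import Orange

# "iris" is one of the sample data sets shipped with Orange; substitute your own .tab/.csv table
data = Orange.data.Table("iris")

learner = Orange.classification.TreeLearner()   # decision-tree learner
model = learner(data)                           # train on the full table

# 5-fold cross-validation and classification accuracy
results = Orange.evaluation.CrossValidation(data, [learner], k=5)
print("CA:", Orange.evaluation.CA(results))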
Given your problem, perhaps begin with a Decision Tree algorithm (either C4.5 or ID3 implementation). A fairly recent edition of Dr. Dobbs Journal (online) includes an excellent article on using decision trees; the use case is web server data (from the server access log).
Orange has a C4.5 implementation, available from the GUI (as a "widget"). If that's too easy, about 100 lines is all it takes to code one in Python; here's the source for a working implementation in that language.
I recommend starting with a Decision Tree for several reasons.
If it works on your data, you will not only have a trained classifier, but you will also have a visual representation of the entire classification schema (represented as a binary tree). Decision Trees are (probably) unique among ML techniques in this respect.
The characteristics of your data are aligned with the optimal performance scenario of C4.5: the data can be either categorical or continuous variables (though this technique performs better when more of the features (columns/fields) are discrete rather than continuous, which seems to describe your data); also, Decision Tree algorithms can accept incomplete data points without any pre-processing.
Simple data pre-processing. The data fed to a decision tree algorithm does not require as much data pre-processing as most other ML techniques; pre-processing is often (usually?) the most time-consuming task in the entire ML workflow. It's also sparsely documented, so it's probably also the most likely source of error.
You can deduce the (relative) weight of each variable from each node's distance from the root--in other words, from a quick visual inspection of the trained classifier. Recall that the trained classifier is just a binary tree (and is often rendered this way) in which the nodes correspond to one value of one feature (variable, or column in your data set); the two edges joined to that node of course represent the data points split into two groups based on each point's value for that feature (e.g., if the feature is the categorical variable "Publication Date in HTML Page Head?", then through the left edge will flow all data points in which the publication date is not within the opening and closing head tags, and the right edge gets the other group). What is the significance of this? Since a node just represents a state or value for a particular variable, that variable's importance (or weight) in classifying the data can be deduced from its position in the tree--i.e., the closer it is to the root node, the more important it is (see the sketch just below this list for reading these weights off programmatically).
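If you'd rather read those relative weights off as numbers than eyeball the tree, most tree implementations expose them directly. Here's a small sketch using scikit-learn, which you already looked at--note that its DecisionTreeClassifier is CART rather than C4.5, and that every feature name and value below is invented for illustration.

# Sketch: reading relative variable weights off a trained tree.
# Uses scikit-learn's DecisionTreeClassifier (CART, not C4.5); data is made up.
from sklearn.tree import DecisionTreeClassifier

feature_names = ["in_head_tag", "near_pub_keyword", "days_from_earliest_candidate"]

# each row is one candidate date's feature vector; y holds the known class labels
X = [[1, 1, 0],
     [0, 1, 3],
     [0, 0, 7],
     [1, 0, 1]]
y = ["classI", "classI", "classII", "classI"]

clf = DecisionTreeClassifier().fit(X, y)

# feature_importances_ plays the same role as "distance from the root":
# larger values mean the variable mattered more in splitting the data
for name, importance in zip(feature_names, clf.feature_importances_):
    print(name, round(importance, 3))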
From your Question, it seems you have two tasks to complete before you can feed your training data to a ML classifier.
I. identify plausible class labels
What you want to predict is a date. Unless your resolution requirements are unusually strict (e.g., resolved to a single date), I would build a classification model (which returns a class label given a data point) rather than a regression model (which returns a single continuous value).
Given that your response variable is a date, a straightforward approach is to set the earliest date as the baseline (0), then represent all other dates as integer values giving their distance from that baseline. Next, discretize all dates into a small number of ranges. One very simple technique for doing this is to calculate the five summary descriptive statistics for your response variable (min, 1st quartile, mean, 3rd quartile, and max). From these five statistics, you get four sensibly chosen date ranges (though probably not of equal span or of equal membership size).
These four ranges of date values then become your class labels--so for instance, classI might be all data points (web pages, I suppose) whose response variable (publication date) is 0 to 10 days after 0; classII is 11 days after 0 to 25 days after 0, etc.
[Note: added the code below in light of the OP's comment below this answer, requesting clarification.]
# suppose these are publication dates
>>> pd0 = "04-09-2011"
>>> pd1 = "17-05-2010"
# convert them to python datetime instances, e.g.,
>>> from datetime import datetime
>>> pd0 = datetime.strptime(pd0, "%d-%m-%Y")
>>> pd1 = datetime.strptime(pd1, "%d-%m-%Y")
# gather them in a python list and then call sort on that list:
>>> pd_all = [pd0, pd1, pd2, pd3, ...]
>>> pd_all.sort()
# 'sort' will perform an in-place sort on the list of datetime objects,
# such that the earliest date is at index 0, etc.
# now the first item in that list is of course the earliest publication date
>>> pd_all[0]
datetime.datetime(2010, 5, 17, 0, 0)
# express all dates except the earliest one as the absolute difference in days
# from that earliest date
>>> td0 = pd_all[1] - pd_all[0] # td0 is a timedelta object
>>> td0
datetime.timedelta(475)
# convert the time deltas to integers (number of whole days):
>>> fnx = lambda v : v.days
>>> time_deltas = [td0,....]
# d is just a python list of integers representing number of days from a common baseline date
>>> d = list(map(fnx, time_deltas))
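To finish part I, you can then cut those integer day offsets into the four quartile-based ranges described above and attach class labels. A sketch (the day offsets here are placeholder values standing in for the list d built above):

# Sketch: turn integer day offsets into four class labels using the
# five-number summary (min, 1st quartile, mean, 3rd quartile, max).
import numpy as np

d = np.array([0, 3, 12, 47, 180, 475])   # placeholder day offsets
cut_points = [d.min(), np.percentile(d, 25), d.mean(), np.percentile(d, 75), d.max()]

def class_label(days):
    # returns classI..classIV depending on which of the four ranges 'days' falls in
    for i in range(1, 4):
        if days <= cut_points[i]:
            return "class" + "I" * i      # classI, classII, classIII
    return "classIV"

labels = [class_label(days) for days in d]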
II. convert your raw data to an "ML-useable" form.
For a C4.5 classifier, this task is far simpler and requires fewer steps than for probably every other ML algorithm. What's preferred here is to discretize as many of your parameters as possible into a relatively small number of values--e.g., if one of your parameters/variables is "distance of the publication date string from the closing body tag", then I would suggest discretizing those values into ranges, much as marketing surveys often ask participants to report their age in one of a specified set of spans (18 - 35; 36 - 50, etc.) rather than as a single integer (41). A sketch follows.
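For instance, binning that hypothetical "distance from the closing body tag" parameter into a few labeled ranges might look like the sketch below (the cut points and labels are invented for illustration):

# Sketch: discretize a continuous parameter into a few labeled ranges,
# analogous to the age-span example above. Cut points are invented.
import bisect

cut_points = [50, 200, 1000]          # character-distance thresholds (hypothetical)
range_labels = ["very_close", "close", "far", "very_far"]

def discretize(distance_from_closing_body_tag):
    return range_labels[bisect.bisect_right(cut_points, distance_from_closing_body_tag)]

print(discretize(30))    # -> "very_close"
print(discretize(450))   # -> "far"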
Assuming you need machine learning (document set is sufficiently large, number of news sites is large enough that writing parsers on a per-site basis is unwieldy, URLs don't contain any obvious publication date markers, HTTP Last-Modified headers are unreliable, etc.) - you might consider an approach like: