Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm to handle data aggregation from multiple error-prone sources

I'm aggregating concert listings from several different sources, none of which are both complete and accurate. Some of the data comes from users (such as on last.fm), and may be incorrect. Other data sources are highly accurate, but may not contain every event. I can use attributes such as the event date, and the city/state to try to match listings from disparate sources. I'd like to be reasonably certain that the events are valid. It seems like it would be a good strategy to consume as many different sources as possible to validate listings on error-prone sources.

I'm not sure what the technical term for this is, as I'd like to research it further. Is it data mining? Are there any existing algorithms? I understand a solution will never be completely accurate.

like image 792
Matt Green Avatar asked May 25 '11 03:05

Matt Green


1 Answers

Here is an approach that locates it within statistics - specifically, it uses a Hidden Markov Model (http://en.wikipedia.org/wiki/Hidden_Markov_model):

1) Use your matching process to produce a cleaned list of possible events. Consider each event to be marked "true" or "bogus", even though the markings are hidden from you. You might imagine that some source of events produces them, generating them as either "true" or "bogus" according to a probability which is an unknown parameter.

2) Associate unknown parameters with each source of listings. These give the probability that this source will report a true event produced by the source of events, and the probability that it will report a bogus event produced by the source.

3) Notice that if you could see the markings of "true" or "bogus" you could easily work out the probabilities for each source. Unfortunately, of course, you can't see these hidden markings.

4) Let's call these hidden markings "Latent Variables" because then you can use the http://en.wikipedia.org/wiki/Em_algorithm to hillclimb to promising solutions for this problem, from random starts.

5) You can obviously make the problem more complicated by dividing events up into classes, and giving sources of listing parameters which make them more likely to report some classes of events than others. This might be useful if you have sources that are extremely reliable for some sorts of events.

like image 127
mcdowella Avatar answered Oct 21 '22 21:10

mcdowella