Predicting missing data values in a database

Tags: algorithm, math, statistics

I have a database, consisting of a whole bunch of records (around 600,000) where some of the records have certain fields missing. My goal is to find a way to predict what the missing data values should be (so I can fill them in) based on the existing data.

One option I am looking at is clustering, i.e. representing the complete records as points in some space, looking for clusters of points, and then, when given a record with missing data values, trying to find out whether there are any clusters it could belong to that are consistent with its existing data values. However, this may not be possible because some of the data fields are on a nominal scale (e.g. color) and thus can't be put in order.

Another idea I had is to create some sort of probabilistic model that would predict the data, train it on the existing data, and then use it to extrapolate.

What algorithms are available for doing the above, and is there any freely available software that implements those algorithms? (This software is going to be in C#, by the way.)

asked Jul 23 '09 17:07

Alex319

2 Answers

This is less of an algorithmic and more of a philosophical and methodological question. There are a few different techniques available to tackle this kind of problem. Acock (2005) gives a good introduction to some of the methods. Although it may seem that there is a lot of math/statistics involved (and that it is a lot of effort), it's worth thinking about what would happen if you got it wrong.

Andrew Gelman's blog is also a good resource, although the search functionality on his blog leaves something to be desired...

Hope this helps.

Acock (2005) - http://oregonstate.edu/~acock/growth-curves/working%20with%20missing%20values.pdf
Andrew Gelman's blog - http://www.stat.columbia.edu/~cook/movabletype/mlm/

answered Sep 22 '22 10:09

David Lawrence Miller

Dealing with missing values is a methodological question that has to do with the actual meaning of the data.

Several methods you can use (detailed post on my blog):

1. Ignore the data row. This is usually done when the class label is missing (assuming your data mining goal is classification), or when many attributes are missing from the row (not just one). However, you'll obviously get poor performance if the percentage of such rows is high.
2. Use a global constant to fill in for missing values, like "unknown", "N/A" or minus infinity. This is used because sometimes it just doesn't make sense to try and predict the missing value. For example, if you have a DB of, say, college candidates and state of residence is missing for some, filling it in doesn't make much sense...
3. Use the attribute mean. For example, if the average income of a US family is X, you can use that value to replace missing income values.
4. Use the attribute mean for all samples belonging to the same class. Let's say you have a car pricing DB that, among other things, classifies cars as "Luxury" or "Low budget", and you're dealing with missing values in the cost field. Replacing the missing cost of a luxury car with the average cost of all luxury cars is probably more accurate than the value you'd get if you factored in the low budget cars.
5. Use a data mining algorithm to predict the value. The value can be determined using regression, inference-based tools using the Bayesian formalism, decision trees, or clustering algorithms used to generate input for method #4 (K-Means/K-Median etc.). I'd suggest looking into regression and decision trees first (ID3 tree generation), as they're relatively easy and there are plenty of examples on the net.
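To make the attribute-mean and per-class-mean approaches concrete, here is a minimal sketch in Python (not C#, purely for illustration). The function name impute_class_mean, the field names "segment" and "cost", and the sample numbers are all made up for this example; missing values are assumed to be stored as None.

```python
from collections import defaultdict
from statistics import mean

def impute_class_mean(rows, field, class_field):
    """Fill missing (None) values of `field` with the mean of that field
    among rows of the same class; fall back to the overall mean."""
    known = [r for r in rows if r[field] is not None]
    if not known:
        raise ValueError("no observed values to impute from")

    # Plain attribute mean over every row that has a value.
    global_mean = mean(r[field] for r in known)

    # Per-class attribute mean.
    by_class = defaultdict(list)
    for r in known:
        by_class[r[class_field]].append(r[field])
    class_means = {c: mean(values) for c, values in by_class.items()}

    filled = []
    for r in rows:
        r = dict(r)  # copy, so the input rows are left untouched
        if r[field] is None:
            # Use the class mean when the class was seen, else the overall mean.
            r[field] = class_means.get(r[class_field], global_mean)
        filled.append(r)
    return filled

# Hypothetical sample data.
cars = [
    {"segment": "Luxury", "cost": 90000},
    {"segment": "Luxury", "cost": 110000},
    {"segment": "Low budget", "cost": 12000},
    {"segment": "Low budget", "cost": 18000},
    {"segment": "Luxury", "cost": None},
]
filled = impute_class_mean(cars, "cost", "segment")
print(filled[-1]["cost"])
```

With this sample, the missing luxury cost is filled from the two luxury cars only, rather than being dragged down by the low budget ones. The same logic should translate to C# or to a SQL GROUP BY with little change.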

As for packages, if you can afford it and you're in the Microsoft world, look at SQL Server Analysis Services (SSAS for short), which implements most of the methods mentioned above.

Here are some links to free data mining software packages:

WEKA - http://www.cs.waikato.ac.nz/ml/weka/index.html
ORANGE - http://www.ailab.si/orange
TANAGRA - http://eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra.html

Although not C#, here's a pretty good intro to decision trees and Bayesian learning (using Ruby): http://www.igvita.com/2007/04/16/decision-tree-learning-in-ruby/ http://www.igvita.com/2007/05/23/bayes-classification-in-ruby/

There's also this Ruby library that I find very useful (also for learning purposes): http://ai4r.rubyforge.org/machineLearning.html

There should be plenty of samples for these algorithms online in any language so I'm sure you'll easily find C# stuff too...
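As one such sample, here is a rough sketch of predicting a missing nominal field with a naive Bayes classifier. It is written in Python rather than C#, and only as an illustration: the function names (train_naive_bayes, predict), the field names, and the toy records are assumptions for this example. It treats every field as nominal, so nothing needs to be put in order, and uses add-one smoothing.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, target, features):
    """Collect the counts needed for naive Bayes from complete rows."""
    target_counts = Counter()              # target value -> number of rows
    feature_counts = defaultdict(Counter)  # (feature, target value) -> value counts
    feature_values = defaultdict(set)      # feature -> distinct values seen
    for r in rows:
        t = r[target]
        target_counts[t] += 1
        for f in features:
            feature_counts[(f, t)][r[f]] += 1
            feature_values[f].add(r[f])
    return target_counts, feature_counts, feature_values

def predict(model, row, features):
    """Return the most probable target value given the row's known fields."""
    target_counts, feature_counts, feature_values = model
    total = sum(target_counts.values())
    best, best_score = None, 0.0
    for t, n in target_counts.items():
        score = n / total  # prior P(target = t)
        for f in features:
            if row.get(f) is None:
                continue  # a feature that is itself missing is simply skipped
            k = len(feature_values[f])
            # Add-one smoothing keeps unseen combinations from zeroing the score.
            score *= (feature_counts[(f, t)][row[f]] + 1) / (n + k)
        if score > best_score:
            best, best_score = t, score
    return best

# Hypothetical complete records to train on.
records = [
    {"color": "red", "size": "large", "type": "truck"},
    {"color": "red", "size": "large", "type": "truck"},
    {"color": "blue", "size": "small", "type": "car"},
    {"color": "blue", "size": "large", "type": "car"},
    {"color": "red", "size": "small", "type": "car"},
]
model = train_naive_bayes(records, "type", ["color", "size"])
guess = predict(model, {"color": "red", "size": "large", "type": None}, ["color", "size"])
print(guess)
```

Numeric fields would need to be binned first or modelled with a distribution; that part is left out here.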

Edited:

Forgot this in my original post. This is definitely a MUST HAVE if you're playing with data mining... Download the Microsoft SQL Server 2008 Data Mining Add-ins for Microsoft Office 2007 (it requires SQL Server Analysis Services, SSAS, which isn't free, but you can download a trial).

This will allow you to easily play and try out the different techniques in Excel before you go and implement this stuff yourself. Then again, since you're in the Microsoft ecosystem, you might even decide to go for an SSAS based solution and count on the SQL Server guys to do it for ya :)

answered Sep 19 '22 10:09

Eran Kampf
