
References for data normalization

What are the best practices on normalizing data (not sure if that is the right term) for NNs and other machine learning algorithms? What I mean is how you represent data to the NN/algo.

For instance, how do you represent a store code? Store 555 isn't greater or less than store 554; it is just a classification. Do NNs/algorithms just filter that out on their own, or do you need to prod them into making a categorical rather than a mathematical distinction?

Thanks for any help in directing me to appropriate information. I am obviously new to this.

EDIT: Thanks to everyone for the answers. I have been digging through quite a few data mining books and while I have found a few that spend a chapter or two on the topic of data pre-processing I am a little surprised at how most gloss over it entirely. Thanks again.

ValenceElectron asked Apr 13 '11

2 Answers

I have never found anything approaching a comprehensive resource on the topic of 'data pre-processing'.

Your question is directed at the essential predicate step in Machine Learning of identifying each variable in your data (variables just refer to the fields in your SQL tables or the columns in your data matrix) as either continuous or discrete. Discrete variables are also referred to as Factors or Categorical variables. (There is a third type, time, usually a specialized data type in your language of choice, which is a genuine hybrid of the first two.)

One source I can recommend (by no means the best, or even the best I have read; it's just a title I can recall from memory that describes the issue at hand reasonably well and provides some educated guidance) is:

Statistics in a Nutshell by Sarah Boslaugh & Paul Andrew Watters, O'Reilly (Ch. 10, Categorical Variables)

Discrete/Categorical Variables

Categorical variables ('Factors' in R) are variables like Sex (values: male/female), State of Residence (e.g., Vermont, Idaho, etc.), Eye Color, and, from your Question, Store Number. The Store Number might be 555, but you should probably record it as a string rather than an integer or a float (i.e., so that the algorithm treats the value 555 just as if it were "five fifty five"). If you are doing your work on a statistics platform (e.g., SAS, SPSS, R), then the platform will provide specific guidance; in R, for instance, it's common to set Store Number as a factor when you import the data.
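To make that concrete, here is a minimal sketch in plain Python. The store codes (555, 554, 210) are made-up examples; the point is just that each code becomes a label with one indicator column per distinct value, so no ordering between 554 and 555 is implied.

    # Hypothetical store codes -- treat them as categories, not numbers.
    stores = [555, 554, 210, 555, 210]

    # Represent each code as a string label so no numeric ordering is implied.
    labels = [str(s) for s in stores]

    # One-hot encode: one binary column per distinct store code.
    categories = sorted(set(labels))
    one_hot = [[1 if lab == cat else 0 for cat in categories] for lab in labels]

    for lab, row in zip(labels, one_hot):
        print(lab, row)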

The distinction between continuous variables and factors is absolutely essential for just about any analytical work because it determines (i) the analytical operations you can run against your data; and (ii) the type of predictive algorithm you can use.

W/r/t the first item, cross-tabulation (the function xtabs in R) is a common analytical operation that you can perform only on factors. (Note: it's referred to as a contingency table if percents are recorded instead of raw counts.) Imagine that you have a data set comprised of rows from a server access log, aggregated so that one row is one user in one session. Suppose that you have configured the log to record, among other things, referral URL and browser type. A cross-tab of these two variables just shows the frequency of users for every combination of the values of the two variables. So if there are three referral URLs in the data and four browser types, the resulting table would have 12 cells. Again, cross-tabulation is only possible for discrete variables.
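As a purely illustrative example (the log values below are invented), pandas' crosstab does the same job as R's xtabs, and normalize="all" turns the raw counts into the percentage (contingency-table) form mentioned above:

    import pandas as pd

    # Toy server-log data: one row per user session (all values made up).
    sessions = pd.DataFrame({
        "referral_url": ["google.com", "bing.com", "google.com", "twitter.com", "bing.com"],
        "browser":      ["Firefox",    "Chrome",   "Chrome",     "Safari",      "Firefox"],
    })

    # Cross-tabulation: frequency of sessions for each (referral URL, browser) pair.
    print(pd.crosstab(sessions["referral_url"], sessions["browser"]))

    # Contingency-table variant: proportions instead of raw counts.
    print(pd.crosstab(sessions["referral_url"], sessions["browser"], normalize="all"))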

The other reason to distinguish variables into discrete and continuous is so that you can select and/or configure your Machine Learning algorithm in accord with whether your response variable (the one you are trying to predict) is discrete or continuous.

An orthogonal classification of variable types (again, i'm referring to columns in a data set) is measured versus response (sometimes independent versus dependent). So for instance, you record various session details for each unregistered visitor to your Site, such as which pages viewed, total pages viewed, total time per page, inbound referral link, outbound link, etc.--those are all measured variables. And one reason to measure these is to predict whether that new user will eventually register and if so will they sign up for the premium service. Those are the response variables.

In this instance, the response variables might be Registered User and Premium Subscriber, and the values for both are either yes or no, which makes each of them a discrete variable.

When your response variable--the thing you are trying to predict--is a factor/discrete variable, you have a classification problem. What's returned by your Machine Learning algorithm is a class label (e.g., 'registered user' or 'not registered user').

If on the other hand, your response variable is continuous (let's say you wanted to predict expected lifetime value, as total amount wagered, for a new customer on your sports-betting site) then your problem is not classification but regression. In other words, your algorithm must return a value, usually a float.

Many Machine Learning algorithms, including Neural Networks, which you mentioned in your Question (also e.g., Support Vector Machines, and KNN), can be easily configured to run in either mode--classification or regression.
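To make that concrete, here is a hedged sketch using scikit-learn's KNN estimators (the data is synthetic and the rules generating it are arbitrary); the same family of algorithm is fitted once with a discrete response (classification) and once with a continuous one (regression):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

    rng = np.random.default_rng(0)
    X = rng.random((100, 3))                 # measured variables (synthetic)

    # Classification: discrete response (e.g., registered user yes/no).
    y_class = (X[:, 0] + X[:, 1] > 1).astype(int)
    clf = KNeighborsClassifier(n_neighbors=5).fit(X, y_class)
    print(clf.predict(X[:2]))                # returns class labels

    # Regression: continuous response (e.g., expected lifetime value).
    y_reg = 100 * X[:, 0] + 20 * X[:, 2] + rng.normal(0, 5, size=100)
    reg = KNeighborsRegressor(n_neighbors=5).fit(X, y_reg)
    print(reg.predict(X[:2]))                # returns floats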

Continuous Variables

Continuous variables are things like time (in seconds), number of login sessions per user, weight, age, total calories consumed, etc.--things expressed with floats or less often with integers and incremented accordingly (i.e., 1 second more than 56 seconds is 57 seconds).

Dealing with these (once you have determined which variables in your data set are in fact continuous) usually involves just the step referred to, confusingly, as either normalizing, scaling, or standardizing. While the terms are used interchangeably in practice, they actually refer to separate transformations justified by separate circumstances.

Use or don't use these terms however you wish, though separating the three might help you reconcile all of the techniques you see in the literature or that are used in practice (a short NumPy sketch of all three follows the list below).

  • Rescaling: for instance, to change the unit of measure; to rescale you multiply/divide by one constant and add/subtract another. This is easier to show than describe, e.g., to convert from Celsius to Fahrenheit, you multiply the Celsius temperature by 9/5 then add 32;

  • Normalizing: dividing by the norm. For instance, if one of the rows in your data set is [1.23, 2.21, 0.84, 3.54, 1.90], then dividing it element-wise by its norm (which is about 4.8 in this case) gives the normalized row [0.255, 0.458, 0.174, 0.734, 0.394]. If you use Python+NumPy, the expression is normalized_row1 = row1 / LA.norm(row1), with the prerequisite import statement import numpy.linalg as LA;

  • Standardizing: refers to a two-step process of subtraction and division, e.g., to get a variable in 'standard normal' form, you subtract the mean and divide by the standard deviation, after which your random variable has a mean of 0 and an SD of 1.
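Here is the NumPy sketch promised above, showing all three transformations; the row is the one from the normalizing bullet, and the Celsius values are arbitrary, chosen only to illustrate rescaling:

    import numpy as np
    import numpy.linalg as LA

    row1 = np.array([1.23, 2.21, 0.84, 3.54, 1.90])

    # Rescaling: change of units, e.g. Celsius -> Fahrenheit (multiply, then add a constant).
    celsius = np.array([0.0, 25.0, 100.0])
    fahrenheit = celsius * 9 / 5 + 32        # [ 32.  77. 212.]

    # Normalizing: divide the row element-wise by its Euclidean norm (~4.82 here).
    normalized_row1 = row1 / LA.norm(row1)   # ~[0.255, 0.458, 0.174, 0.734, 0.394]

    # Standardizing: subtract the mean, divide by the standard deviation;
    # the result has mean 0 and SD 1.
    standardized_row1 = (row1 - row1.mean()) / row1.std()

    print(fahrenheit, normalized_row1, standardized_row1)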

doug answered Sep 28 '22


Usually you will need to specify the level of measurement, as well as the role of the variable (independent, dependent, input, output, etc.). Sometimes the package will make a "guess" and you have the option of changing it. In your example, Store is a classification (nominal) variable: even though it is a number, you can't do meaningful arithmetic on it.
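As one concrete (and purely illustrative) way of changing that "guess": in pandas, a numeric Store column can be re-declared as a categorical after import, assuming the made-up frame below:

    import pandas as pd

    # Made-up data: by default the package would "guess" that Store is numeric.
    df = pd.DataFrame({"Store": [555, 554, 210], "Sales": [1200.0, 950.0, 1430.0]})

    # Declare Store as a categorical (nominal) variable; arithmetic on it is not meaningful.
    df["Store"] = df["Store"].astype("category")

    print(df.dtypes)                  # Store is now 'category', Sales stays float64
    print(df["Store"].cat.categories) # the distinct store codes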

http://en.wikipedia.org/wiki/Level_of_measurement

Ralph Winters answered Sep 28 '22