 

What does dimensionality reduction mean?

What does dimensionality reduction mean exactly?

I searched for its meaning and only found that it means the transformation of raw data into a more useful form. So what is the benefit of having data in a more useful form? I mean, how can I use it in practice (in an application)?

Yasmeen asked Jan 03 '10


3 Answers

Dimensionality reduction is about converting data of very high dimensionality into data of much lower dimensionality, such that each of the lower dimensions conveys much more information.

This is typically done while solving machine learning problems to get better features for a classification or regression task.

Here's a contrived example. Suppose you have a list of 100 movies and 1000 people, and for each person you know whether they like or dislike each of the 100 movies. So for each instance (which in this case means each person) you have a binary vector of length 100 [position i is 0 if that person dislikes the i'th movie, 1 otherwise].
You could perform your machine learning task on these vectors directly, but instead you could decide on 5 genres of movies and, using the data you already have, figure out whether each person likes or dislikes each genre as a whole. In this way you reduce your data from a vector of size 100 to a vector of size 5 [position i is 1 if the person likes genre i].

The vector of length 5 can be thought of as a good representative of the vector of length 100, because most people tend to like movies only in their preferred genres.

However, it's not going to be an exact representation, because there might be cases where a person hates all the movies in a genre except one.

The point is that the reduced vector conveys most of the information in the larger one while consuming a lot less space and being faster to compute with.
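To make that concrete, here is a minimal sketch in Python/NumPy. The like/dislike matrix and the movie-to-genre mapping are made up for illustration, and a simple majority rule stands in for however you would actually decide whether someone likes a genre:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 1000 people x 100 movies, 1 = likes, 0 = dislikes.
likes = rng.integers(0, 2, size=(1000, 100))

# Made-up mapping: each of the 100 movies belongs to one of 5 genres.
genre_of_movie = np.repeat(np.arange(5), 20)            # shape (100,)

# Reduce each length-100 vector to length 5:
# position g is 1 if the person likes most movies in genre g.
genre_likes = np.zeros((likes.shape[0], 5), dtype=int)
for g in range(5):
    in_genre = likes[:, genre_of_movie == g]            # (1000, 20)
    genre_likes[:, g] = (in_genre.mean(axis=1) > 0.5).astype(int)

print(likes.shape, "->", genre_likes.shape)             # (1000, 100) -> (1000, 5)
```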

Aditya Mukherji answered Nov 20 '22


Your question is a little vague, but there's an interesting statistical technique that may be what you're thinking of, called Principal Component Analysis (PCA), which does something similar (and, incidentally, plotting its results was my first real-world programming task).

It's a neat and clever technique that is remarkably widely applicable. I applied it to similarities between protein amino acid sequences, but I've seen it used to analyse everything from relationships between bacteria to malt whisky.

Consider a graph of some attributes of a collection of things where one has two independent variables: to analyse the relationship between these, one obviously plots in two dimensions and might see a scatter of points. If you have three variables you can use a 3D graph, but after that one starts to run out of dimensions.

In PCA one might have dozens, or even a hundred or more, independent factors, all of which would need to be plotted on perpendicular axes. PCA does this, then analyses the resulting multidimensional graph to find the set of two or three axes within it that contain the largest amount of information. For example, the first principal coordinate will be a composite axis (i.e. at some angle through n-dimensional space) along which the points carry the most information when plotted. The second axis is perpendicular to this (remember this is n-dimensional space, so there are a lot of perpendiculars) and contains the second largest amount of information, and so on.

Plotting the resulting graph in 2D or 3D will typically give you a visualization of the data that contains a significant amount of the information in the original dataset. It's usual to consider the technique valid when the representation you're looking at contains around 70% of the information in the original data: enough to visualize, with some confidence, relationships that would otherwise not be apparent in the raw statistics. Note that the technique requires all factors to have the same weight, but even so it's an extremely widely applicable method that deserves to be more widely known, and it is available in most statistical packages (I did my work on an ICL 2700 in 1980 - which is about as powerful as an iPhone).
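If you'd rather try this from code than a statistical package, here is a rough sketch using scikit-learn's PCA on made-up data: the 50-column matrix X stands in for your dozens of independent factors, and explained_variance_ratio_ tells you how much of the original information the 2D picture keeps (the ~70% rule of thumb above):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Made-up data: 200 samples, each with 50 correlated measurements,
# secretly driven by only 3 underlying factors plus a little noise.
hidden = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 50))
X = hidden @ mixing + 0.1 * rng.normal(size=(200, 50))

# Project onto the first two principal (composite) axes.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)        # shape (200, 2), ready to scatter-plot

# Fraction of the original variance retained by the 2D picture;
# the rule of thumb above looks for roughly 0.7 or more.
print(pca.explained_variance_ratio_.sum())
```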

Cruachan answered Nov 20 '22


http://en.wikipedia.org/wiki/Dimension_reduction

Maybe you have heard of PCA (principal component analysis), which is a dimension reduction algorithm.

Others include LDA, matrix factorization based methods, etc.

Here's a simple example. You have a lot of text files, and each file consists of some words. These files can be classified into two categories. You want to visualize each file as a point in 2D/3D space so that you can see the distribution clearly. To do that, you need dimension reduction to transform a file containing many words into only 2 or 3 dimensions.
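As a rough sketch of that workflow (assuming Python with scikit-learn; the four tiny "files" below are made up), each document becomes a high-dimensional word vector, which is then reduced to 2 dimensions for plotting:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Made-up stand-ins for the contents of your text files
# (two about animals, two about finance).
docs = [
    "cats purr and chase mice",
    "dogs bark and chase cats",
    "stocks rose as markets rallied today",
    "the market fell on weak earnings",
]

# Each file -> a high-dimensional vector with one dimension per distinct word.
X = TfidfVectorizer().fit_transform(docs)
print(X.shape)                                   # (4, number_of_distinct_words)

# Dimension reduction: project each file down to 2 numbers for plotting.
X_2d = TruncatedSVD(n_components=2).fit_transform(X)
print(X_2d.shape)                                # (4, 2)
```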

Yin Zhu answered Nov 20 '22