 

In natural language processing (NLP), how do you perform efficient dimensionality reduction?

In NLP, the dimensionality of the feature space is typically very high. For example, in one project at hand, the feature dimension is almost 20 thousand (p = 20,000), and each feature is a 0-1 indicator of whether a specific word or bigram is present in a paper (one paper is a data point $x \in R^{p}$).

I know the redundancy among the features is huge, so dimension reduction is necessary. I have three questions:

1) I have 10 thousand data points (n = 10,000), and each data point has 10 thousand features (p = 10,000). What is an efficient way to conduct dimension reduction? The matrix $X \in R^{n \times p}$ is so huge that both PCA (or SVD; truncated SVD would be OK, but I don't think SVD is a good way to reduce dimension for binary features) and bag-of-words clustering (e.g., k-means) are hard to run directly on $X$ (sure, it is sparse). I don't have a server, I just use my PC :-(.

2) How do you judge the similarity or distance between two data points? I think Euclidean distance may not work well for binary features. How about the L0 norm? What do you use?

3) If I want to use an SVM (or another kernel method) for classification, which kernel should I use?

Many Thanks!

asked Nov 21 '14 by zxzx179



1 Answer

1) You don't need dimensionality reduction. If you really want it, you can use an L1-penalized linear classifier to reduce to the most helpful features.
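A minimal sketch of that idea (my illustration, not code from the answer), using scikit-learn's SelectFromModel around an L1-penalized LinearSVC; the random toy matrix stands in for the sparse 0/1 data described in the question:

    import numpy as np
    from scipy.sparse import random as sparse_random
    from sklearn.svm import LinearSVC
    from sklearn.feature_selection import SelectFromModel

    # Toy stand-in for the asker's data: a sparse 0/1 indicator matrix.
    X = sparse_random(1000, 5000, density=0.01, format="csr", random_state=0)
    X.data[:] = 1.0  # turn the random nonzeros into 0/1 indicators
    y = np.random.RandomState(0).randint(0, 2, size=1000)

    # The L1 penalty drives most coefficients to exactly zero;
    # SelectFromModel then keeps only the columns with nonzero weights.
    selector = SelectFromModel(LinearSVC(C=0.1, penalty="l1", dual=False, max_iter=5000))
    X_reduced = selector.fit_transform(X, y)
    print(X.shape, "->", X_reduced.shape)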

2) Cosine similarity is often used, either on the raw binary vectors or on their TF-IDF-rescaled versions.
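To make that concrete, a small sketch (my own, with made-up vectors) comparing cosine similarity on raw 0/1 vectors and on their TF-IDF-rescaled versions:

    from scipy.sparse import csr_matrix
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.metrics.pairwise import cosine_similarity

    # Two toy 0/1 presence vectors over a 6-term vocabulary.
    X = csr_matrix([[1, 0, 1, 1, 0, 0],
                    [1, 1, 0, 1, 0, 1]], dtype=float)

    X_tfidf = TfidfTransformer().fit_transform(X)  # reweight by inverse document frequency
    print(cosine_similarity(X[0], X[1]))               # raw binary vectors
    print(cosine_similarity(X_tfidf[0], X_tfidf[1]))   # TF-IDF rescaled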

3) Linear SVMs work best with so many features.
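As an end-to-end sketch of point 3 (the documents and labels are illustrative, not real data): a linear SVM on binary word/bigram presence features, matching the setup described in the question:

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    docs = ["deep parsing of noun phrases",
            "stochastic gradient descent converges",
            "noun phrase chunking with parsers",
            "convergence of gradient methods"]
    labels = [0, 1, 0, 1]  # 0 = linguistics, 1 = optimization (toy labels)

    # binary=True yields the same 0/1 word/bigram presence features the question uses.
    clf = make_pipeline(CountVectorizer(ngram_range=(1, 2), binary=True), LinearSVC())
    clf.fit(docs, labels)
    print(clf.predict(["gradient descent for parsing"]))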

There is a good tutorial on how to do classification like this in Python here: http://scikit-learn.org/dev/tutorial/text_analytics/working_with_text_data.html

answered Sep 28 '22 by Andreas Mueller