 

In natural language processing (NLP), how do you perform efficient dimensionality reduction?

In NLP, the dimensionality of the feature space is typically very high. For example, in one project at hand, the feature dimension is almost 20 thousand (p = 20,000), and each feature is a 0-1 indicator of whether a specific word or bigram is present in a paper (one paper is a data point $x \in R^{p}$).

I know the redundancy among the features is huge, so dimension reduction is necessary. I have three questions:

1) I have 10 thousand data points (n = 10,000), and each data point has 10 thousand features (p = 10,000). What is an efficient way to conduct dimension reduction? The matrix $X \in R^{n \times p}$ is so huge that both PCA (or SVD; truncated SVD would be OK, but I don't think SVD is a good way to reduce dimension for binary features) and bag-of-words clustering (e.g., k-means) are hard to run directly on $X$ (sure, it is sparse). I don't have a server, I just use my PC :-(.

2) How do you judge the similarity or distance between two data points? I think Euclidean distance may not work well for binary features. How about the L0 norm? What do you use?

3) If I want to use an SVM (or another kernel method) for classification, which kernel should I use?

Many Thanks!

asked Nov 21 '14 by zxzx179



1 Answer

1) You don't need dimensionality reduction. If you really want it, you can use an L1-penalized linear classifier to reduce to the most helpful features.
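A minimal sketch of that idea (my illustration, not code from the answer), using scikit-learn's SelectFromModel around an L1-penalized LinearSVC; the random toy matrix stands in for the sparse 0/1 data described in the question:

    import numpy as np
    from scipy.sparse import random as sparse_random
    from sklearn.svm import LinearSVC
    from sklearn.feature_selection import SelectFromModel

    # Toy stand-in for the asker's data: a sparse 0/1 indicator matrix.
    X = sparse_random(1000, 5000, density=0.01, format="csr", random_state=0)
    X.data[:] = 1.0  # turn the random nonzeros into 0/1 indicators
    y = np.random.RandomState(0).randint(0, 2, size=1000)

    # The L1 penalty drives most coefficients to exactly zero;
    # SelectFromModel then keeps only the columns with nonzero weights.
    selector = SelectFromModel(LinearSVC(C=0.1, penalty="l1", dual=False, max_iter=5000))
    X_reduced = selector.fit_transform(X, y)
    print(X.shape, "->", X_reduced.shape)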

2) Cosine similarity is often used, either on the raw binary vectors or on their TF-IDF-rescaled versions.
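To make that concrete, a small sketch (my own, with made-up vectors) comparing cosine similarity on raw 0/1 vectors and on their TF-IDF-rescaled versions:

    from scipy.sparse import csr_matrix
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.metrics.pairwise import cosine_similarity

    # Two toy 0/1 presence vectors over a 6-term vocabulary.
    X = csr_matrix([[1, 0, 1, 1, 0, 0],
                    [1, 1, 0, 1, 0, 1]], dtype=float)

    X_tfidf = TfidfTransformer().fit_transform(X)  # reweight by inverse document frequency
    print(cosine_similarity(X[0], X[1]))               # raw binary vectors
    print(cosine_similarity(X_tfidf[0], X_tfidf[1]))   # TF-IDF rescaled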

3) Linear SVMs work best with so many features.
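As an end-to-end sketch of point 3 (the documents and labels are illustrative, not real data): a linear SVM on binary word/bigram presence features, matching the setup described in the question:

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    docs = ["deep parsing of noun phrases",
            "stochastic gradient descent converges",
            "noun phrase chunking with parsers",
            "convergence of gradient methods"]
    labels = [0, 1, 0, 1]  # 0 = linguistics, 1 = optimization (toy labels)

    # binary=True yields the same 0/1 word/bigram presence features the question uses.
    clf = make_pipeline(CountVectorizer(ngram_range=(1, 2), binary=True), LinearSVC())
    clf.fit(docs, labels)
    print(clf.predict(["gradient descent for parsing"]))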

There is a good tutorial on how to do classification like this in Python here: http://scikit-learn.org/dev/tutorial/text_analytics/working_with_text_data.html

answered Sep 28 '22 by Andreas Mueller