
How to train a large dataset for classification

I have a training dataset of 1,600,000 tweets. How can I train such a huge dataset?

I have tried something using nltk.NaiveBayesClassifier. It would take more than 5 days to train if I ran it.

import nltk

def extract_features(tweet):
    # featureList (the full vocabulary) is built elsewhere from the corpus
    tweet_words = set(tweet)
    features = {}
    for word in featureList:
        features['contains(%s)' % word] = (word in tweet_words)
    return features


training_set = nltk.classify.util.apply_features(extract_features, tweets)

NBClassifier = nltk.NaiveBayesClassifier.train(training_set)  # This takes lots of time

What should I do?

I need to classify my dataset using SVM and Naive Bayes.

Dataset I want to use: Link

Sample(training Dataset):

Label     Tweet
0         url aww bummer you shoulda got david carr third day
4         thankyou for your reply are you coming england again anytime soon

Sample(testing Dataset):

Label     Tweet
4         love lebron url
0         lebron beast but still cheering the til the end
I have to predict label 0 or 4 only.

How can I train this huge dataset efficiently?

asked Jan 14 '15 by Shahriar



3 Answers

Following what superbly proposed about feature extraction, you could use the TfidfVectorizer from scikit-learn to extract the important words from the tweets. With the default configuration, coupled with a simple LogisticRegression, it gives me 0.8 accuracy. Hope that helps. Here is an example of how to use it for your problem:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the data, drop empty tweets and remove the neutral label (2) from the test set
train_df_raw = pd.read_csv('train.csv', header=None, names=['label', 'tweet'])
test_df_raw = pd.read_csv('test.csv', header=None, names=['label', 'tweet'])
train_df_raw = train_df_raw[train_df_raw['tweet'].notnull()]
test_df_raw = test_df_raw[test_df_raw['tweet'].notnull()]
test_df_raw = test_df_raw[test_df_raw['label'] != 2]

# Map the labels to binary: 0 stays 0, everything else (4) becomes 1
y_train = [x if x == 0 else 1 for x in train_df_raw['label'].tolist()]
y_test = [x if x == 0 else 1 for x in test_df_raw['label'].tolist()]
X_train = train_df_raw['tweet'].tolist()
X_test = test_df_raw['tweet'].tolist()

print('At vectorizer')
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
print('At vectorizer for test data')
X_test = vectorizer.transform(X_test)

print('At classifier')
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

predictions = classifier.predict(X_test)
print('Accuracy:', accuracy_score(y_test, predictions))

# Do not shadow the confusion_matrix function with its own result
conf_matrix = confusion_matrix(y_test, predictions)
print(conf_matrix)

Accuracy: 0.8
[[135  42]
 [ 30 153]]
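
Since the question asks specifically for Naive Bayes and an SVM, you can swap the classifier without touching the TF-IDF features. A minimal sketch, reusing the X_train/y_train/X_test/y_test variables from the snippet above (your accuracy figures will differ):

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Multinomial Naive Bayes works directly on the sparse TF-IDF matrix and trains in seconds
nb = MultinomialNB()
nb.fit(X_train, y_train)
print('NB accuracy:', accuracy_score(y_test, nb.predict(X_test)))

# LinearSVC is a linear-kernel SVM suited to large, high-dimensional sparse data
svc = LinearSVC()
svc.fit(X_train, y_train)
print('SVM accuracy:', accuracy_score(y_test, svc.predict(X_test)))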
answered Nov 14 '22 by farmi


Before speeding up the training, I'd personally make sure that you actually need to. While this is not a direct answer to your question, I'll try to provide a different angle that you might or might not be missing (hard to tell from your initial post).

Take, e.g., superbly's implementation as a baseline: 1.6M training and 500 test samples with 3 features yield 0.35 accuracy.

Using the exact same setup, you can go as low as 50k training samples without losing accuracy; in fact, the accuracy goes up slightly, probably because you are overfitting with that many examples (you can check this by running his code with a smaller sample size). I'm pretty sure that using a neural network at this stage would give horrible accuracy with this setup (the SVM can be tuned somewhat to overcome overfitting, though that's not my point).

You wrote in your initial post that you have 55k features (which you have since deleted, for some reason?). This number should correlate with the size of your training set. Since you didn't specify your list of features, it's not really possible to give you a properly working model or to test my assumption.

However, I highly suggest that, as a first step, you reduce your training data and see a) how well you perform and b) at which point overfitting sets in. I would also use a larger test set; 500 test samples against 1.6M training samples is a strange split, so try an 80/20% train/test split instead. As a third step, check the size of your feature list: is it representative of what you need? If there are unnecessary or duplicate features in that list, you should consider pruning them. A sketch of such a sample-size sweep follows.
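
A minimal sketch of that sweep, assuming the TF-IDF + LogisticRegression setup from the answer above and that tweets/labels hold the full texts and 0/4 labels (the names are illustrative):

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 80/20 split of the full data instead of the tiny 500-sample test set
X_tr, X_te, y_tr, y_te = train_test_split(tweets, labels, test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer()
X_tr_vec = vectorizer.fit_transform(X_tr)  # fit the vocabulary on training text only
X_te_vec = vectorizer.transform(X_te)

# Train on growing subsets and watch where test accuracy stops improving
for n in [10000, 50000, 200000, 800000, len(y_tr)]:
    clf = LogisticRegression().fit(X_tr_vec[:n], y_tr[:n])
    print(n, accuracy_score(y_te, clf.predict(X_te_vec)))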

As a final thought, if you do come back to larger training sizes (e.g. because you decide that you in fact need much more data than you use now), consider whether slow training really is an issue (besides testing your model). Many state-of-the-art classifiers are trained for days or weeks using GPU computing. Training time doesn't matter in that case because they're only trained once and possibly just updated with small batches of data when they "go online".
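
If you do end up training on the full 1.6M tweets, scikit-learn's out-of-core tools are one way to keep memory and training time bounded. This is not from the answers above, just a sketch of the general idea, assuming train.csv holds label,tweet rows:

import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# HashingVectorizer is stateless, so no fitting pass over the whole corpus is needed
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf = SGDClassifier(loss='hinge')  # hinge loss trains a linear SVM by SGD

# Stream the tweets in chunks and update the model incrementally
for chunk in pd.read_csv('train.csv', header=None, names=['label', 'tweet'], chunksize=100000):
    chunk = chunk[chunk['tweet'].notnull()]
    X = vectorizer.transform(chunk['tweet'].astype(str))
    clf.partial_fit(X, chunk['label'], classes=[0, 4])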

answered Nov 14 '22 by runDOSrun


Here is another option. It took 3 minutes on my machine (I should really get a new one :P).

macbook 2006
2 GHz Intel Core 2 Duo
2 GB DDR2 SDRAM

The achieved accuracy was: 0.355421686747

I'm sure that if you tune the support vector machine you can get better results.

First I changed the format of the CSV files so they can be imported more easily. I just replaced the first whitespace with a comma, which can then be used as the delimiter during import.

cat testing.csv | sed 's/\ /,/' > test.csv
cat training.csv | sed 's/\ /,/' > train.csv

In Python I used pandas to read the CSV files and a list comprehension to extract the features, which is much faster than explicit for loops. Afterwards I used sklearn to train a support vector machine.

import pandas
from sklearn import svm
from sklearn.metrics import accuracy_score

featureList = ['obama', 'usa', 'bieber']

# The files produced by the sed commands above have no header row
train_df = pandas.read_csv('train.csv', sep=',', header=None, names=['label', 'tweet'],
                           dtype={'label': int, 'tweet': str})
test_df = pandas.read_csv('test.csv', sep=',', header=None, names=['label', 'tweet'],
                          dtype={'label': int, 'tweet': str})

# One boolean feature per word in featureList: does the tweet contain that word?
train_features = [[w in str(tweet) for w in featureList] for tweet in train_df.values[:, 1]]
test_features = [[w in str(tweet) for w in featureList] for tweet in test_df.values[:, 1]]
train_labels = train_df.values[:, 0]
test_labels = test_df.values[:, 0]

clf = svm.SVC(max_iter=1000)
clf.fit(train_features, train_labels)
prediction = clf.predict(test_features)

print('accuracy: ', accuracy_score(test_labels.tolist(), prediction.tolist()))
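
A side note on scaling, not from the original answer: the kernelized svm.SVC above scales roughly quadratically with the number of samples, so for the full 1.6M tweets a linear SVM is the more practical drop-in. A sketch using the same features and labels as above:

from sklearn.svm import LinearSVC

# liblinear-based, scales to millions of samples where the kernelized SVC becomes impractical
clf = LinearSVC(max_iter=1000)
clf.fit(train_features, train_labels)
prediction = clf.predict(test_features)

print('accuracy: ', accuracy_score(test_labels.tolist(), prediction.tolist()))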
answered Nov 14 '22 by mjspier