I am performing topic detection with supervised learning, but my matrices are huge (202180 x 15000) and I cannot fit them into the models I want. Most of the matrix consists of zeros, and only logistic regression works. Is there a way to keep working with the same data but make it usable by the other models? For example, can I build my matrices in a different way?
Here is my code:
import numpy as np
import subprocess
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
def run(command):
    output = subprocess.check_output(command, shell=True)
    return output
f = open('/Users/win/Documents/wholedata/RightVo.txt','r')
vocab_temp = f.read().split()
f.close()
col = len(vocab_temp)
print("Training column size:")
print(col)
row = run('cat '+'/Users/win/Documents/wholedata/X_tr.txt'+" | wc -l").decode().split()[0]
print("Training row size:")
print(row)
matrix_tmp = np.zeros((int(row),col), dtype=np.int64)
print("Train Matrix size:")
print(matrix_tmp.size)
label_tmp = np.zeros((int(row)), dtype=np.int64)
f = open('/Users/win/Documents/wholedata/X_tr.txt','r')
count = 0
for line in f:
    line_tmp = line.split()
    #print(line_tmp)
    for word in line_tmp[0:]:
        if word not in vocab_temp:
            continue
        matrix_tmp[count][vocab_temp.index(word)] = 1
    count = count + 1
f.close()
print("Train matrix is:\n ")
print(matrix_tmp)
print(label_tmp)
print("Train Label size:")
print(len(label_tmp))
f = open('/Users/win/Documents/wholedata/RightVo.txt','r')
vocab_tmp = f.read().split()
f.close()
col = len(vocab_tmp)
print("Test column size:")
print(col)
row = run('cat '+'/Users/win/Documents/wholedata/X_te.txt'+" | wc -l").decode().split()[0]
print("Test row size:")
print(row)
matrix_tmp_test = np.zeros((int(row),col), dtype=np.int64)
print("Test matrix size:")
print(matrix_tmp_test.size)
label_tmp_test = np.zeros((int(row)), dtype=np.int64)
f = open('/Users/win/Documents/wholedata/X_te.txt','r')
count = 0
for line in f:
    line_tmp = line.split()
    #print(line_tmp)
    for word in line_tmp[0:]:
        if word not in vocab_tmp:
            continue
        matrix_tmp_test[count][vocab_tmp.index(word)] = 1
    count = count + 1
f.close()
print("Test Matrix is: \n")
print(matrix_tmp_test)
print(label_tmp_test)
print("Test Label Size:")
print(len(label_tmp_test))
xtrain=[]
with open("/Users/win/Documents/wholedata/Y_te.txt") as filer:
    for line in filer:
        xtrain.append(line.strip().split())
xtrain= np.ravel(xtrain)
label_tmp_test=xtrain
ytrain=[]
with open("/Users/win/Documents/wholedata/Y_tr.txt") as filer:
    for line in filer:
        ytrain.append(line.strip().split())
ytrain = np.ravel(ytrain)
label_tmp=ytrain
model = LogisticRegression()
model = model.fit(matrix_tmp, label_tmp)
#print(model)
print("Entered 1")
y_train_pred = model.predict(matrix_tmp_test)
print("Entered 2")
print(metrics.accuracy_score(label_tmp_test, y_train_pred))
NumPy is an extremely useful library, and from using it I've found that it's capable of handling matrices which are quite large (10000 x 10000) easily, but begins to struggle with anything much larger (trying to create a matrix of 50000 x 50000 fails). Obviously, this is because of the massive memory requirements.
The most prominent solution, and the one I would suggest first, is to use SciPy's sparse matrices. SciPy is a package that builds upon NumPy and provides additional mechanisms such as sparse matrices, which store only the elements whose value is non-zero.
Representing a sparse matrix as a 2D array wastes a lot of memory, because the zeros carry no information in most cases. So instead of storing zeros alongside the non-zero elements, we store only the non-zero elements, as (row, column, value) triples.
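As a minimal sketch of these (row, column, value) triples, scipy.sparse.coo_matrix stores exactly that representation (the shape and entries here are just illustrative):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Three non-zero entries, given as (row, column, value) triples.
rows = np.array([0, 1, 3])
cols = np.array([2, 0, 3])
vals = np.array([1, 1, 1])

# Build a 4x5 sparse matrix; every entry not listed is implicitly zero.
m = coo_matrix((vals, (rows, cols)), shape=(4, 5))
print(m.nnz)        # number of stored (non-zero) elements: 3
print(m.toarray())  # dense view, only sensible for small matrices
```

The memory cost is proportional to the number of non-zeros, not to rows x columns, which is what makes a 202180 x 15000 mostly-zero matrix feasible.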
You can use the sparse matrix data structures available in the scipy package: http://docs.scipy.org/doc/scipy/reference/sparse.html
According to the definition:
A sparse matrix is simply a matrix with a large number of zero values. In contrast, a matrix where many or most entries are non-zero is said to be dense. There are no strict rules for what constitutes a sparse matrix, so we'll say that a matrix is sparse if there is some benefit to exploiting its sparsity. Additionally, there are a variety of sparse matrix formats which are designed to exploit different sparsity patterns (the structure of non-zero values in a sparse matrix) and different methods for accessing and manipulating matrix entries.
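Applied to the code in the question, here is a hedged sketch: build the term matrix as a scipy.sparse.lil_matrix (a format that supports cheap incremental assignment), convert it to CSR before fitting, and look words up in a dict instead of calling list.index (a linear scan per word). The `build_matrix` helper and its paths mirror the question's code but are otherwise my own naming; any scikit-learn estimator that accepts sparse input (e.g. SGDClassifier) should work the same way.

```python
import numpy as np
from scipy.sparse import lil_matrix
from sklearn.linear_model import SGDClassifier

def build_matrix(data_path, vocab):
    # Map each word to its column index once, instead of list.index() per word.
    index = {word: i for i, word in enumerate(vocab)}
    with open(data_path) as f:
        lines = f.readlines()
    # LIL format allows cheap row-by-row writes; zero entries cost nothing.
    matrix = lil_matrix((len(lines), len(vocab)), dtype=np.int64)
    for row, line in enumerate(lines):
        for word in line.split():
            col = index.get(word)
            if col is not None:
                matrix[row, col] = 1
    # CSR is the format estimators expect for fast row-wise arithmetic.
    return matrix.tocsr()

# Usage with the question's files (paths as in the original code):
# vocab = open('/Users/win/Documents/wholedata/RightVo.txt').read().split()
# X_train = build_matrix('/Users/win/Documents/wholedata/X_tr.txt', vocab)
# X_test = build_matrix('/Users/win/Documents/wholedata/X_te.txt', vocab)
# model = SGDClassifier().fit(X_train, ytrain)
```

Counting lines in Python while reading the file also removes the need for the `cat | wc -l` subprocess, since the matrix dimensions no longer have to be known before allocation at full dense size.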