I have to deal with a large dataset. I need to store the term frequency of each sentence, which I could do either with a list of dictionaries or with a NumPy array.
But I will have to sort and append (in case the word already exists). Which will be better in this case?
The solution to the problem you are describing is SciPy's sparse matrix.
A small example:
from scipy.sparse import csr_matrix

docs = [["hello", "world", "hello"], ["goodbye", "cruel", "world"]]
indptr = [0]
indices = []
data = []
vocabulary = {}
for d in docs:
    for term in d:
        # setdefault assigns the next free column index the first time a term
        # is seen, so no sorting or duplicate checking is needed
        index = vocabulary.setdefault(term, len(vocabulary))
        indices.append(index)
        data.append(1)
    indptr.append(len(indices))

# duplicate (row, col) entries are summed, so "hello" shows up as 2
print(csr_matrix((data, indices, indptr), dtype=int).toarray())
Each sentence is a row, and each term is a column.
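Once the matrix is built, you can read off a term's frequency in a given sentence by looking its column up in vocabulary. A minimal sketch, reusing data, indices, indptr, and vocabulary from the block above (term_frequency is a hypothetical helper, not part of SciPy):

from scipy.sparse import csr_matrix

def term_frequency(X, vocabulary, sentence_idx, term):
    # return 0 for unseen terms instead of raising a KeyError
    col = vocabulary.get(term)
    return X[sentence_idx, col] if col is not None else 0

X = csr_matrix((data, indices, indptr), dtype=int)
print(term_frequency(X, vocabulary, 0, "hello"))  # 2: "hello" appears twice in the first sentence

If you would rather not do this bookkeeping yourself, scikit-learn's CountVectorizer builds the same matrix for you: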
from sklearn.feature_extraction.text import CountVectorizer

# min_df=2 drops terms that appear in fewer than two documents
vectorizer = CountVectorizer(min_df=2)
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vectorizer = vectorizer.fit(corpus)
print(vectorizer.vocabulary_)
# prints {'this': 4, 'is': 2, 'the': 3, 'document': 0, 'first': 1}
X = vectorizer.transform(corpus)
print(X.toarray())
# prints
# [[1 1 1 1 1]
#  [1 0 1 1 1]
#  [0 0 0 1 0]
#  [1 1 1 1 1]]
And now X is your document-term matrix (note that X is a csr_matrix). You can also use TfidfTransformer if you want tf-idf weights instead of raw counts.
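A minimal sketch of that step, applying TfidfTransformer with its default settings (l2 normalization, smoothed idf) to the count matrix X from above:

from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer()  # defaults: norm='l2', smooth_idf=True
X_tfidf = transformer.fit_transform(X)  # reweights the raw counts in X
print(X_tfidf.toarray())

If you don't need the intermediate count matrix, TfidfVectorizer combines both steps (counting and tf-idf weighting) in one object.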