
text mining sparse/Non-sparse meaning

Tags:

r

text-mining

Can somebody explain what the code below does and what its output means? I have already created a corpus.

frequencies = DocumentTermMatrix(corpus)
frequencies

output is

<<DocumentTermMatrix (documents: 299, terms: 1297)>>
Non-/sparse entries: 6242/381561
Sparsity           : 98%
Maximal term length: 19
Weighting          : term frequency (tf)

And here is the code for removing sparse terms.

sparse = removeSparseTerms(frequencies, 0.97)
sparse

output is

> sparse
<<DocumentTermMatrix (documents: 299, terms: 166)>>
Non-/sparse entries: 3773/45861
Sparsity           : 92%
Maximal term length: 10
Weighting          : term frequency (tf)

What is happening here? What do "Non-/sparse entries" and "Sparsity" mean? Can somebody help me understand these?

Thank you.

subro asked Dec 14 '22 05:12


2 Answers

With this code you have created a document-term matrix of the corpus:

frequencies = DocumentTermMatrix(corpus)

A Document Term Matrix (DTM) lists all occurrences of words in the corpus, by document. In the DTM, the documents are represented by rows and the terms (or words) by columns. Each matrix entry holds the number of times the term in that column occurs in the document in that row: 0 if the word does not occur, 1 if it occurs once, 2 if it occurs twice, and so on.

As an example, consider a corpus of two documents.

Doc1: bananas are good

Doc2: bananas are yellow

DTM for the above corpus would look like

              bananas         are        yellow       good
Doc1            1               1          0            1

Doc2            1               1          1            0
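The small example can be reproduced with a few lines of R. This is a minimal sketch that assumes the tm package is installed and uses its default settings (no stemming, no stop word removal); the object names docs, corpus, dtm and m are only illustrative. Note that tm orders the term columns alphabetically, so the column order differs from the table.

```r
library(tm)

# Two toy documents from the example
docs <- c("bananas are good", "bananas are yellow")
corpus <- VCorpus(VectorSource(docs))

# Build the document-term matrix with tm's default settings
dtm <- DocumentTermMatrix(corpus)

# Convert to an ordinary dense matrix to look at the individual counts
m <- as.matrix(dtm)
print(m)
```

Each row of m is one document and each column one term, so m[1, "good"] is the count of "good" in the first document.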

The output

<<DocumentTermMatrix (documents: 299, terms: 1297)>>
Non-/sparse entries: 6242/381561
Sparsity           : 98%
Maximal term length: 19
Weighting          : term frequency (tf)

The output shows that the DTM has 299 documents (rows) and 1297 distinct terms (columns), each of which appears at least once in the corpus.

sparse = removeSparseTerms(frequencies, 0.97)

Now you are removing the terms that appear too rarely in your data. With a threshold of 0.97, any term that is absent from more than 97% of the documents is dropped, so only terms that appear in roughly 3% of the documents or more are kept. In terms of the DTM above, this removes the columns that are non-zero in only a few documents.
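To make the threshold concrete, here is a rough base R sketch of the filtering idea on a tiny made-up matrix. It is not tm's actual code: the rule used (keep a term when it occurs in more than N × (1 − sparse) documents) is my understanding of how removeSparseTerms behaves, and the matrix, term names and the 0.7 threshold are hypothetical. Under that assumption, with 299 documents and sparse = 0.97 the cut-off is 299 × 0.03 = 8.97, so a term would need to occur in at least 9 documents to survive.

```r
# Toy document-term matrix: 5 documents (rows) x 3 hypothetical terms (columns)
m <- matrix(
  c(1, 2, 0,
    1, 0, 0,
    3, 0, 0,
    1, 1, 0,
    2, 0, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("Doc", 1:5), c("common", "medium", "rare"))
)

sparse <- 0.7  # maximum allowed sparsity (hypothetical value for this toy data)

# Number of documents in which each term occurs at least once
doc_freq <- colSums(m > 0)

# Assumed rule: keep a term if it occurs in more than N * (1 - sparse) documents
keep <- doc_freq > nrow(m) * (1 - sparse)

reduced <- m[, keep, drop = FALSE]
print(colnames(reduced))
```

Here "rare" occurs in only one of five documents (80% of its cells are zero), so it is the only column dropped.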

Now if you look at the output

> sparse
<<DocumentTermMatrix (documents: 299, terms: 166)>>
Non-/sparse entries: 3773/45861
Sparsity           : 92%
Maximal term length: 10
Weighting          : term frequency (tf)

The number of documents is still the same, i.e. 299, but the number of terms has dropped from 1297 to 166.

Ravi answered Dec 30 '22 23:12


Non-/sparse entries: 6242/381561
Sparsity : 98%

This reads as follows: 381561 cells in frequencies are 0 and 6242 have non-zero values. 98% of all cells are zero (381561/(381561+6242) ≈ 0.98).
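The numbers in the printed summary can be checked by hand. The sketch below only uses the figures from the output in the question; the variable names are illustrative.

```r
n_docs   <- 299
n_terms  <- 1297
non_zero <- 6242

total_cells <- n_docs * n_terms        # every document/term combination
zero_cells  <- total_cells - non_zero  # cells where the term does not occur
sparsity    <- zero_cells / total_cells

cat("zero cells:", zero_cells, "\n")
cat("sparsity:  ", round(100 * sparsity), "%\n")
```

The same arithmetic applied to the reduced matrix (299 × 166 cells, 3773 of them non-zero) gives the 45861 zero cells and 92% sparsity shown in the second output.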

removeSparseTerms(frequencies, 0.97) removes those terms in frequencies for which more than 97% of the cells are zero, i.e. terms that are quite uncommon in the corpus. As a result, you get a new DocumentTermMatrix with 166 terms and only 45861 zero entries.

Sparsity is a common term. In text mining you often get very large matrices with many cells being zero. It might be clever to not store all cells one-by-one in memory, but to just store the few non-zero entries + their positions to save memory. You can read more about that by looking for sparse matrices.
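The storage idea can be illustrated in base R. This is only a sketch with made-up counts: it keeps the row index, column index and value of each non-zero cell (a "triplet" form) and then rebuilds the dense matrix from them. As far as I know, tm stores its matrices in a comparable triplet representation from the slam package, but the code below does not depend on that.

```r
# Small dense matrix with mostly zero cells (hypothetical counts)
dense <- matrix(0, nrow = 4, ncol = 5)
dense[1, 2] <- 3
dense[3, 5] <- 1
dense[4, 1] <- 2

# Triplet form: row index, column index and value of each non-zero cell
idx <- which(dense != 0, arr.ind = TRUE)
triplets <- data.frame(i = idx[, "row"], j = idx[, "col"], v = dense[idx])
print(triplets)

# Rebuild the dense matrix from the triplets to show nothing was lost
rebuilt <- matrix(0, nrow = nrow(dense), ncol = ncol(dense))
rebuilt[cbind(triplets$i, triplets$j)] <- triplets$v
stopifnot(identical(rebuilt, dense))
```

For this toy matrix, 3 triplets replace 20 stored cells; the saving grows with the share of zero cells.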

lukeA answered Dec 30 '22 23:12