Can somebody tell me the meaning of the code below and of its outputs? I have already created the corpus.
frequencies = DocumentTermMatrix(corpus)
frequencies
The output is:
<<DocumentTermMatrix (documents: 299, terms: 1297)>>
Non-/sparse entries: 6242/381561
Sparsity : 98%
Maximal term length: 19
Weighting : term frequency (tf)
And here is the code for sparse:
sparse = removeSparseTerms(frequencies, 0.97)
sparse
The output is:
> sparse
<<DocumentTermMatrix (documents: 299, terms: 166)>>
Non-/sparse entries: 3773/45861
Sparsity : 92%
Maximal term length: 10
Weighting : term frequency (tf)
What is happening here? What do Non-/sparse entries and Sparsity mean? Can somebody help me understand these?
Thank you.
With this code you have created a document term matrix (DTM) of the corpus:
frequencies = DocumentTermMatrix(corpus)
A Document Term Matrix (DTM) lists all occurrences of words in the corpus, by document. In the DTM, the documents are represented by rows and the terms (or words) by columns. If a word occurs in a particular document, the matrix entry corresponding to that row and column is the number of times the word occurs in that document; otherwise it is 0. For example, if a word occurs twice in a document, the relevant entry is 2.
As an example, consider a corpus consisting of two documents.
Doc1: bananas are good
Doc2: bananas are yellow
The DTM for the above corpus would look like this:
        bananas   are   good   yellow
Doc1          1     1      1        0
Doc2          1     1      0        1
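A minimal R sketch reproducing this tiny example, assuming the tm package (the default DocumentTermMatrix settings lower-case the text and drop words shorter than three characters, so the exact columns may differ slightly):
# Sketch of the two-document example above, using the tm package
library(tm)

docs <- c(Doc1 = "bananas are good",
          Doc2 = "bananas are yellow")

corpus <- VCorpus(VectorSource(docs))
dtm    <- DocumentTermMatrix(corpus)

as.matrix(dtm)   # rows = documents, columns = terms, cells = term counts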
The output
<<DocumentTermMatrix (documents: 299, terms: 1297)>>
Non-/sparse entries: 6242/381561
Sparsity : 98%
Maximal term length: 19
Weighting : term frequency (tf)
The output signifies that the DTM has 299 documents (entries) and 1297 terms that appear at least once.
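If you want to verify these numbers yourself, here is a small sketch (assuming frequencies is the DTM created above and the tm package is loaded):
dim(frequencies)      # 299 1297  -> rows are documents, columns are terms
nDocs(frequencies)    # 299 documents
nTerms(frequencies)   # 1297 distinct terms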
sparse = removeSparseTerms(frequencies, 0.97)
Now you are removing those terms which don't appear very often in your data: any term that does not appear in at least 3% of the documents (1 - 0.97) is dropped. Relating to the DTM created above, we are basically removing the columns that are non-zero in the fewest documents.
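To see what the 0.97 threshold means, here is an illustrative sketch (this is not the internal implementation of removeSparseTerms, just the same idea spelled out, assuming frequencies from above):
# Count in how many documents each term occurs at least once
doc_freq <- colSums(as.matrix(frequencies) > 0)
# Keep a term only if it occurs in more than (1 - 0.97) = 3% of the documents
kept <- doc_freq > (1 - 0.97) * nDocs(frequencies)
sum(kept)    # should match the 166 terms reported by removeSparseTerms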
Now if you look at the output
> sparse
<<DocumentTermMatrix (documents: 299, terms: 166)>>
Non-/sparse entries: 3773/45861
Sparsity : 92%
Maximal term length: 10
Weighting : term frequency (tf)
The number of entries (documents) is still the same, i.e. 299, but the number of terms has dropped to 166.
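You can compare the term sets before and after pruning directly; a quick sketch (assuming both matrices from above):
length(Terms(frequencies))   # 1297 terms before pruning
length(Terms(sparse))        # 166 terms after pruning
head(Terms(sparse))          # a peek at the surviving, more common terms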
Non-/sparse entries: 6242/381561
Sparsity : 98%
This reads as follows: 381561 cells in frequencies are 0, while 6242 cells have non-zero values. 98% of all cells are zero, which is 381561/(381561+6242) rounded.
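The percentage is just that ratio, worked out in a line of R:
381561 / (381561 + 6242)   # ~0.984, which the summary rounds and prints as 98%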
removeSparseTerms(frequencies, 0.97) removes those terms in frequencies for which at least 97% of all cells are zero, i.e. terms which are quite uncommon in the corpus. As a result, you get a new DocumentTermMatrix with 166 terms and only 45861 zero entries.
Sparsity is a common term. In text mining you often get very large matrices in which most cells are zero. It can be much more efficient not to store every cell one by one in memory, but to store only the few non-zero entries together with their positions. You can read more about this by looking up sparse matrices.
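In fact, tm stores a DocumentTermMatrix in exactly such a sparse form (a simple triplet representation from the slam package), so only the non-zero cells and their positions live in memory. A quick sketch to peek at it (assuming frequencies from above):
length(frequencies$v)   # 6242 non-zero values actually stored
head(frequencies$i)     # row (document) indices of those values
head(frequencies$j)     # column (term) indices of those values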