Optimize log entropy calculation in sparse matrix

Tags: c++, c, math

I have a 3007 x 1644 term-document matrix (terms in rows, documents in columns). I want to weight the frequency of each term in each document, so I'm using the log-entropy formula from http://en.wikipedia.org/wiki/Latent_semantic_indexing#Term_Document_Matrix (see the entropy formula in the last row of the table).
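
For reference, this is the weighting as I read it from the linked table, where tf_ij is the count of term i in document j, gf_i is the total count of term i over all documents, and n is the number of documents. The notation is transcribed from that page and is worth checking against it:

```latex
p_{ij} = \frac{\mathrm{tf}_{ij}}{\mathrm{gf}_i}, \qquad
g_i = 1 + \sum_{j=1}^{n} \frac{p_{ij}\,\log p_{ij}}{\log n}, \qquad
a_{ij} = g_i \,\log(\mathrm{tf}_{ij} + 1)
```

Terms with p_ij = 0 contribute nothing to the sum, by the usual convention that 0 log 0 = 0.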

The results are correct, but the code takes more than 7 minutes to run. Here's the code:

int N = mat.cols();
for(int i=1;i<=mat.rows();i++){
    double gfi = sum(mat(i,colon()))(1,1); //sum of occurrence of terms
    double g =0;
    if(gfi != 0){// to avoid divide by zero error

        for(int j = 1;j<=N;j++){
            double tfij = mat(i,j);
            double pij = gfi==0?0.0:tfij/gfi;
            pij = pij + 1; //avoid log0
            double G = (pij * log(pij))/log(N);
            g = g + G;
        }
    }

    double gi = 1 - g;
    for(int j=1;j<=N;j++){
        double tfij = mat(i,j) + 1;//avoid log0
        double aij = gi * log(tfij);
        mat(i,j) = aij;
    }
}

Does anyone have ideas on how I can optimize this to make it faster? Note that mat is a RealSparseMatrix from the amlpp matrix library.

UPDATE: The code runs on Linux Mint with 4 GB RAM and a dual-core AMD Athlon II.

Running time before the change: > 7 min

After @Kerrek's answer: 4.1 sec


asked Dec 31 '13 13:12

Emmanuel John

1 Answer

Here's a very naive rewrite that removes some redundancies:

int const N = mat.cols();
double const logN = log(N);

for (int i = 1; i <= mat.rows(); ++i)
{
    double const gfi = sum(mat(i, colon()))(1, 1);  // sum of occurrence of terms
    double g = 0;

    if (gfi != 0)
    {
        for (int j = 1; j <= N; ++j)
        {
            double const pij = mat(i, j) / gfi + 1;
            g += pij * log(pij);
        }
        g /= logN;
    }

    for (int j = 1; j <= N; ++j)
    {
        mat(i,j) = (1 - g) * log(mat(i, j) + 1);
    }
}

Also make sure that the matrix data structure is sane (e.g. a flat array accessed in strides; not a bunch of dynamically allocated rows).
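
To illustrate what "a flat array accessed in strides" could look like, here is a minimal sketch of a dense row-major matrix. `FlatMatrix` and `row_sum` are hypothetical names for illustration only, not part of amlpp, and the indices are zero-based, unlike the one-based `mat(i, j)` used in the question:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical dense matrix: one contiguous buffer, row-major layout.
struct FlatMatrix {
    std::size_t rows, cols;
    std::vector<double> data;  // rows * cols elements; row i starts at i * cols

    FlatMatrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}

    // Element access is a single multiply-add into the flat buffer.
    double& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    double at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Sum one row by walking its contiguous slice of the buffer.
double row_sum(const FlatMatrix& m, std::size_t i) {
    double s = 0.0;
    const double* p = m.data.data() + i * m.cols;
    for (std::size_t j = 0; j < m.cols; ++j) s += p[j];
    return s;
}
```

Because each row is contiguous, the per-row loops in the code above would read memory sequentially, which is usually friendlier to the cache than chasing one heap allocation per row.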

Also, I think the first + 1 is a bit silly. You know that x -> x * log(x) is continuous at zero with limit zero, so you should write:

double const pij = mat(i, j) / gfi;
if (pij != 0) { g += pij * log(pij); }
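
The continuity argument can also be packaged as a small helper. This is only a sketch, and `xlogx` is a name made up for illustration:

```cpp
#include <cmath>

// Returns x * log(x), extended by continuity so that the value at 0 is 0.
// Intended for probabilities, i.e. 0 <= x <= 1.
double xlogx(double x) {
    return x > 0.0 ? x * std::log(x) : 0.0;
}
```

Note that shifting by 1 does more than avoid log(0): it changes every term. For p = 0.5, p * log(p) is about -0.347, while (p + 1) * log(p + 1) is about 0.608, so the two versions produce different weights.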

In fact, you might even write the first inner for loop like this, avoiding a division when it isn't needed:

        for (int j = 1; j <= N; ++j)
        {
            if (double pij = mat(i, j))
            {
                pij /= gfi;
                g += pij * log(pij);
            }
        }
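
Taking the zero-skipping idea one step further: zero entries contribute nothing to the sum, and log(0 + 1) = 0 means they also stay zero after weighting, so in principle only the stored non-zeros need to be visited. The sketch below assumes a generic CSR-style layout; `SparseRows` and `log_entropy_weight` are hypothetical and are not the amlpp API, which may or may not expose its non-zeros this way. It also uses the sign from the linked formula (1 plus the normalized sum), whereas the code in this thread writes `1 - g`, so check which one you intend:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CSR-style storage: the non-zeros of row i live in
// values[row_ptr[i]] .. values[row_ptr[i + 1] - 1]. Column indices are
// omitted because this weighting never needs them.
struct SparseRows {
    std::size_t cols;                  // number of documents N (assumed > 1)
    std::vector<std::size_t> row_ptr;  // size = number of rows + 1
    std::vector<double> values;        // stored term frequencies
};

// Replace each stored tf_ij by g_i * log(tf_ij + 1), in place.
void log_entropy_weight(SparseRows& m) {
    const double logN = std::log(static_cast<double>(m.cols));
    for (std::size_t i = 0; i + 1 < m.row_ptr.size(); ++i) {
        const std::size_t begin = m.row_ptr[i], end = m.row_ptr[i + 1];

        double gfi = 0.0;  // total occurrences of term i
        for (std::size_t k = begin; k < end; ++k) gfi += m.values[k];
        if (gfi == 0.0) continue;  // empty row: nothing to weight

        double g = 0.0;  // sum of p * log(p) over the stored entries
        for (std::size_t k = begin; k < end; ++k) {
            const double p = m.values[k] / gfi;
            if (p > 0.0) g += p * std::log(p);  // skip explicitly stored zeros
        }
        // Sign as in the linked formula: g_i = 1 + sum(p log p) / log N.
        const double gi = 1.0 + g / logN;

        for (std::size_t k = begin; k < end; ++k)
            m.values[k] = gi * std::log(m.values[k] + 1.0);
    }
}
```

With this layout the work is proportional to the number of non-zeros rather than rows times columns. A term spread evenly over all documents gets weight 0, and a term that occurs in a single document gets g_i = 1.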


answered Oct 13 '22 03:10

Kerrek SB
