In SVC from Sklearn, why is the training time not strictly linear to maximum iteration when label size is big?

I am doing an analysis of the relation between training time and maximum iteration in SVC. The data I use is randomly generated numbers, and I plotted the training time against the max_iter of the SVC fit. I checked the logs, and each binary classifier reached max_iter (I output all the console logs, which showed a detailed warning for each binary classifier, and counted them). I was assuming that the training time would be strictly linear in the number of iterations, but when the training data has many labels, say 40, the plot is not linear.

It seems that as the maximum iteration goes up, each iteration takes slightly less time than before. However, if we change label_size to 2 (which means each fit contains only 1 binary classifier), the line is straight.

What causes that to happen?

Here is my source code:

# -*- coding: utf-8 -*-
import numpy as np
from sklearn.svm import SVC
import time
import pandas as pd


def main(row_size, label_size):
    np.random.seed(2019)
    y = np.array([i for i in range(label_size) for j in range(row_size
                 // label_size)])
    if len(y) < row_size:
        y = np.append(y, [y[-1]] * (row_size - len(y)))
    X = np.random.rand(row_size, 300)
    print(X.shape, y.shape)
    return (X, y)


def train_svm(X, y, max_iter):
    best_params = {'C': 1}
    clf = SVC(
        C=best_params['C'],
        kernel=str('linear'),
        probability=False,
        class_weight='balanced',
        max_iter=max_iter,
        random_state=2018,
        verbose=True,
        )
    start = time.time()
    clf.fit(X, y)
    end = time.time()
    return end - start


if __name__ == '__main__':
    row_size = 20000
    m_iter = range(10, 401, 20)
    label_size = [40]
    data = {
        'label_size': [],
        'max_iter': [],
        'row_size': [],
        'time': [],
        }
    for it in m_iter:
        for l in label_size:
            (X, y) = main(row_size, l)
            t = train_svm(X, y, max_iter=it)
            data['label_size'].append(l)
            data['max_iter'].append(it)
            data['row_size'].append(row_size)
            data['time'].append(t)
            df = pd.DataFrame(data)
            df.to_csv('svc_iter.csv', index=None)

asked Sep 29 '18 23:09

Richie F.

1 Answer

Well, there could be loads of reasons for that "very slight change". Scikit-Learn doesn't do this computation natively; it's built upon different libraries, which may apply optimizations of their own.

Besides, your first graph is very close to linear!

Nevertheless, one noticeable and reasonable factor contributing to those tiny changes is the Decomposition Method in Support Vector Machines.

The idea of the decomposition methodology is to break down a complex task into several simpler, more manageable sub-tasks that can be solved with existing methods, and then to join their solutions together to solve the original problem. For SVM training, the method is an iterative process, and in each iteration only a few variables are updated.

For more details about the mathematical approach, please refer to this paper, section 6.2 The Decomposition Method.
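For reference, the problem that the decomposition method works on can be written as below. This is the standard C-SVC dual formulation (ignoring per-class weights), using the same symbols α, C and Q_ij that appear in this answer; e is the all-ones vector, l the number of training points, and K the kernel function.

```latex
\begin{aligned}
\min_{\alpha}\quad & \tfrac{1}{2}\,\alpha^{T} Q\,\alpha - e^{T}\alpha \\
\text{subject to}\quad & y^{T}\alpha = 0, \\
& 0 \le \alpha_i \le C, \quad i = 1, \dots, l, \\
\text{where}\quad & Q_{ij} = y_i\, y_j\, K(x_i, x_j).
\end{aligned}
```

Each decomposition iteration changes only a small working set of the α_i and keeps the rest fixed, which is why the cost of an iteration depends on how many variables are still active and on how many entries of Q have to be computed.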

More specifically, libsvm (which Scikit-Learn uses internally) implements two tricks for the decomposition method, called shrinking and caching.

The Shrinking idea is that an optimal solution α of the SVM dual problem may contain some bounded elements (i.e., α_i = 0 or α_i = C). These elements may already be bounded in the middle of the decomposition iterations. To save training time, the shrinking technique tries to identify and remove some bounded elements, so that a smaller optimization problem is solved.
The Caching idea is an effective technique for reducing the computational time of the decomposition method. Elements of the matrix Q are calculated as needed, and available memory (called the kernel cache) is used to store some recently used elements Q_ij. Then, some kernel elements may not need to be recalculated.

For more details about the mathematical approach, please refer to this paper, section 5 Shrinking and Caching.
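To make the two tricks more concrete, here is a toy sketch. It is not libsvm's implementation: `free_indices` and `KernelColumnCache` are hypothetical names, the shrinking rule is deliberately simplified (every bounded variable is dropped, whereas a real solver also checks gradient conditions first), and the cache stores columns of a linear kernel with a simple least-recently-used policy.

```python
from collections import OrderedDict

import numpy as np


def free_indices(alpha, C, tol=1e-12):
    """Return indices of variables strictly between the bounds 0 and C.

    Simplification: every bounded variable is dropped, without the
    gradient checks a real solver performs before shrinking.
    """
    alpha = np.asarray(alpha, dtype=float)
    return np.flatnonzero((alpha > tol) & (alpha < C - tol))


class KernelColumnCache:
    """Tiny LRU cache for linear-kernel columns (illustrative only)."""

    def __init__(self, X, max_columns):
        self.X = np.asarray(X, dtype=float)
        self.max_columns = max_columns
        self.columns = OrderedDict()  # column index -> cached column
        self.computed = 0             # number of columns actually computed

    def column(self, i):
        if i in self.columns:
            self.columns.move_to_end(i)       # mark as most recently used
            return self.columns[i]
        col = self.X @ self.X[i]              # K(x_j, x_i) for all j
        self.computed += 1
        self.columns[i] = col
        if len(self.columns) > self.max_columns:
            self.columns.popitem(last=False)  # evict least recently used
        return col


rng = np.random.RandomState(0)
X = rng.rand(50, 5)

# Shrinking: only the free variables stay in the smaller problem.
alpha = np.array([0.0, 0.3, 1.0, 0.7, 0.0])
print(free_indices(alpha, C=1.0))

# Caching: repeated requests for a cached column cost no kernel evaluations.
cache = KernelColumnCache(X, max_columns=2)
for i in [3, 3, 7, 3, 9, 7]:
    cache.column(i)
print(cache.computed)
```

Both effects make later iterations cheaper than earlier ones: fewer variables remain active, and more of the requested kernel values are already in memory.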

Technical Proof:

I repeated your experiment (that's why I asked for your code, to follow the exact same approach), with and without the shrinking and ~~caching~~ optimization.

Using Shrinking and Caching

The default value of the parameter shrinking in sklearn SVC is True. Keeping it as it is produced the following output:

with shrinking table

Plotting it gives:

with shrinking plot

Note how, at some point, the time drops noticeably, reflecting the effect of shrinking and caching.

Without Using Shrinking and Caching

Using the same exact approach, but this time setting the parameter shrinking explicitly to False as follows:

clf = SVC(
        C=best_params['C'],
        kernel=str('linear'),
        probability=False,
        class_weight='balanced',
        max_iter=max_iter,
        random_state=2018,
        verbose=True,
        shrinking=False
        )

Produced the following output:

without shrinking table

Plotting it gives:

without shrinking plot

Note how, unlike the previous plot, there is no noticeable drop in time at any point; there are only very tiny fluctuations along the entire plot.

Comparing Pearson Correlations

Pearson correlation
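A comparison like this can be computed in a few lines. The sketch below uses made-up timings, not the measurements from this experiment, and `pearson` is just a hypothetical helper around `np.corrcoef`.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(x, y)[0, 1])


# Made-up illustrative numbers, NOT the measured values.
max_iter = np.arange(10, 401, 20)
linear_time = 0.05 * max_iter + 1.0     # perfectly linear timing
bending_time = 2.0 * np.sqrt(max_iter)  # cost per iteration falls over time

print(pearson(max_iter, linear_time))
print(pearson(max_iter, bending_time))
```

With the CSV written by the question's script, the same helper could be applied to the `max_iter` and `time` columns after loading the file with pandas. A coefficient closer to 1 means the relationship is closer to linear.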

In conclusion:

Without Shrinking and ~~Caching~~ (updated later with caching), the linearity improved, although it's still not 100% linear. If you take into account that Scikit-Learn internally uses the libsvm library to handle all computations, and that this library is wrapped using C and Cython, you would have a higher tolerance in your definition of a 'linear' relationship between maximum iterations and time. Also, here is a cool discussion about why algorithms may not give the exact same running time every time.

This becomes even clearer if you plot the interval times, where you can see how the drops happen suddenly and noticeably in more than one place.

interval plot with shrinking

Without the optimization tricks, the interval times keep almost the same flow.

interval plot without shrinking
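One plausible way to derive such interval times from the recorded totals is to take the difference between consecutive measurements and divide by the step in max_iter. The sketch below assumes that reading; `interval_times` is a hypothetical helper and the numbers are made up for illustration.

```python
import numpy as np


def interval_times(max_iter, total_time):
    """Seconds per additional iteration between consecutive measurements."""
    max_iter = np.asarray(max_iter, dtype=float)
    total_time = np.asarray(total_time, dtype=float)
    return np.diff(total_time) / np.diff(max_iter)


# Made-up numbers for illustration only, not the measured values.
max_iter = [10, 30, 50, 70, 90]
total_time = [1.0, 3.0, 5.0, 6.0, 7.0]
print(interval_times(max_iter, total_time))
```

A flat sequence corresponds to a straight line in the original plot, while a sudden drop marks the point where each extra iteration became cheaper.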

Important Update

It turned out that the aforementioned reason for this issue (i.e. Shrinking and Caching) is correct, or more precisely, it's a very big factor in that phenomenon.

But the thing I missed is the following:

I was talking about Shrinking and Caching, but I missed the parameter that controls caching, cache_size, which is set by default to 200 MB.

Repeating the same simulations more than once, with the cache_size parameter set to a very small number (because zero is not accepted and throws an error) in addition to shrinking=False, resulted in an extremely close-to-linear pattern between max_iter and time:

clf = SVC(
        C=best_params['C'],
        kernel=str('linear'),
        probability=False,
        class_weight='balanced',
        max_iter=max_iter,
        random_state=2018,
        verbose=False,
        shrinking=False,
        cache_size=0.000000001
        )

final result

By the way, you don't need to set verbose=True. You can check whether the maximum iteration was reached via the ConvergenceWarning, so you can redirect those warnings to a file, which is much easier to follow. Just add this code:

import warnings, sys
def customwarn(message, category, filename, lineno, file=None, line=None):
    with open('warnings.txt', 'a') as the_file:
        the_file.write(warnings.formatwarning(message, category, filename, lineno))
warnings.showwarning = customwarn
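If only a count is needed rather than a log file, another option is to record the warnings in memory with the standard `warnings.catch_warnings(record=True)` context manager. This is a minimal sketch: `count_warnings` and `noisy` are hypothetical names, and `noisy` merely stands in for a call such as `clf.fit(X, y)`. With scikit-learn, the category to pass would be `sklearn.exceptions.ConvergenceWarning`.

```python
import warnings


def count_warnings(func, category=Warning):
    """Call func() and return (result, number of `category` warnings raised)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # keep repeated warnings too
        result = func()
    hits = [w for w in caught if issubclass(w.category, category)]
    return result, len(hits)


def noisy():
    # Stand-in for clf.fit(X, y); emits two warnings on purpose.
    warnings.warn("did not converge", UserWarning)
    warnings.warn("did not converge", UserWarning)
    return "done"


print(count_warnings(noisy, UserWarning))
```

Note that this counts only the warnings issued at the Python level during the call; it says nothing about messages a wrapped native library prints directly to the console.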

Also, you don't need to re-generate the dataset in each iteration, so take it out of the loop like this:

(X, y) = main(row_size, 40)
for it in m_iter:
    ....
    ....

Final Conclusion

The Shrinking and Caching tricks of the Decomposition Method in SVM play a significant role in improving the execution time as the number of iterations increases. Besides, there are other small players that may be contributing, such as the internal use of the libsvm library, wrapped using C and Cython, to handle all computations.

answered Oct 20 '22 08:10

Yahya
