I am trying to perform very simple computations on a huge file like counting the numbers of label for some columns or the average and standard deviation for other columns. The file is too big to fit in memory and I am currently processing it per line with:
    import csv

    unique = {key: [] for key in categorical_keys}
    means = {key: 0.0 for key in numerical_keys}
    sds = {key: 0.0 for key in numerical_keys}

    with open('input/train.csv', 'r') as read_file:
        reader = csv.DictReader(read_file, delimiter=',', quotechar='|')
        for i, row in enumerate(reader):
            for key, value in row.iteritems():
                if key in categorical_keys:
                    if row[key] not in unique[key]:
                        unique[key].extend([value])
                elif key in numerical_keys:
                    if value:
                        means[key] = (means[key]*i + float(value))/(i+1)
                        if i > 1:
                            sds[key] = (sds[key]*(i-1) + (float(value)-means[key])**2)/i
Now this seems to be too slow and I am wondering if it would be faster to process it in chunks that could fit in memory. Would it be faster? If yes, why?
Thanks for your help.
If you need to gain some speed:
Here is my draft of optimized code (not tested):
    import csv
    from collections import defaultdict

    unique = defaultdict(set)
    means = {key: 0.0 for key in numerical_keys}
    sds = {key: 0.0 for key in numerical_keys}

    with open('input/train.csv', 'r') as read_file:
        reader = csv.DictReader(read_file, delimiter=',', quotechar='|')
        for i, row in enumerate(reader):
            for key in categorical_keys:
                unique[key].add(row[key])
            for key in numerical_keys:
                try:
                    # shall throw ValueError if None or empty string
                    value = float(row[key])
                    mean_val = (means[key]*i + value)/(i+1)
                    means[key] = mean_val
                    # following fails for i <= 1 with ZeroDivisionError
                    sds[key] = (sds[key]*(i-1) + (value-mean_val)**2)/i
                except (ValueError, ZeroDivisionError):
                    pass
You use a dict with a list of unique values:

    unique = {key: [] for key in categorical_keys}

and add unique values to it as list items (this happens within the loop):

    if key in categorical_keys:
        if row[key] not in unique[key]:
            unique[key].extend([value])

You could save the membership test (checking whether the value already exists in the list) by adding the value directly into a set - the set takes care that only unique values are collected.
Using defaultdict assures that an empty set is already present for any key that has not been used yet.
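A minimal illustration of both points (the 'color' key and values are made up for the example):

```python
from collections import defaultdict

# Each missing key automatically starts as an empty set;
# set.add() silently ignores duplicates, so no membership test is needed.
unique = defaultdict(set)
for key, value in [('color', 'red'), ('color', 'blue'), ('color', 'red')]:
    unique[key].add(value)

print(sorted(unique['color']))  # ['blue', 'red']
```

Set insertion and membership are O(1) on average, while `value not in some_list` is O(n) and gets slower as the list grows.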
Your code repeatedly loops over the record keys, tests each one for its type, and then acts on it:
    if key in categorical_keys:
        if row[key] not in unique[key]:
            unique[key].extend([value])
    elif key in numerical_keys:
        if value:
            means[key] = (means[key]*i + float(value))/(i+1)
            if i > 1:
                sds[key] = (sds[key]*(i-1) + (float(value)-means[key])**2)/i
You can save these tests if your categorical_keys and numerical_keys are properly set to existing column names. Then you can loop directly over the known key names:

    for key in categorical_keys:
        unique[key].add(row[key])
    for key in numerical_keys:
        try:
            # shall throw ValueError if None or empty string
            value = float(row[key])
            means[key] = (means[key]*i + value)/(i+1)
            if i > 1:
                sds[key] = (sds[key]*(i-1) + (value-means[key])**2)/i
        except ValueError:
            pass
Your code repeatedly calculates float(value). Do it once and reuse the result.

Also, the mean is calculated and stored into means[key], and two lines later the value is looked up again. Better to store it in a local variable and use it twice - any lookup (like means[key]) costs something.
Your code tests that the value is not empty:

    elif key in numerical_keys:
        if value:
            # something here

You can replace this with code that works with the value directly. If the value is invalid, the conversion will fail and the ValueError exception will be caught and ignored. If most of your values are set, this will speed it up.
    try:
        value = float(value)
        means[key] = (means[key]*i + value)/(i+1)
        if i > 1:
            sds[key] = (sds[key]*(i-1) + (value-means[key])**2)/i
    except ValueError:
        pass
The if i > 1: condition is true in most cases, but you test it on every iteration. If you find a way to avoid this test (I did not), it gets faster too.
As you proposed, we can resolve it by catching ZeroDivisionError for i <= 1:

    try:
        # shall throw ValueError if None or empty string
        value = float(value)
        means[key] = (means[key]*i + value)/(i+1)
        # for i <= 1 this raises ZeroDivisionError
        sds[key] = (sds[key]*(i-1) + (value-means[key])**2)/i
    except (ValueError, ZeroDivisionError):
        pass
Regarding processing in chunks:

Reading the file in larger pieces sounds like an obvious win, but the underlying libraries already buffer I/O for you. Expect minor or no improvement from that alone.
I am not aware of a method of csv.reader or csv.DictReader that returns a chunk of records directly. You would have to do it yourself. This is possible; I would recommend using itertools.groupby.
Do not expect speedup from it on its own (it will slow it down a bit), but it is prerequisite for other chunk-based speedups later on.
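A minimal sketch of such a chunker (the helper name chunked and the chunk size are my own choices, not from any library API):

```python
import itertools

def chunked(rows, chunk_size=1000):
    """Yield lists of at most chunk_size consecutive items from an iterator."""
    # enumerate() attaches a running index; grouping by integer division
    # of that index collects every chunk_size consecutive rows together.
    for _, group in itertools.groupby(enumerate(rows),
                                      key=lambda pair: pair[0] // chunk_size):
        yield [row for _, row in group]

# Hypothetical usage with a csv.DictReader:
# with open('input/train.csv') as f:
#     for chunk in chunked(csv.DictReader(f, delimiter=',', quotechar='|')):
#         process(chunk)
```

Because groupby only groups consecutive items, this works on any iterator, including a csv.DictReader, without reading the whole file into memory.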
The code (currently) adds values to a set one by one. If you do it with chunk of values (the more, the better), it will be faster as each python call has some small overhead.
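For example, set.update accepts a whole iterable in a single call, replacing one add call per value with one call per chunk:

```python
unique = set()

# One Python call per chunk instead of one per value;
# duplicates inside the chunk are ignored automatically.
chunk = ['a', 'b', 'a', 'c']
unique.update(chunk)

print(sorted(unique))  # ['a', 'b', 'c']
```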
You could take advantage of the statistics package, which is likely to have optimized code (although it seems to be pure Python anyway).

In any case, as you are going to process the data in chunks, a plain statistics.mean will not work for you directly; you would have to aggregate the per-chunk results together somehow (if that is even possible).
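For the mean, at least, such aggregation is possible: the overall mean is the count-weighted average of the chunk means. A sketch (the combine_means helper is mine, assuming each chunk reports a (count, mean) pair; variance needs extra bookkeeping beyond this):

```python
def combine_means(chunk_stats):
    """Combine (count, mean) pairs from chunks into the overall mean."""
    total_count = 0
    total_sum = 0.0
    for count, mean in chunk_stats:
        total_count += count
        total_sum += count * mean   # recover the chunk's sum
    return total_sum / total_count

# Two chunks: [1, 2, 3] -> (3, 2.0) and [4, 5] -> (2, 4.5)
print(combine_means([(3, 2.0), (2, 4.5)]))  # 3.0
```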
If you calculate the values yourself, careful coding can gain some speed, mostly because you get the values in one chunk and do not have to dereference them value by value in each loop.

To me the chunking optimization seems too complex, and it is hard to predict whether it brings any value.