Python speed: processing per line vs in chunks

I am trying to perform very simple computations on a huge file, such as counting the number of labels in some columns, or computing the mean and standard deviation of other columns. The file is too big to fit in memory, and I am currently processing it line by line with:

import csv

unique = {key: [] for key in categorical_keys}
means = {key: 0.0 for key in numerical_keys}
sds = {key: 0.0 for key in numerical_keys}
with open('input/train.csv', 'r') as read_file:
    reader = csv.DictReader(read_file, delimiter=',', quotechar='|')
    for i, row in enumerate(reader):
        for key, value in row.iteritems():
            if key in categorical_keys:
                if row[key] not in unique[key]:
                    unique[key].extend([value])
            elif key in numerical_keys:
                if value:
                    means[key] = (means[key]*i + float(value))/(i+1)
                    if i > 1:
                        sds[key] = (sds[key]*(i-1) + (float(value)-means[key])**2)/i

Now this seems to be too slow, and I am wondering whether it would be faster to process the file in chunks that fit in memory. Would it be faster? If yes, why?

Thanks for your help.

asked May 04 '16 by Romain

1 Answer

Loop optimizations

If you need to gain some speed:

  • be sure you really need the speedup (otherwise you may spend too much time on a worthless task)
  • start with the loops
    • check whether some loops can be avoided
    • optimize/remove instructions inside the loop
      • each instruction counts
      • each reference counts

Here is my draft of optimized code (not tested):

import csv
from collections import defaultdict


unique = defaultdict(set)
means = {key: 0.0 for key in numerical_keys}
sds = {key: 0.0 for key in numerical_keys}
with open('input/train.csv', 'r') as read_file:
    reader = csv.DictReader(read_file, delimiter=',', quotechar='|')
    for i, row in enumerate(reader):
        for key in categorical_keys:
            unique[key].add(row[key])
        for key in numerical_keys:
            try:
                # raises ValueError for an empty or non-numeric string
                value = float(row[key])
                mean_val = (means[key]*i + value)/(i+1)
                means[key] = mean_val
                # the division below raises ZeroDivisionError for i == 0
                sds[key] = (sds[key]*(i-1) + (value-mean_val)**2)/i
            except (ValueError, ZeroDivisionError):
                pass

Collecting unique values

You use a dict with a list of unique values per key:

unique = {key: [] for key in categorical_keys}

and add unique values to it as list items (this happens inside the loop):

if key in categorical_keys:
    if row[key] not in unique[key]:
        unique[key].extend([value])

You can save that test (checking whether the value already exists in the list) by adding the value directly to a set; the set makes sure that only unique values are kept. Membership tests on a list are also O(n), while adding to a set is O(1) on average.

Using defaultdict ensures that an empty set is created automatically the first time a key is used, even if that key has not been seen before.
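For illustration, a minimal example of the defaultdict(set) behaviour (the 'city' key and its values are made up):

from collections import defaultdict

unique = defaultdict(set)
unique['city'].add('Paris')   # no need to create unique['city'] first
unique['city'].add('Paris')   # the set silently ignores the duplicate
print(unique['city'])         # {'Paris'}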

Do not test the record keys in each loop; know them in advance

Your code repeatedly loops over the record keys, tests each one for its type, and only then does something with it:

        if key in categorical_keys:
            if row[key] not in unique[key]:
                unique[key].extend([value])
        elif key in numerical_keys:
            if value:
                means[key] = (means[key]*i + float(value))/(i+1)
                if i > 1:
                    sds[key] = (sds[key]*(i-1) + (float(value)-means[key])**2)/i

You can save these tests if your categorical_keys and numerical_keys are properly set to existing column names. Then you can loop directly over the known key names:

        for key in categorical_keys:
            unique[key].add(row[key])
        for key in numerical_keys:
            try:
                # raises ValueError for an empty or non-numeric string
                value = float(row[key])
                means[key] = (means[key]*i + value)/(i+1)
                if i > 1:
                    sds[key] = (sds[key]*(i-1) + (value-means[key])**2)/i
            except ValueError:
                pass

Reuse a value once it is calculated

Your code repeatedly calculates the value:

float(value)

Do it once and reuse the result.

Also, the mean value is calculated and stored in means[key], and two lines later it is looked up again. It is better to store the value in a local variable and use it twice; every lookup (like means[key]) costs something.

Catching an exception is usually faster than a value test

Your code tests for value not being empty:

        elif key in numerical_keys:
            if value:
                # something here

You can replace it with code that works with the value directly. If the value is invalid, the conversion fails and the resulting ValueError is caught and ignored. If most values are set, this speeds things up.

            try:
                value = float(value)
                means[key] = (means[key]*i + value)/(i+1)
                if i > 1:
                    sds[key] = (sds[key]*(i-1) + (value-means[key])**2)/i
            except ValueError:
                pass

Can you avoid the if i > 1: test?

This condition is true in most cases, yet you test it in every iteration. If you find a way to avoid the test (I did not at first), that will speed things up as well.

As you proposed, we can get rid of it by catching the ZeroDivisionError that is raised when i == 0 (note that, unlike the original, this version also updates sds when i == 1):

            try:
                # raises ValueError for an empty or non-numeric string
                value = float(value)
                means[key] = (means[key]*i + value)/(i+1)
                # raises ZeroDivisionError when i == 0
                sds[key] = (sds[key]*(i-1) + (value-means[key])**2)/i
            except (ValueError, ZeroDivisionError):
                pass

Processing data in chunks

Regarding processing in chunks:

  • it definitely adds some complexity
  • it might speed the program up, if done carefully
  • it might slow it down or provide wrong results

Where chunking can improve the speed

Reading files in larger pieces

This sounds obvious, but the standard library already buffers file reads for you. Expect little or no improvement.
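For illustration only (the 1 MB buffer size below is an arbitrary example), you can pass an explicit buffer size to open(), but since it already buffers by default, gains are usually negligible:

import csv

# open() already buffers reads; an explicit, larger buffer rarely changes much
with open('input/train.csv', 'r', buffering=1024*1024) as read_file:
    reader = csv.DictReader(read_file, delimiter=',', quotechar='|')
    # ... process rows as before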

Getting CSV records in chunks

I am not aware of any method of csv.reader or csv.DictReader that returns records in chunks directly. You would have to do it yourself; this is possible, and I would recommend itertools.groupby (a sketch follows below).

Do not expect a speedup from this on its own (it will actually slow things down a bit), but it is a prerequisite for the other chunk-based speedups further on.
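This is not the answer's code, just a minimal sketch of such chunking with itertools.groupby (the chunk size of 10000 is an arbitrary example):

import csv
from itertools import groupby

CHUNK_SIZE = 10000  # arbitrary example value

with open('input/train.csv', 'r') as read_file:
    reader = csv.DictReader(read_file, delimiter=',', quotechar='|')
    # consecutive records share the same value of index // CHUNK_SIZE
    grouped = groupby(enumerate(reader), key=lambda pair: pair[0] // CHUNK_SIZE)
    for _, group in grouped:
        chunk = [row for _, row in group]
        # process `chunk` (a list of up to CHUNK_SIZE row dicts) here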

Adding chunk of values to a set

The code currently adds values to a set one by one. If you add a whole chunk of values at once (the more, the better), it will be faster, as each Python call carries a small overhead; see the sketch below.
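Assuming chunk is a list of row dicts as in the sketch above, the sets can be updated once per chunk instead of once per row:

for key in categorical_keys:
    # one set.update call per key and chunk instead of one add per row
    unique[key].update(row[key] for row in chunk)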

Calculating mean and sds values

You could take advantage of the statistics package, which might have optimized code (although it appears to be implemented in pure Python anyway).

Anyway, as you are going to process the data in chunks, a plain statistics.mean will not work for you directly; you would have to aggregate the per-chunk results somehow (if that is even possible).

If you calculate the values yourself, careful coding can give you some speedup, mostly because you get the values of a whole chunk at once and do not have to dereference them one by one in each loop iteration.
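One possible way to do the aggregation yourself (a sketch, not part of the original answer): keep a running count, sum and sum of squares per column while processing the chunks, and derive the mean and sample standard deviation at the end. Note that this textbook formula can lose precision for very large values; Welford's algorithm is numerically safer.

import math

counts = {key: 0 for key in numerical_keys}
totals = {key: 0.0 for key in numerical_keys}
totals_sq = {key: 0.0 for key in numerical_keys}

# for each chunk: accumulate the sums of every numerical column
for key in numerical_keys:
    values = [float(row[key]) for row in chunk if row[key]]
    counts[key] += len(values)
    totals[key] += sum(values)
    totals_sq[key] += sum(v*v for v in values)

# after the last chunk: derive mean and sample standard deviation
means = {}
sds = {}
for key in numerical_keys:
    n = counts[key]
    if n > 1:
        mean = totals[key] / n
        variance = (totals_sq[key] - n*mean*mean) / (n - 1)
        means[key] = mean
        sds[key] = math.sqrt(max(variance, 0.0))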

Chunking conclusions

To me, the chunking optimization seems too complex, and it is hard to predict whether it brings any value.

answered Oct 12 '22 by Jan Vlcinsky