I am trying to perform very simple computations on a huge file like counting the numbers of label for some columns or the average and standard deviation for other columns. The file is too big to fit in memory and I am currently processing it per line with:
    import csv

    unique = {key: [] for key in categorical_keys}
    means = {key: 0.0 for key in numerical_keys}
    sds = {key: 0.0 for key in numerical_keys}

    with open('input/train.csv', 'r') as read_file:
        reader = csv.DictReader(read_file, delimiter=',', quotechar='|')
        for i, row in enumerate(reader):
            for key, value in row.iteritems():
                if key in categorical_keys:
                    if row[key] not in unique[key]:
                        unique[key].extend([value])
                elif key in numerical_keys:
                    if value:
                        means[key] = (means[key]*i + float(value))/(i+1)
                        if i > 1:
                            sds[key] = (sds[key]*(i-1) + (float(value)-means[key])**2)/i
Now this seems to be too slow and I am wondering if it would be faster to process it in chunks that could fit in memory. Would it be faster? If yes, why?
Thanks for your help.
If you need to gain some speed:
Here is my draft of optimized code (not tested):
    import csv
    from collections import defaultdict

    unique = defaultdict(set)
    means = {key: 0.0 for key in numerical_keys}
    sds = {key: 0.0 for key in numerical_keys}

    with open('input/train.csv', 'r') as read_file:
        reader = csv.DictReader(read_file, delimiter=',', quotechar='|')
        for i, row in enumerate(reader):
            for key in categorical_keys:
                unique[key].add(row[key])
            for key in numerical_keys:
                try:
                    # shall throw ValueError if None or empty string
                    value = float(row[key])
                    mean_val = (means[key]*i + value)/(i+1)
                    means[key] = mean_val
                    # following fails for i <= 1 with ZeroDivisionError
                    sds[key] = (sds[key]*(i-1) + (value-mean_val)**2)/i
                except (ValueError, ZeroDivisionError):
                    pass
You use a dict with a list of unique values:

    unique = {key: [] for key in categorical_keys}

and add unique values to it as list items (this happens within the loop):

    if key in categorical_keys:
        if row[key] not in unique[key]:
            unique[key].extend([value])

You could save the membership test (checking whether the value already exists in the list) by adding the value directly into a set - the set takes care that only unique values are collected.
Using defaultdict assures that an empty set is already present for any key that has not been used yet.
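A minimal illustration of both points (the 'color' key and values are made up for the example):

```python
from collections import defaultdict

# Each missing key automatically starts as an empty set;
# set.add() silently ignores duplicates, so no membership test is needed.
unique = defaultdict(set)
for key, value in [('color', 'red'), ('color', 'blue'), ('color', 'red')]:
    unique[key].add(value)

print(sorted(unique['color']))  # ['blue', 'red']
```

Set insertion and membership are O(1) on average, while `value not in some_list` is O(n) and gets slower as the list grows.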
Your code repeatedly loops over the record keys, tests each one for its type, and then acts on it:
    if key in categorical_keys:
        if row[key] not in unique[key]:
            unique[key].extend([value])
    elif key in numerical_keys:
        if value:
            means[key] = (means[key]*i + float(value))/(i+1)
            if i > 1:
                sds[key] = (sds[key]*(i-1) + (float(value)-means[key])**2)/i
You can save these tests if your categorical_keys and numerical_keys are properly set to existing column names. Then you can loop directly over the known key names:

    for key in categorical_keys:
        unique[key].add(row[key])
    for key in numerical_keys:
        try:
            # shall throw ValueError if None or empty string
            value = float(row[key])
            means[key] = (means[key]*i + value)/(i+1)
            if i > 1:
                sds[key] = (sds[key]*(i-1) + (value-means[key])**2)/i
        except ValueError:
            pass
Your code repeatedly calculates float(value). Do it once and reuse the result.

Also, the mean is calculated and stored into means[key], and two lines later the value is looked up again. Better to store it in a local variable and use it twice - any lookup (like means[key]) costs something.
Your code tests that the value is not empty:

    elif key in numerical_keys:
        if value:
            # something here

You can replace this with code that works with the value directly. If the value is invalid, the conversion will fail and the ValueError exception will be caught and ignored. If most of your values are set, this will speed it up.
    try:
        value = float(value)
        means[key] = (means[key]*i + value)/(i+1)
        if i > 1:
            sds[key] = (sds[key]*(i-1) + (value-means[key])**2)/i
    except ValueError:
        pass
The if i > 1: condition is true in most cases, but you test it on every iteration. If you find a way to avoid this test (I did not), it gets faster too.
As you proposed, we can resolve it by catching ZeroDivisionError for i <= 1:

    try:
        # shall throw ValueError if None or empty string
        value = float(value)
        means[key] = (means[key]*i + value)/(i+1)
        # for i <= 1 this raises ZeroDivisionError
        sds[key] = (sds[key]*(i-1) + (value-means[key])**2)/i
    except (ValueError, ZeroDivisionError):
        pass
Regarding processing in chunks:

Reading the file in larger pieces sounds like an obvious win, but the underlying libraries already buffer I/O for you. Expect minor or no improvement from that alone.
I am not aware of a method of csv.reader or csv.DictReader that returns a chunk of records directly. You would have to do it yourself. This is possible; I would recommend using itertools.groupby.
Do not expect speedup from it on its own (it will slow it down a bit), but it is prerequisite for other chunk-based speedups later on.
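A minimal sketch of such a chunker (the helper name chunked and the chunk size are my own choices, not from any library API):

```python
import itertools

def chunked(rows, chunk_size=1000):
    """Yield lists of at most chunk_size consecutive items from an iterator."""
    # enumerate() attaches a running index; grouping by integer division
    # of that index collects every chunk_size consecutive rows together.
    for _, group in itertools.groupby(enumerate(rows),
                                      key=lambda pair: pair[0] // chunk_size):
        yield [row for _, row in group]

# Hypothetical usage with a csv.DictReader:
# with open('input/train.csv') as f:
#     for chunk in chunked(csv.DictReader(f, delimiter=',', quotechar='|')):
#         process(chunk)
```

Because groupby only groups consecutive items, this works on any iterator, including a csv.DictReader, without reading the whole file into memory.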
The code (currently) adds values to a set one by one. If you do it with chunk of values (the more, the better), it will be faster as each python call has some small overhead.
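For example, set.update accepts a whole iterable in a single call, replacing one add call per value with one call per chunk:

```python
unique = set()

# One Python call per chunk instead of one per value;
# duplicates inside the chunk are ignored automatically.
chunk = ['a', 'b', 'a', 'c']
unique.update(chunk)

print(sorted(unique))  # ['a', 'b', 'c']
```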
You could take advantage of the statistics package, which is likely to have optimized code (although it seems to be pure Python anyway).

In any case, as you are going to process the data in chunks, a plain statistics.mean will not work for you directly; you would have to aggregate the per-chunk results together somehow (if that is even possible).
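For the mean, at least, such aggregation is possible: the overall mean is the count-weighted average of the chunk means. A sketch (the combine_means helper is mine, assuming each chunk reports a (count, mean) pair; variance needs extra bookkeeping beyond this):

```python
def combine_means(chunk_stats):
    """Combine (count, mean) pairs from chunks into the overall mean."""
    total_count = 0
    total_sum = 0.0
    for count, mean in chunk_stats:
        total_count += count
        total_sum += count * mean   # recover the chunk's sum
    return total_sum / total_count

# Two chunks: [1, 2, 3] -> (3, 2.0) and [4, 5] -> (2, 4.5)
print(combine_means([(3, 2.0), (2, 4.5)]))  # 3.0
```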
If you calculate the values yourself, careful coding can gain some speed, mostly because you get the values in one chunk and do not have to dereference them value by value in each loop.

To me the chunking optimization seems too complex, and it is hard to predict whether it brings any value.