I just tried to run this script with Python 3.3. Unfortunately it's about twice as slow as with Python 2.7.
#!/usr/bin/env python

from sys import stdin


def main():
    for line in stdin:
        try:
            fields = line.split('"', 6)
            print(fields[5])
        except:
            pass


if __name__ == '__main__':
    main()
Here are the results:
$ time zcat access.log.gz | python3 -m cProfile ./ua.py > /dev/null
real 0m13.276s
user 0m18.977s
sys 0m0.484s
$ time zcat access.log.gz | python2 -m cProfile ./ua.py > /dev/null
real 0m6.139s
user 0m11.693s
sys 0m0.408s
Profiling shows that the additional time is spent in print:
$ zcat access.log.gz | python3 -m cProfile ./ua.py | tail -15
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:1594(_handle_fromlist)
196806 0.234 0.000 0.545 0.000 codecs.py:298(decode)
1 0.000 0.000 13.598 13.598 ua.py:3(<module>)
1 4.838 4.838 13.598 13.598 ua.py:6(main)
1 0.000 0.000 13.598 13.598 {built-in method exec}
1 0.000 0.000 0.000 0.000 {built-in method hasattr}
4300456 4.726 0.000 4.726 0.000 {built-in method print}
196806 0.312 0.000 0.312 0.000 {built-in method utf_8_decode}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
4300456 3.489 0.000 3.489 0.000 {method 'split' of 'str' objects}
$ zcat access.log.gz | python2 -m cProfile ./ua.py | tail -10
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 6.573 6.573 ua.py:3(<module>)
1 3.894 3.894 6.573 6.573 ua.py:6(main)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
4300456 2.680 0.000 2.680 0.000 {method 'split' of 'str' objects}
How can I avoid this overhead? Does it have something to do with UTF-8?
It will be slower because you are performing a large number of prints, and any extra processing incurs some performance penalty. One option is to send each item to a socket queue: the program finishes its writes first, and the console attached to the socket prints the output with a lag.
If your Python code runs too fast, a call to time.sleep() is a simple way to slow down your code.
The print() function takes zero or more arguments of any type and displays them on the screen. It sends the data to an output device, i.e. the console; writing to the console is the side effect of calling print().
Python 3 decodes data read from stdin and encodes again to stdout; it is not so much the print() function that is slower here as the unicode-to-bytes conversion and vice-versa.
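To make the decode and encode steps visible, here is a minimal sketch that uses io.TextIOWrapper around io.BytesIO objects as stand-ins for sys.stdin and sys.stdout. The sample line and the variable names are made up for illustration; only the layering (a text wrapper on top of a byte stream) mirrors what Python 3 does for the standard streams.

```python
import io

# Stand-ins for sys.stdin / sys.stdout: a text layer wrapped around a byte stream.
# The sample line is invented for this sketch.
sample = 'GET "/" "caf\u00e9"\n'
raw_in = io.BytesIO(sample.encode("utf-8"))
raw_out = io.BytesIO()

# newline="" disables newline translation so the bytes compare exactly.
text_in = io.TextIOWrapper(raw_in, encoding="utf-8", newline="")
text_out = io.TextIOWrapper(raw_out, encoding="utf-8", newline="")

line = text_in.readline()            # bytes -> str: decoding happens on read
print(line, end="", file=text_out)   # str -> bytes: encoding happens on write
text_out.flush()

decoded_type = type(line).__name__   # 'str', not 'bytes'
round_tripped = raw_out.getvalue()   # the same bytes that went in
```

Every line therefore passes through one decode and one encode, even though the script never looks at the text as anything other than a sequence of characters to split.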
In your case you probably want to bypass this and deal with bytes only; you can access the underlying BufferedIOBase implementation through the .buffer attribute:
from sys import stdin, stdout

try:
    bytes_stdin, bytes_stdout = stdin.buffer, stdout.buffer
except AttributeError:
    bytes_stdin, bytes_stdout = stdin, stdout


def main():
    for line in bytes_stdin:
        try:
            fields = line.split(b'"', 6)
            bytes_stdout.write(fields[5] + b'\n')
        except IndexError:
            pass


if __name__ == '__main__':
    main()
You'll now have to use stdout.write() as print() insists on writing to the stdout TextIOBase implementation.
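As a small illustration of why print() does not fit here, the sketch below (Python 3 only, with an in-memory stream standing in for sys.stdout and an invented sample value) shows that print() writes the repr of a bytes object, that the text layer rejects bytes outright, and that the .buffer layer accepts them unchanged.

```python
import io

# In-memory stand-in for sys.stdout; the sample value is invented.
raw_out = io.BytesIO()
text_out = io.TextIOWrapper(raw_out, encoding="utf-8", newline="")

# print() converts its argument with str(), so bytes come out as their repr.
print(b"Mozilla/5.0", file=text_out)
text_out.flush()
printed = raw_out.getvalue()

# The text layer's write() only accepts str.
try:
    text_out.write(b"Mozilla/5.0\n")
    rejected = False
except TypeError:
    rejected = True

# The underlying binary buffer takes the bytes as they are.
text_out.buffer.write(b"Mozilla/5.0\n")
final = raw_out.getvalue()
```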
Note that the .split() now uses a bytes literal b'"' and we write a bytes-literal b'\n' as well (which normally would be taken care of by print()).
The above is compatible with Python 2.6 and up. Python 2.5 doesn't support the b prefix.
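If you want to check the bytes-only logic without piping a real log through it, one option is to pull the loop into a function that takes any binary input and output stream. This is only a sketch: the function name extract_sixth_field and the sample log line are hypothetical, and the line layout (user agent as the sixth quote-delimited field) is an assumption based on the script in the question.

```python
import io


def extract_sixth_field(lines, out):
    """Write the sixth quote-delimited field of each line to a binary stream."""
    for line in lines:
        fields = line.split(b'"', 6)
        # Lines with too few quotes are skipped, like the IndexError case above.
        if len(fields) > 5:
            out.write(fields[5] + b"\n")


# Hypothetical sample input: one well-formed log line and one malformed line.
sample = io.BytesIO(
    b'1.2.3.4 - - [t] "GET / HTTP/1.1" 200 1 "-" "Mozilla/5.0"\n'
    b'short line\n'
)
result = io.BytesIO()
extract_sixth_field(sample, result)
output = result.getvalue()
```

In the real script you would call it as extract_sixth_field(bytes_stdin, bytes_stdout), using the two names defined in the answer's code.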