 

PyPy large memory usage compared to CPython

I used Python to solve SPOJ's large input test problem and ran into something very strange. I submitted the same code under both PyPy and CPython (Python 2). The results are shown below:

[screenshot: SPOJ large input test submission results]

The code ran much faster under PyPy than under CPython, as expected. But at the same time, the memory usage increased by a whopping 7 times! I searched the web but was unable to find any evidence suggesting that PyPy's memory usage should be much higher than CPython's. Could someone please explain the huge difference in memory usage?

I have also considered that it could be because of my code. Hence, I have posted my code below:

import io, sys, atexit, os

# Buffer all output in memory and flush it in one go at exit.
sys.stdout = io.BytesIO()
atexit.register(lambda: sys.__stdout__.write(sys.stdout.getvalue()))

# Slurp all of stdin into memory and serve lines from the buffer.
sys.stdin = io.BytesIO(sys.stdin.read())
raw_input = lambda: sys.stdin.readline().rstrip()

line = list(map(int, raw_input().split()))
num, k = line
ans = 0

for i in xrange(0, num):
    if int(raw_input()) % k == 0:
        ans += 1

print(ans)

Could someone please advise me?

asked Jul 15 '17 by Donald

People also ask

What is the difference between PyPy and CPython?

PyPy is a drop-in replacement for the stock Python interpreter, CPython. Whereas CPython compiles Python to intermediate bytecode that is then interpreted by a virtual machine, PyPy uses just-in-time (JIT) compilation to translate Python code into machine-native assembly language.

Why is Python taking up so much memory?

Python automatically frees objects that are no longer in use, but function calls can sometimes unexpectedly keep objects alive in memory. On top of that, storing individual integers or floats in Python carries a large per-object memory overhead.
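
As a rough illustration of that per-object overhead, here is a small sketch using sys.getsizeof (the exact numbers vary by interpreter and platform):

import sys

# Every boxed Python object carries header overhead well beyond its payload.
print(sys.getsizeof(1))      # e.g. 24 bytes on 64-bit CPython 2 for a small int
print(sys.getsizeof(1.0))    # e.g. 24 bytes for an 8-byte double
print(sys.getsizeof([]))     # even an empty list costs dozens of bytes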

How do I allocate more memory to Python?

Python doesn't limit your program's memory usage; it will allocate as much memory as your program needs, until the machine runs out. The most you can do is impose a fixed upper cap. That can be done with the resource module, but it isn't the same as allocating more.
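
For what it's worth, a minimal sketch of such a cap using the resource module (Unix-only; the 1 GiB figure is just an example):

import resource

# Lower the soft limit on the process's address space to roughly 1 GiB.
# Allocations beyond this cap will raise MemoryError instead of growing.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (1 * 1024 ** 3, hard))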


1 Answer

First, I was not able to reproduce the results; I don't know which versions/setups SPOJ uses. For the following experiments, PyPy 5.8.0 and CPython 2.7.12 were used.

As a test case, the largest possible input file, about 110 MB in size, was used:

# create_data.py
# Writes the largest possible input: 10**6 numbers, k = 33.
print 10**6, 33
for i in xrange(10**6):
    print 10**9

>> python create_data.py > input.in

Now running /usr/bin/time -v XXX solution.py < input.in (with XXX standing for the interpreter) yields:

Interpreter     Maximal resident set size
PyPy:                 278 MB
CPython:              222 MB

PyPy needs a little more memory. CPython and PyPy use different garbage-collection strategies, and I think PyPy's trade-off is to be faster but to use more memory. The PyPy developers have a great article about their garbage collector and how it compares to CPython's.
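
To make the difference in strategy concrete, here is a minimal sketch (Python 2, names chosen just for illustration) of one observable consequence: CPython's reference counting frees unreachable objects immediately, while PyPy frees them only when its GC actually runs:

import gc, weakref

class Blob(object):
    pass

b = Blob()
r = weakref.ref(b)
del b                # the Blob instance is now unreachable

print(r() is None)   # True on CPython (freed immediately), usually False on PyPy
gc.collect()         # force a collection
print(r() is None)   # True on both interpreters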


Second, I don't trust the numbers from the SPOJ site. sys.stdin.read() will read the whole file into memory. The Python documentation even says:

To read a file’s contents, call f.read(size), which reads some quantity of data and returns it as a string. size is an optional numeric argument. When size is omitted or negative, the entire contents of the file will be read and returned; it’s your problem if the file is twice as large as your machine’s memory.

Under the assumption that the worst case was included in their test cases, the memory usage should be at least the size of the file (110 MB), because you use sys.stdin.read(), and even twice the size, because you are copying the data.
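
A minimal sketch of where that second copy comes from in the setup used above:

import io, sys

data = sys.stdin.read()        # first copy: the entire input as one str
sys.stdin = io.BytesIO(data)   # second copy: BytesIO buffers the bytes again

In the original one-liner the intermediate string is unnamed and can be freed once the BytesIO exists, but the peak usage is still roughly twice the file size.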


Actually, I'm not sure the whole trouble is worth it; using raw_input() is probably fast enough, and I would just trust Python to do The Right Thing. CPython normally buffers stdout and stdin (fully buffered if they are redirected to files, or line-buffered for the console), and you have to use the command-line option -u to switch that off.

But if you really want to be sure, you can use the file-object iterator of sys.stdin, because, as the CPython man page states:

-u Force stdin, stdout and stderr to be totally unbuffered. On systems where it matters, also put stdin, stdout and stderr in binary mode. Note that there is internal buffering in xreadlines(), readlines() and file-object iterators ("for line in sys.stdin") which is not influenced by this option. To work around this, you will want to use "sys.stdin.readline()" inside a "while 1:" loop.
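
For completeness, the man page's fully unbuffered pattern would look something like this sketch:

import sys

# readline() in an explicit loop bypasses the iterator's internal
# read-ahead buffer, so -u actually takes full effect.
while 1:
    line = sys.stdin.readline()
    if not line:   # an empty string signals EOF
        break
    # ... process line ...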

Relying on the file-object iterator's buffering instead, your program could look like this:

import sys

num, k = map(int, raw_input().split())
ans = 0
# Iterate over stdin line by line; the iterator's internal buffering
# keeps this fast without holding the whole file in memory.
for line in sys.stdin:
    if int(line) % k == 0:
        ans += 1
print(ans)

This has the big advantage that only around 7 MB of memory is used for this variant.

Another lesson is that you should not use sys.stdin.readline() if you are afraid that somebody will run your program in unbuffered mode.


Some further experiments (with my CPU clocked down):

                 CPython          CPython -u       PyPy            PyPy -u
original         28sec/221MB      25sec/221MB      3sec/278MB      3sec/278MB
raw_input()      29sec/7MB        110sec/7MB       7sec/75MB       100sec/63MB
readline()       38sec/7MB        130sec/7MB       5sec/75MB       100sec/63MB
readlines()      20sec/560MB      20sec/560MB      4sec/1.4GB      4sec/1.4GB
file-iterator    17sec/7MB        17sec/7MB        4sec/68MB       100sec/62MB

There are some takeaways:

  1. raw_input() and sys.stdin.readline() have very similar performance.
  2. raw_input() is buffered, but its buffer seems to differ slightly from the one used by the file-object iterator, which outperforms raw_input(), at least for this file.
  3. The memory overhead of sys.stdin.readlines() seems to be pretty high, at least as long as the lines are short (see the sketch after this list).
  4. The file-object iterator behaves differently in CPython and PyPy when option -u is used: for PyPy, -u also switches off the buffering of the file-object iterator (maybe a bug?).
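
For reference, the readlines() variant from the table is only a one-line change from the iterator version, which makes its memory column all the more striking (same Python 2 setting as above):

import sys

num, k = map(int, raw_input().split())
ans = 0
# readlines() materializes every line as a separate str object in one
# big list, which explains the 560 MB / 1.4 GB peaks in the table.
for line in sys.stdin.readlines():
    if int(line) % k == 0:
        ans += 1
print(ans)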
answered Oct 22 '22 by ead