 

How to fix the constantly growing memory usage of ray?

Tags: python, memory, ray
I started using ray for distributed machine learning and I already have some issues. The memory usage is simply growing until the program crashes. Although I clear the list constantly, the memory is somehow leaking. Any idea why?

My specs:

OS Platform and Distribution: Ubuntu 16.04
Ray installed from: binary
Ray version: 0.6.5
Python version: 3.6.8

I already tried using the experimental queue instead of the DataServer class, but the problem is still the same.
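That variant looked roughly like this (a sketch; the queue lives at ray.util.queue.Queue in newer Ray releases, while 0.6.x exposed it under an experimental module, so the import path here is an assumption):

import numpy as np
import ray
from ray.util.queue import Queue  # module path differs in Ray 0.6.x

ray.init()

# A bounded queue applies backpressure: put() blocks once maxsize items
# are waiting, so the producer cannot outrun the consumers indefinitely.
queue = Queue(maxsize=1000)


@ray.remote
def producer(queue):
    while True:
        queue.put(np.ones(10))


@ray.remote
def consumer(queue):
    while True:
        queue.get()


producer.remote(queue)
consumer.remote(queue)
consumer.remote(queue)

And here is the full reproduction with the DataServer actor: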

import numpy as np
import ray
import time

ray.init(redis_max_memory=100000000)


@ray.remote
class Runner:
    def __init__(self, dataList):
        self.run(dataList)

    def run(self, dataList):
        # Push data to the server as fast as possible.
        while True:
            dataList.put.remote(np.ones(10))


@ray.remote
class Optimizer:
    def __init__(self, dataList):
        self.optimize(dataList)

    def optimize(self, dataList):
        # Pop data from the server as fast as possible.
        while True:
            dataList.pop.remote()


@ray.remote
class DataServer:
    def __init__(self):
        self.dataList = []

    def put(self, data):
        self.dataList.append(data)

    def pop(self):
        if len(self.dataList) != 0:
            return self.dataList.pop()

    def get_size(self):
        return len(self.dataList)


dataServer = DataServer.remote()
runner = Runner.remote(dataServer)
optimizer1 = Optimizer.remote(dataServer)
optimizer2 = Optimizer.remote(dataServer)

# Report the list length once a second.
while True:
    time.sleep(1)
    print(ray.get(dataServer.get_size.remote()))

After running for some time I get this error message:

[error screenshot from the original post not reproduced]

asked Apr 18 '19 by TRZUKLO

2 Answers

I recently ran into a similar problem and found that if you are frequently putting large objects (using ray.put()), you need to either:

  1. Manually adjust the thresholds that the Python garbage collector uses (see the sketch after this list), or

  2. Call gc.collect() on a regular basis.
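
For option 1, a minimal sketch of lowering the thresholds (the exact values here are hypothetical; tune them for your workload):

import gc

# The defaults are roughly (700, 10, 10): a generation-0 collection runs
# after allocations exceed deallocations by 700. Lowering the thresholds
# makes collections run more often, which helps when objects are few but large.
print(gc.get_threshold())
gc.set_threshold(100, 5, 5)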

I implemented a method that checks the amount of used memory and then calls the garbage collector.

The problem is that the default thresholds are based on the number of objects, not their size, so if you are putting large objects, the garbage collector may never run before you run out of memory. My utility method is as follows:

import gc

import psutil


def auto_garbage_collect(pct=80.0):
    """
    Call the garbage collector if memory used is greater than pct percent
    of total available memory. This is called to deal with an issue in Ray
    not freeing up used memory.

    pct - Default value of 80%. Amount of memory in use that triggers the
          garbage collection call.
    """
    if psutil.virtual_memory().percent >= pct:
        gc.collect()

Calling this will solve the problem when it is related to pushing large objects via ray.put() and running out of memory.
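
A hypothetical usage pattern (this loop is illustrative, not from the original code, and assumes the auto_garbage_collect() helper above):

import numpy as np
import ray

ray.init()

for step in range(10000):
    ray.put(np.ones((1000, 1000)))  # frequently putting large objects
    auto_garbage_collect()          # collect once memory use passes 80%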

answered Oct 27 '22 by Michael Wade


A quick fix is to use:

    ray.shutdown()

I code in Spyder, which displays the percentage of memory used in the bottom-right corner. When I ran the same script multiple times, I noticed that the memory percentage increased in increments of about 3% (of the 8 GB of RAM I have). This made me wonder whether Ray was storing something like a session, with each increment corresponding to one session.

It turns out that it does.

ray.shutdown() ends the session. However, you need to call ray.init() again if you want to run your script again. Also, make sure you place the call in the correct location so that you do not shut Ray down while it is still needed.
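
One placement that works is a try/finally around the workload, so the session is always torn down even if the script raises (a sketch):

import ray

ray.init()
try:
    # ... run the workload here ...
    pass
finally:
    ray.shutdown()  # always end the session so its memory is released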

This solves the problem of memory usage increasing when running a script several times.

I do not know Ray very well, but ray.init() has various arguments relating to addresses. I am sure there must be a way to make Ray reuse the same session via one of these arguments. This is speculation; I have not attempted any of it yet. Perhaps you can figure this out?
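
If you want to explore that route, newer Ray versions let ray.init() attach to an already running instance instead of starting a new one (whether this avoids the per-run growth is, again, speculation):

import ray

# Connect to an existing Ray instance rather than starting a fresh session.
# Newer releases accept address="auto"; Ray 0.6.x used a redis_address argument.
ray.init(address="auto")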

answered Oct 27 '22 by Dylan Solms