
Split a Python List into Chunks with Maximum Memory Size

Given a python list of bytes values:

# actual bytes values unimportant
[
    b'foo',
    b'bar',
    b'baz',
    ...
]

How can the list be broken into chunks where each chunk has the maximum memory size below a certain ceiling?

For example: if the ceiling were 7 bytes, then the original list would be broken up into a list of lists

[
    [b'foo', b'bar'], # sublist 0
    [b'baz'], # sublist 1
    ...
]

And each sublist would be at most 7 bytes, based on the accumulated length of its contents.

Note: each sub-list should be maximally packed, in the order of the original list. In the example above, the first two values were grouped because that is the maximum possible under the 7-byte limit.

Thank you in advance for your consideration and response.

Asked Mar 31 '20 by Ramón J Romero y Vigil




4 Answers

The problem of optimally splitting a sequence so that the elements satisfy a given max/min condition, while keeping the order of the elements, can be solved greedily. Hence, you only need to iterate over the input sequence once and maintain a buffer of elements. In Python, this can be elegantly coded with a generator, which has the advantage of not needing to materialize the whole result in memory.

The bulk of the algorithm for your problem is as follows:

def split_by_size(items, max_size, get_size=len):
    buffer = []
    buffer_size = 0
    for item in items:
        item_size = get_size(item)
        if buffer_size + item_size <= max_size:
            # the current item still fits into the current chunk
            buffer.append(item)
            buffer_size += item_size
        else:
            # emit the current chunk and start a new one with this item
            yield buffer
            buffer = [item]
            buffer_size = item_size
    if buffer_size > 0:
        # emit the last, partially filled chunk
        yield buffer

where the last parameter delegates determining the size of a given item to the specified callable. I will not dwell upon this, but I will assume that a simple len() will do. This also assumes that each element, individually, satisfies the condition; otherwise, that case should be handled separately.

Testing the above code:

import random


k = 10
n = 15
max_size = 10

random.seed(0)
items = [b'x' * random.randint(1, 2 * k // 3) for _ in range(n)]
print(items)
# [b'xxxx', b'xxxx', b'x', b'xxx', b'xxxxx', b'xxxx', b'xxxx', b'xxx', b'xxxx', b'xxx', b'xxxxx', b'xx', b'xxxxx', b'xx', b'xxx']

print(list(split_by_size(items, k)))
# [[b'xxxx', b'xxxx', b'x'], [b'xxx', b'xxxxx'], [b'xxxx', b'xxxx'], [b'xxx', b'xxxx', b'xxx'], [b'xxxxx', b'xx'], [b'xxxxx', b'xx', b'xxx']]
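If the ceiling should apply to actual in-memory size rather than raw byte length, sys.getsizeof() can be passed as get_size. A standalone sketch (repeating the generator above so the snippet runs on its own; the ceiling value is just an illustrative choice):

```python
import sys

def split_by_size(items, max_size, get_size=len):
    buffer, buffer_size = [], 0
    for item in items:
        item_size = get_size(item)
        if buffer_size + item_size <= max_size:
            buffer.append(item)
            buffer_size += item_size
        else:
            yield buffer
            buffer, buffer_size = [item], item_size
    if buffer_size > 0:
        yield buffer

items = [b'foo', b'bar', b'baz']

# ceiling on raw byte length:
print(list(split_by_size(items, 7)))
# [[b'foo', b'bar'], [b'baz']]

# ceiling on in-memory size (includes per-object overhead);
# here all three items have equal getsizeof, so two fit per chunk:
ceiling = 2 * sys.getsizeof(b'foo')
print(list(split_by_size(items, ceiling, get_size=sys.getsizeof)))
# [[b'foo', b'bar'], [b'baz']]
```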

Also, if you are willing to store the result of the split in a list anyway, the code for the above approach can be made slightly more compact:

def chunks_by_size(items, max_size, get_size=len):
    result = []
    size = max_size + 1
    for item in items:
        item_size = get_size(item)
        size += item_size
        if size > max_size:
            result.append([])
            size = item_size
        result[-1].append(item)
    return result

but also slightly slower (see benchmarks below).
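As a quick sanity check, the list-based variant reproduces the example from the question (repeating the function so the snippet is standalone):

```python
def chunks_by_size(items, max_size, get_size=len):
    result = []
    size = max_size + 1  # forces a new sublist on the first item
    for item in items:
        item_size = get_size(item)
        size += item_size
        if size > max_size:
            result.append([])
            size = item_size
        result[-1].append(item)
    return result

print(chunks_by_size([b'foo', b'bar', b'baz'], 7))
# [[b'foo', b'bar'], [b'baz']]
```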


You could also think of using functools.reduce() (basically the same as @NizamMohamed's answer), and the code will be shorter but perhaps also less readable:

import functools


def chunks_by_size_reduce(items, size, get_size=len):
    return functools.reduce(
        lambda a, b, size=size:
            a[-1].append(b) or a
            if a and sum(get_size(x) for x in a[-1]) + get_size(b) <= size
            else a.append([b]) or a, items, [])

and certainly less efficient, as get_size() is called for every element of the "candidate" inner list for every element considered, which makes this O(n k), with k being the average number of elements in each sub-sequence. For some timings, see the benchmarks below.


I would not be surprised if there were a solution using itertools.accumulate(), but it would also be bound to be quite slow.
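For completeness, here is one way such an itertools.accumulate()-based solution could look (a sketch, not benchmarked; it assumes item sizes are positive, since a running total equal to the current item's size is taken to mean the accumulator just reset):

```python
from itertools import accumulate

def split_by_size_accumulate(items, max_size, get_size=len):
    items = list(items)
    sizes = [get_size(item) for item in items]
    # running chunk size: grow while the next item fits,
    # otherwise reset the accumulator to just that item's size
    running = accumulate(
        sizes, lambda acc, s: acc + s if acc + s <= max_size else s)
    result = []
    for item, size, total in zip(items, sizes, running):
        if total == size:  # the accumulator just reset: start a new chunk
            result.append([])
        result[-1].append(item)
    return result

print(split_by_size_accumulate([b'foo', b'bar', b'baz'], 7))
# [[b'foo', b'bar'], [b'baz']]
```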


The simplest way to speed things up would be to use Cython or Numba. Here, both were applied to split_by_size(); in both cases, the code is unchanged.

Benchmarking all this we obtain (_cy stands for the Cython-compiled version while _nb stands for the Numba-compiled version):

%timeit list(split_by_size(items * 100000, k + 1))
# 10 loops, best of 3: 281 ms per loop
%timeit list(split_by_size_cy(items * 100000, k + 1))
# 10 loops, best of 3: 181 ms per loop
%timeit list(split_by_size_nb(items * 100000, k + 1))
# 100 loops, best of 3: 5.17 ms per loop
%timeit chunks_by_size(items * 100000, k + 1)
# 10 loops, best of 3: 318 ms per loop
%timeit chunks_by_size_reduce(items * 100000, k + 1)
# 1 loop, best of 3: 1.18 s per loop

Note that while the Numba-compiled version is much faster than the alternatives, it is also the most brittle, since it requires the forceobj flag to be set to True, which may lead to unstable execution.

Anyway, I hardly believe this would be a bottleneck if the final goal is to send the grouped items through some I/O operation.


Note that the algorithm is pretty much the same as other answers, I just find the code here a bit cleaner.

Answered Oct 11 '22 by norok2


This solution uses functools.reduce:

from functools import reduce

l = [b'abc', b'def', b'ghi', b'jklm', b'nopqrstuv', b'wx', b'yz']

reduce(lambda a, b, size=7: a[-1].append(b) or a if a and sum(len(x) for x in a[-1]) + len(b) <= size else a.append([b]) or a, l, [])

Here a is the accumulator (a list of sublists, initially empty) and b is an item from the original list.

if a and sum(len(x) for x in a[-1]) + len(b) <= size
checks that a is not empty and that the total length of the bytes in the last sublist, plus the length of b, does not exceed size.

a[-1].append(b) or a
appends b to the last sublist of a and returns a when the condition is True.

a.append([b]) or a
wraps b in a new list, appends that list to a, and returns a.

Output:

[[b'abc', b'def'], [b'ghi', b'jklm'], [b'nopqrstuv'], [b'wx', b'yz']]
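The same accumulator logic is easier to follow when the lambda is unfolded into a named step function (an equivalent sketch):

```python
from functools import reduce

def step(acc, item, size=7):
    # append to the last sublist if the item still fits,
    # otherwise start a new sublist with just this item
    if acc and sum(len(x) for x in acc[-1]) + len(item) <= size:
        acc[-1].append(item)
    else:
        acc.append([item])
    return acc

l = [b'abc', b'def', b'ghi', b'jklm', b'nopqrstuv', b'wx', b'yz']
print(reduce(step, l, []))
# [[b'abc', b'def'], [b'ghi', b'jklm'], [b'nopqrstuv'], [b'wx', b'yz']]
```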
Answered Oct 11 '22 by Nizam Mohamed


The simple, naïve approach would be:

import sys
import numpy as np

# init input data - as per the comments, data type does matter, 
# for memory calculation, and for the sake of example -
# string is probably the easiest case:

lts = list("abcdefghijklmnopqrstuvwxyz")

data = [
    {letter: "".join(np.random.choice(lts, np.random.randint(100, 700)))}
    for letter in lts
]

# parameters setup:

threshold = 1024
buffer = []
buffer_len = 0
res_data = []

for el in data:
    # assumes one key and one value per dictionary (as in the question)
    len_ = sys.getsizeof(list(el.values())[0])
    if buffer_len + len_ > threshold:
        res_data.append(buffer)
        buffer = [el]
        buffer_len = len_
    else:
        buffer.append(el)
        buffer_len += len_

if buffer_len > 0:
    res_data.append(buffer)

print(res_data)
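Note that sys.getsizeof() (used above) and len() measure different things: for a bytes object, getsizeof includes CPython's per-object overhead on top of the payload, so which one to pass as the ceiling metric depends on whether you care about payload size or actual memory footprint. A quick illustration:

```python
import sys

payload = b'foo'
print(len(payload))            # 3: the number of bytes in the payload
print(sys.getsizeof(payload))  # larger: payload plus CPython object overhead
```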
Answered Oct 12 '22 by Grzegorz Skibinski


Keeping it short and sweet:

l = [b'foo', b'bar', b'baz']

thresh = 7
out = []
cur_size = 0
for x in l:
    if len(x) > thresh:
        raise ValueError("str too big")
    if cur_size + len(x) > thresh:
        cur_size = 0
    if cur_size == 0:
        out.append([])
    out[-1].append(x)
    cur_size += len(x)

print(out)

This will output:

[[b'foo', b'bar'], [b'baz']]

That should be what you want, if I understood correctly. It's very simple: it appends the strings to the current sublist while tracking the combined size of everything in it; if adding the next item would push that size over the threshold, it starts a new sublist.

Answered Oct 12 '22 by xilpex