
Split a Python List into Chunks with Maximum Memory Size

Given a python list of bytes values:

# actual bytes values unimportant
[
    b'foo',
    b'bar',
    b'baz',
    ...
]

How can the list be broken into chunks where each chunk has the maximum memory size below a certain ceiling?

For example: if the ceiling were 7 bytes, then the original list would be broken up into a list of lists

[
    [b'foo', b'bar'], # sublist 0
    [b'baz'], # sublist 1
    ...
]

And each sublist would be at most 7 bytes, based on the accumulated length of its contents.

Note: each sub-list should be maximally packed, in the order of the original list. In the example above, the first two values were grouped because that is the maximum possible under the 7-byte limit.

Thank you in advance for your consideration and response.

Asked Mar 31 '20 by Ramón J Romero y Vigil




4 Answers

The problem of optimally splitting a sequence so that the elements satisfy a given max/min condition, while keeping the order of the elements, can be solved greedily. Hence, you only need to iterate over the input sequence once and maintain a buffer of elements. In Python, this can be elegantly coded with a generator, which has the advantage of not needing to materialize the whole result in memory.

The bulk of the algorithm for your problem is as follows:

def split_by_size(items, max_size, get_size=len):
    buffer = []
    buffer_size = 0
    for item in items:
        item_size = get_size(item)
        if buffer_size + item_size <= max_size:
            # the current item still fits into the current chunk
            buffer.append(item)
            buffer_size += item_size
        else:
            # emit the current chunk and start a new one with this item
            yield buffer
            buffer = [item]
            buffer_size = item_size
    if buffer_size > 0:
        # emit the last, partially filled chunk
        yield buffer

where the last parameter delegates determining the size of a given item to the specified callable. I will not dwell upon this, but I will assume that a simple len() will do. This also assumes that each element, individually, satisfies the condition; otherwise, that case should be handled separately.

Testing the above code:

import random


k = 10
n = 15
max_size = 10

random.seed(0)
items = [b'x' * random.randint(1, 2 * k // 3) for _ in range(n)]
print(items)
# [b'xxxx', b'xxxx', b'x', b'xxx', b'xxxxx', b'xxxx', b'xxxx', b'xxx', b'xxxx', b'xxx', b'xxxxx', b'xx', b'xxxxx', b'xx', b'xxx']

print(list(split_by_size(items, k)))
# [[b'xxxx', b'xxxx', b'x'], [b'xxx', b'xxxxx'], [b'xxxx', b'xxxx'], [b'xxx', b'xxxx', b'xxx'], [b'xxxxx', b'xx'], [b'xxxxx', b'xx', b'xxx']]
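If the ceiling should apply to actual in-memory size rather than raw byte length, sys.getsizeof() can be passed as get_size. A standalone sketch (repeating the generator above so the snippet runs on its own; the ceiling value is just an illustrative choice):

```python
import sys

def split_by_size(items, max_size, get_size=len):
    buffer, buffer_size = [], 0
    for item in items:
        item_size = get_size(item)
        if buffer_size + item_size <= max_size:
            buffer.append(item)
            buffer_size += item_size
        else:
            yield buffer
            buffer, buffer_size = [item], item_size
    if buffer_size > 0:
        yield buffer

items = [b'foo', b'bar', b'baz']

# ceiling on raw byte length:
print(list(split_by_size(items, 7)))
# [[b'foo', b'bar'], [b'baz']]

# ceiling on in-memory size (includes per-object overhead);
# here all three items have equal getsizeof, so two fit per chunk:
ceiling = 2 * sys.getsizeof(b'foo')
print(list(split_by_size(items, ceiling, get_size=sys.getsizeof)))
# [[b'foo', b'bar'], [b'baz']]
```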

Also, if you are willing to store the result of the split in a list anyway, the code for the above approach can be made slightly more compact:

def chunks_by_size(items, max_size, get_size=len):
    result = []
    size = max_size + 1
    for item in items:
        item_size = get_size(item)
        size += item_size
        if size > max_size:
            result.append([])
            size = item_size
        result[-1].append(item)
    return result

but also slightly slower (see benchmarks below).
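As a quick sanity check, the list-based variant reproduces the example from the question (repeating the function so the snippet is standalone):

```python
def chunks_by_size(items, max_size, get_size=len):
    result = []
    size = max_size + 1  # forces a new sublist on the first item
    for item in items:
        item_size = get_size(item)
        size += item_size
        if size > max_size:
            result.append([])
            size = item_size
        result[-1].append(item)
    return result

print(chunks_by_size([b'foo', b'bar', b'baz'], 7))
# [[b'foo', b'bar'], [b'baz']]
```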


You could also think of using functools.reduce() (basically the same as @NizamMohamed's answer), and the code will be shorter but perhaps also less readable:

import functools


def chunks_by_size_reduce(items, size, get_size=len):
    return functools.reduce(
        lambda a, b, size=size:
            a[-1].append(b) or a
            if a and sum(get_size(x) for x in a[-1]) + get_size(b) <= size
            else a.append([b]) or a, items, [])

and certainly less efficient, as get_size() is called for every element of the "candidate" inner list for every element considered, which makes this O(n k), with k being the average number of elements in each sub-sequence. For some timings, see the benchmarks below.


I would not be surprised if there were a solution using itertools.accumulate(), but it would also be bound to be quite slow.
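For completeness, here is one way such an itertools.accumulate()-based solution could look (a sketch, not benchmarked; it assumes item sizes are positive, since a running total equal to the current item's size is taken to mean the accumulator just reset):

```python
from itertools import accumulate

def split_by_size_accumulate(items, max_size, get_size=len):
    items = list(items)
    sizes = [get_size(item) for item in items]
    # running chunk size: grow while the next item fits,
    # otherwise reset the accumulator to just that item's size
    running = accumulate(
        sizes, lambda acc, s: acc + s if acc + s <= max_size else s)
    result = []
    for item, size, total in zip(items, sizes, running):
        if total == size:  # the accumulator just reset: start a new chunk
            result.append([])
        result[-1].append(item)
    return result

print(split_by_size_accumulate([b'foo', b'bar', b'baz'], 7))
# [[b'foo', b'bar'], [b'baz']]
```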


The simplest way to speed things up would be to use Cython or Numba. Here, both were applied to split_by_size(); in both cases, the code is unchanged.

Benchmarking all this we obtain (_cy stands for the Cython-compiled version while _nb stands for the Numba-compiled version):

%timeit list(split_by_size(items * 100000, k + 1))
# 10 loops, best of 3: 281 ms per loop
%timeit list(split_by_size_cy(items * 100000, k + 1))
# 10 loops, best of 3: 181 ms per loop
%timeit list(split_by_size_nb(items * 100000, k + 1))
# 100 loops, best of 3: 5.17 ms per loop
%timeit chunks_by_size(items * 100000, k + 1)
# 10 loops, best of 3: 318 ms per loop
%timeit chunks_by_size_reduce(items * 100000, k + 1)
# 1 loop, best of 3: 1.18 s per loop

Note that while the Numba-compiled version is much faster than the alternatives, it is also the most brittle, since it requires the forceobj flag to be set to True, which may lead to unstable execution.

Anyway, I hardly believe this would be a bottleneck if the final goal is to send the grouped items through some I/O operation.


Note that the algorithm is pretty much the same as other answers, I just find the code here a bit cleaner.

Answered Oct 11 '22 by norok2


This solution uses functools.reduce:

from functools import reduce

l = [b'abc', b'def', b'ghi', b'jklm', b'nopqrstuv', b'wx', b'yz']

reduce(lambda a, b, size=7: a[-1].append(b) or a if a and sum(len(x) for x in a[-1]) + len(b) <= size else a.append([b]) or a, l, [])

Here a is the accumulator (a list of sublists, initially empty) and b is an item from the original list.

if a and sum(len(x) for x in a[-1]) + len(b) <= size
checks that a is not empty and that the total length of the bytes in the last sublist, plus the length of b, does not exceed size.

a[-1].append(b) or a
appends b to the last sublist of a and returns a when the condition is True.

a.append([b]) or a
wraps b in a new list, appends that list to a, and returns a.

Output:

[[b'abc', b'def'], [b'ghi', b'jklm'], [b'nopqrstuv'], [b'wx', b'yz']]
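The same accumulator logic is easier to follow when the lambda is unfolded into a named step function (an equivalent sketch):

```python
from functools import reduce

def step(acc, item, size=7):
    # append to the last sublist if the item still fits,
    # otherwise start a new sublist with just this item
    if acc and sum(len(x) for x in acc[-1]) + len(item) <= size:
        acc[-1].append(item)
    else:
        acc.append([item])
    return acc

l = [b'abc', b'def', b'ghi', b'jklm', b'nopqrstuv', b'wx', b'yz']
print(reduce(step, l, []))
# [[b'abc', b'def'], [b'ghi', b'jklm'], [b'nopqrstuv'], [b'wx', b'yz']]
```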
Answered Oct 11 '22 by Nizam Mohamed


The simple, naïve approach would be:

import sys
import numpy as np

# init input data - as per the comments, data type does matter, 
# for memory calculation, and for the sake of example -
# string is probably the easiest case:

lts = list("abcdefghijklmnopqrstuvwxyz")

data = [
    {letter: "".join(np.random.choice(lts, np.random.randint(100, 700)))}
    for letter in lts
]

# parameters setup:

threshold = 1024
buffer = []
buffer_len = 0
res_data = []

for el in data:
    # assumes one key and one value per dictionary (as in the question)
    len_ = sys.getsizeof(list(el.values())[0])
    if buffer_len + len_ > threshold:
        res_data.append(buffer)
        buffer = [el]
        buffer_len = len_
    else:
        buffer.append(el)
        buffer_len += len_

if buffer_len > 0:
    res_data.append(buffer)

print(res_data)
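Note that sys.getsizeof() (used above) and len() measure different things: for a bytes object, getsizeof includes CPython's per-object overhead on top of the payload, so which one to pass as the ceiling metric depends on whether you care about payload size or actual memory footprint. A quick illustration:

```python
import sys

payload = b'foo'
print(len(payload))            # 3: the number of bytes in the payload
print(sys.getsizeof(payload))  # larger: payload plus CPython object overhead
```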
Answered Oct 12 '22 by Grzegorz Skibinski


Keeping it short and sweet:

l = [b'foo', b'bar', b'baz']

thresh = 7
out = []
cur_size = 0
for x in l:
    if len(x) > thresh:
        raise ValueError("str too big")
    if cur_size + len(x) > thresh:
        cur_size = 0
    if cur_size == 0:
        out.append([])
    out[-1].append(x)
    cur_size += len(x)

print(out)

This will output:

[[b'foo', b'bar'], [b'baz']]

That should be what you want, if I understood correctly. It's very simple: it appends the strings to the current sublist while tracking the combined size of everything in it; if adding the next item would push that size over the threshold, it starts a new sublist.

Answered Oct 12 '22 by xilpex