Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Expand and flatten a ragged nested list

I know that the topic of flattening a nested list has been covered in great detail before, however I think my task is a bit different and I couldn't find any info.

I am writing a scraper, and as output I get a nested list. The top level list elements are supposed to become rows for data in spreadsheet form. However, since the nested lists are often of different lengths, I need to expand them before flattening the list.

Here's an example. I have

   [ [ "id1", [["x", "y", "z"], [1, 2]],    ["a", "b", "c"]],
     [ "id2", [["x", "y", "z"], [1, 2, 3]], ["a", "b"]],
     [ "id3", [["x", "y"],      [1, 2, 3]], ["a", "b", "c", ""]] ]

The output I ultimately want is

   [[ "id1", "x", "y",  z, 1, 2, "", "a", "b", "c", ""],
    [ "id2", "x", "y",  z, 1, 2,  3, "a", "b",  "", ""],
    [ "id3", "x", "y", "", 1, 2,  3, "a", "b", "c", ""]]

However, an intermediate list like this

   [ [ "id1", [["x", "y", "z"], [1, 2, ""]], ["a", "b", "c", ""]],
     [ "id2", [["x", "y", "z"], [1, 2,  3]], ["a", "b",  "", ""]],
     [ "id3", [["x", "y",  ""], [1, 2,  3]], ["a", "b", "c", ""]] ]

which I can then simply flatten would also be fine.

The top-level list elements (rows) are built in every iteration, and appended to the full list. I guess it is easier to transform the full list at the end?

The structure in which elements are nested should be the same, however I cannot be certain of it at this point. I guess I have a problem if the structure where to look like this.

   [ [ "id1", [[x, y, z], [1, 2]],             ["a", "b", "c"]],
     [ "id2", [[x, y, z], [1, 2, 3]], ["bla"], ["a", "b"]],
     [ "id3", [[x, y],    [1, 2, 3]],          ["a", "b", "c", ""]] ]

which should become

   [[ "id1", x, y,  z, 1, 2, "",    "", "a", "b", "c", ""],
    [ "id2", x, y,  z, 1, 2,  3, "bla", "a", "b",  "", ""],
    [ "id3", x, y, "", 1, 2,  3,    "", "a", "b", "c", ""]]

Thanks for any comments, and please excuse if this is trivial, I am rather new to Python.

like image 325
ilprincipe Avatar asked Jan 16 '23 19:01

ilprincipe


2 Answers

I've got a simple solution for the "same structure" case, using a recursive generator and the izip_longest function from itertools. This code is for Python 2, but with a few tweaks (noted in comments) it can be made to work on Python 3:

from itertools import izip_longest # in py3, this is renamed zip_longest

def flatten(nested_list):
    return zip(*_flattengen(nested_list)) # in py3, wrap this in list()

def _flattengen(iterable):
    for element in izip_longest(*iterable, fillvalue=""):
        if isinstance(element[0], list):
            for e in _flattengen(element):
                yield e
        else:
            yield element

In Python 3.3 it will become even simpler, thanks to PEP 380 which will allow the recursive step, for e in _flatengen(element): yield e, to become yield from _flattengen(element).

like image 181
Blckknght Avatar answered Jan 28 '23 03:01

Blckknght


Actually there is no the solution for generic case where the structure is not the same. For example a normal algorithm would match ["bla"] with ["a", "b", "c"], and the result will be

 [  [ "id1", x, y,  z, 1, 2, "",   "a", "b", "c", "",  "",  ""],
    [ "id2", x, y,  z, 1, 2,  3, "bla",  "",  "", "", "a", "b"],
    [ "id3", x, y, "", 1, 2,  3,   "a", "b", "c", "",  "",  ""]]

But if you know you will have a number of rows, each starting with an ID an followed by a nested list structure, the algorithm below should work:

import itertools

def normalize(l):
    # just hack the first item to have only lists of lists or lists of items
    for sublist in l:
        sublist[0] = [sublist[0]]

    # break the nesting
    def flatten(l):
        for item in l:
            if not isinstance(item, list) or 0 == len([x for x in item if isinstance(x, list)]):
                yield item
            else:
                for subitem in flatten(item):
                    yield subitem

    l = [list(flatten(i)) for i in l]

    # extend all lists to greatest length
    list_lengths = { }
    for i in range(0, len(l[0])):
        for item in l:
            list_lengths[i] = max(len(item[i]), list_lengths.get(i, 0))

    for i in range(0, len(l[0])):
        for item in l:
            item[i] += [''] * (list_lengths[i] - len(item[i]))

    # flatten each row
    return [list(itertools.chain(*sublist)) for sublist in l]

l = [ [ "id1", [["x", "y", "z"], [1, 2]],    ["a", "b", "c"]],
      [ "id2", [["x", "y", "z"], [1, 2, 3]], ["a", "b"]],
      [ "id3", [["x", "y"],      [1, 2, 3]], ["a", "b", "c", ""]] ]
l = normalize(l)
print l
like image 34
Mihai Stan Avatar answered Jan 28 '23 02:01

Mihai Stan