Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to get union of lists - Python

There's a C++ comparison to get union of lists from lists of lists: The fastest way to find union of sets

And there's several other python related questions but none suggest the fastest way to unionize the lists:

  • Finding a union of lists of lists in Python
  • Flattening a shallow list in Python

From the answers, I've gathered that there are at least 2 ways to do it:

>>> from itertools import chain
>>> x = [[1,2,3], [3,4,5], [1,7,8]]
>>> list(set().union(*x))
[1, 2, 3, 4, 5, 7, 8]
>>> list(set(chain(*x)))
[1, 2, 3, 4, 5, 7, 8]

Note that I'm casting the set to list afterwards because I need the order of the list to be fixed for further processing.

After some comparison, it seems like list(set(chain(*x))) is more stable and takes less time:

from itertools import chain
import time
import random

# Dry run.
x = [[random.choice(range(10000)) 
    for i in range(10)] for j in range(10)]
list(set().union(*x))
list(set(chain(*x)))

y_time = 0
z_time = 0

for _ in range(1000):
    x = [[random.choice(range(10000)) 
        for i in range(10)] for j in range(10)]
    start = time.time()
    y = list(set().union(*x))
    y_time += time.time() - start 
    #print 'list(set().union(*x)):\t', y_time
    start = time.time()
    z = list(set(chain(*x)))
    z_time += time.time() - start 
    #print 'list(set(chain(*x))):\t', z_time
    assert sorted(y) == sorted(z)
    #print 

print y_time / 1000.
print z_time / 1000. 

[out]:

1.39586925507e-05
1.09834671021e-05

Taking out the variable of casting sets to list:

y_time = 0
z_time = 0

for _ in range(1000):
    x = [[random.choice(range(10000)) 
        for i in range(10)] for j in range(10)]

    start = time.time()
    y = set().union(*x)
    y_time += time.time() - start 

    start = time.time()
    z = set(chain(*x))
    z_time += time.time() - start 

    assert sorted(y) == sorted(z)

print y_time / 1000.
print z_time / 1000. 

[out]:

1.22241973877e-05
1.02684497833e-05

Here's the full output when I try to print the intermediate timings (without list casting): http://pastebin.com/raw/y3i6dXZ8

Why is it that list(set(chain(*x))) takes less time than list(set().union(*x))?

Is there another way of achieving the same union of lists? Using numpy or pandas or sframe or something? Is the alternative faster?

like image 674
alvas Avatar asked Mar 08 '16 11:03

alvas


1 Answers

What's fastest depends on the nature of x -- whether it is a long list or a short list, with many sublists or few sublists, whether the sublists are long or short, and whether there are many duplicates or few duplicates.

Here are some timeit results comparing some alternatives. There are so many possibilities that this is by no means a complete analysis, but perhaps this will give you a framework for studying your use case.

func                 | x                    | time
unique_concatenate   | many_uniques         | 0.863
empty_set_union      | many_uniques         | 1.191
short_set_union_rest | many_uniques         | 1.192
long_set_union_rest  | many_uniques         | 1.194
set_chain            | many_uniques         | 1.224

func                 | x                    | time
long_set_union_rest  | many_duplicates      | 0.958
short_set_union_rest | many_duplicates      | 0.969
empty_set_union      | many_duplicates      | 0.971
set_chain            | many_duplicates      | 1.128
unique_concatenate   | many_duplicates      | 2.411

func                 | x                    | time
empty_set_union      | many_small_lists     | 1.023
long_set_union_rest  | many_small_lists     | 1.028
set_chain            | many_small_lists     | 1.032
short_set_union_rest | many_small_lists     | 1.036
unique_concatenate   | many_small_lists     | 1.351

func                 | x                    | time
long_set_union_rest  | few_large_lists      | 0.791
empty_set_union      | few_large_lists      | 0.813
unique_concatenate   | few_large_lists      | 0.814
set_chain            | few_large_lists      | 0.829
short_set_union_rest | few_large_lists      | 0.849

Be sure to run the timeit benchmarks on your own machine since results may vary.


from __future__ import print_function
import random
import timeit
from itertools import chain
import numpy as np

def unique_concatenate(x):
    return np.unique(np.concatenate(x))

def short_set_union_rest(x):
    # This assumes x[0] is the shortest list in x
    return list(set(x[0]).union(*x[1:]))

def long_set_union_rest(x):
    # This assumes x[-1] is the longest list in x
    return list(set(x[-1]).union(*x[1:]))

def empty_set_union(x):
    return list(set().union(*x))

def set_chain(x):
    return list(set(chain(*x)))


big_range = list(range(10**7))
small_range = list(range(10**5))
many_uniques = [[random.choice(big_range) for i in range(j)] 
                for j in range(10, 10000, 10)]
many_duplicates = [[random.choice(small_range) for i in range(j)] 
              for j in range(10, 10000, 10)]
many_small_lists = [[random.choice(big_range) for i in range(10)] 
                    for j in range(10, 10000, 10)]
few_large_lists = [[random.choice(big_range) for i in range(1000)] 
                    for j in range(10, 100, 10)]

if __name__=='__main__':
    for x, n in [('many_uniques', 1), ('many_duplicates', 4), 
                 ('many_small_lists', 800), ('few_large_lists', 800)]:
        timing = dict()
        for func in [
                'unique_concatenate', 'short_set_union_rest', 'long_set_union_rest',
                'empty_set_union', 'set_chain']:
            timing[func, x] = timeit.timeit(
                '{}({})'.format(func, x), number=n,
                setup='from __main__ import {}, {}'.format(func, x))
        print('{:20} | {:20} | {}'.format('func', 'x', 'time'))
        for key, t in sorted(timing.items(), key=lambda item: item[1]):
            func, x = key
            print('{:20} | {:20} | {:.3f}'.format(func, x, t))
        print(end='\n')
like image 81
unutbu Avatar answered Oct 14 '22 10:10

unutbu