I have a list of lists and would like to create a data frame with the count of each unique element. Here is my test data:
test = [["P1", "P1", "P1", "P2", "P2", "P1", "P1", "P3"],
["P1", "P1", "P1"],
["P1", "P1", "P1", "P2"],
["P4"],
["P1", "P4", "P2"],
["P1", "P1", "P1"]]
I can do something like this using Counter with a for loop:
from collections import Counter
for item in test:
    print(Counter(item))
But how can I sum the results of this loop into a new data frame?
Expected output as data frame:
P1 P2 P3 P4
15 4 1 2
Use a list comprehension to count the elements in a list of lists: iterate over the outer list, build a new list holding the length of each inner list, then pass that list to sum() to get the total number of elements.
The most straightforward way to get the number of elements in a list is to use the Python built-in function len(). As the function name suggests, len() returns the length of the list, regardless of the types of elements in it.
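As a minimal sketch of the len() and sum() idea above, applied to the test data from the question (the names sizes and total are only illustrative):

```python
test = [["P1", "P1", "P1", "P2", "P2", "P1", "P1", "P3"],
        ["P1", "P1", "P1"],
        ["P1", "P1", "P1", "P2"],
        ["P4"],
        ["P1", "P4", "P2"],
        ["P1", "P1", "P1"]]

# Length of each inner list, then the grand total.
sizes = [len(sub) for sub in test]
total = sum(sizes)

print(sizes)  # [8, 3, 4, 1, 3, 3]
print(total)  # 22
```

Note that this only gives the total number of elements, not the per-label counts the question asks for.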
We can find the sum of each column of a nested list by unpacking the list into zip() inside a list comprehension. Another approach is to use map(): zip() yields the columns, and sum is applied to each one.
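A small sketch of the column-sum technique, assuming a rectangular nested list of numbers (the rows data here is made up for illustration and is not the question's data):

```python
rows = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]

# zip(*rows) yields one tuple per column.
sums_comprehension = [sum(col) for col in zip(*rows)]

# The same result using map() instead of a comprehension.
sums_map = list(map(sum, zip(*rows)))

print(sums_comprehension)  # [12, 15, 18]
print(sums_map)            # [12, 15, 18]
```

zip() stops at the shortest row, so ragged lists such as the question's test data would be silently truncated with this approach.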
If you want to count multiple items in a list, you can call count() in a loop. This approach, however, requires a separate pass over the list for every count() call, which can be catastrophic for performance. Use the Counter class from the collections module instead.
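To illustrate the difference, here is a short sketch that counts the same labels both ways; the labels list is a made-up sample:

```python
from collections import Counter

labels = ["P1", "P1", "P2", "P4", "P1", "P2"]

# count() in a loop: one full pass over the list per distinct label.
counts_with_count = {label: labels.count(label) for label in set(labels)}

# Counter: a single pass over the list.
counts_with_counter = Counter(labels)

print(counts_with_count == dict(counts_with_counter))  # True
```

Both give the same counts; the Counter version does so without rescanning the list for every label.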
Here is one way.
from collections import Counter
from itertools import chain
test = [["P1", "P1", "P1", "P2", "P2", "P1", "P1", "P3"],
["P1", "P1", "P1"],
["P1", "P1", "P1", "P2"],
["P4"],
["P1", "P4", "P2"],
["P1", "P1", "P1"]]
c = Counter(chain.from_iterable(test))
for k, v in c.items():
    print(k, v)
# P1 15
# P2 4
# P3 1
# P4 2
For output as dataframe:
import pandas as pd
df = pd.DataFrame.from_dict(c, orient='index').transpose()
# P1 P2 P3 P4
# 0 15 4 1 2
For better performance, use either collections.Counter with itertools.chain.from_iterable:
>>> from collections import Counter
>>> from itertools import chain
>>> Counter(chain.from_iterable(test))
Counter({'P1': 15, 'P2': 4, 'P4': 2, 'P3': 1})
Or use collections.Counter with a list comprehension (one less import, since itertools is not needed, with about the same performance):
>>> from collections import Counter
>>> Counter([x for a in test for x in a])
Counter({'P1': 15, 'P2': 4, 'P4': 2, 'P3': 1})
Keep reading for more alternative solutions and a performance comparison (skip otherwise).
Approach 1: Concatenate your sublists into a single list and find the counts using collections.Counter.
Solution 1: Concatenate the lists using itertools.chain.from_iterable and find the counts using collections.Counter:
test = [
["P1", "P1", "P1", "P2", "P2", "P1", "P1", "P3"],
["P1", "P1", "P1"],
["P1", "P1", "P1", "P2"],
["P4"],
["P1", "P4", "P2"],
["P1", "P1", "P1"]
]
from itertools import chain
from collections import Counter
my_counter = Counter(chain.from_iterable(test))
Solution 2: Combine list using list comprehension as:
from collections import Counter
my_counter = Counter([x for a in test for x in a])
Solution 3: Concatenate the lists using sum:
from collections import Counter
my_counter = Counter(sum(test, []))
Approach 2: Count the elements in each sublist using collections.Counter and then sum the Counter objects.
Solution 4: Count the objects in each sublist using collections.Counter and map:
from collections import Counter
my_counter = sum(map(Counter, test), Counter())
Solution 5: Count objects of each sublist using list comprehension as:
from collections import Counter
my_counter = sum([Counter(t) for t in test], Counter())
In all the solutions above, my_counter will hold the value:
>>> my_counter
Counter({'P1': 15, 'P2': 4, 'P4': 2, 'P3': 1})
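As a quick sanity check, the five solutions can be run side by side on the question's test data to confirm that they agree. This is only a sketch; the names results and all_equal are illustrative:

```python
from collections import Counter
from itertools import chain

test = [["P1", "P1", "P1", "P2", "P2", "P1", "P1", "P3"],
        ["P1", "P1", "P1"],
        ["P1", "P1", "P1", "P2"],
        ["P4"],
        ["P1", "P4", "P2"],
        ["P1", "P1", "P1"]]

results = [
    Counter(chain.from_iterable(test)),           # Solution 1
    Counter([x for a in test for x in a]),        # Solution 2
    Counter(sum(test, [])),                       # Solution 3
    sum(map(Counter, test), Counter()),           # Solution 4
    sum([Counter(t) for t in test], Counter()),   # Solution 5
]

# Every solution should produce the same counts as the first one.
all_equal = all(r == results[0] for r in results)

print(all_equal)   # True
print(results[0])  # Counter({'P1': 15, 'P2': 4, 'P4': 2, 'P3': 1})
```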
Below is the timeit comparison on Python 3 for a list of 1000 sublists with 100 elements in each sublist:
Fastest is chain.from_iterable (17.1 msec):
mquadri$ python3 -m timeit "from collections import Counter; from itertools import chain; my_list = [list(range(100)) for i in range(1000)]" "Counter(chain.from_iterable(my_list))"
100 loops, best of 3: 17.1 msec per loop
Second is using a list comprehension to combine the lists and then counting with Counter (similar result as above, but without the additional import of itertools) (18.36 msec):
mquadri$ python3 -m timeit "from collections import Counter; my_list = [list(range(100)) for i in range(1000)]" "Counter([x for a in my_list for x in a])"
100 loops, best of 3: 18.36 msec per loop
Third in terms of performance is using Counter on the sublists within a list comprehension (162 msec):
mquadri$ python3 -m timeit "from collections import Counter; my_list = [list(range(100)) for i in range(1000)]" "sum([Counter(t) for t in my_list], Counter())"
10 loops, best of 3: 162 msec per loop
Fourth is using Counter with map (results are quite similar to the list comprehension version above) (176 msec):
mquadri$ python3 -m timeit "from collections import Counter; my_list = [list(range(100)) for i in range(1000)]" "sum(map(Counter, my_list), Counter())"
10 loops, best of 3: 176 msec per loop
The solution using sum to concatenate the lists is by far the slowest, because every addition copies the accumulated list into a new one (526 msec):
mquadri$ python3 -m timeit "from collections import Counter; my_list = [list(range(100)) for i in range(1000)]" "Counter(sum(my_list, []))"
10 loops, best of 3: 526 msec per loop