Count elements, then remove duplicates

Question

I found out that the easiest way to group and count elements is with itertools.

I have a list of employee records, each with a department (e.g. Accounting, Purchasing, Marketing), and there are over 500 of them. A sample:

# employee number, first name, last name, department, rate, age, birthdate

201601005,Raylene,Kampa,Purchasing,365,15,12/19/2001,; 
200909005,Flo,Bookamer,Human Resources,800,28,12/19/1957,; 
200512016,Jani,Biddy,Human Resources,565,20,8/7/1966,; 
199806004,Chauncey,Motley,Admin,450,24,3/1/2000

What I intend to do is count all employees in each department, then remove the duplicate department names. The result should look like this (for example):

Accounting: 97
Marketing: 34
Purchasing: 45

The list is imported as a module, so I can't use csv to read it. The following is my code using itertools:

import empDataLT as x
from itertools import groupby

#Departments
def dept():
    empDept = list() #converting empDataLT to list
    for em in x.a:
        empEm = em.strip().split(",")
        empDept.append(empEm)
    e = sorted(empDept, key=lambda x: x[3]) #sort data alphabetical
    b = []
    c = []
    for s in e:
        new_b = []
        new_c = []
        for value, repeated in groupby(s[3]):
            new_b.append(value)
            new_c.append(sum(1 for _ in repeated))
        b.append(new_b)
        c.append(new_c)
    print(b)
    print(c)

Here the imported empDataLT is the 500-record list provided as a module. However, this code produces the following result:

[['A', 'c', 'o', 'u', 'n', 't', 'i', 'n', 'g'], [['A', 'c', 'o', 'u', 'n', 't', 'i', 'n', 'g'],
[[1, 2, 1, 1, 1, 1, 1, 1, 1], [1, 2, 1, 1, 1, 1, 1, 1, 1],

Yes, apparently it counts the letters of the departments instead. I'm still learning Python, so I am not quite sure how to fix it or work around it. Thank you in advance! Cheers.

PS: empData is a string, but it should be treated as a list.

One more thing, if it's not too much to ask: I also need to check which department has the highest number of employees. That part is less important, though. I can look it up myself. :D

Patrick Artner · Accepted Answer

Using groupby is fine, but it needs the data to be sorted by the grouping key first.
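For comparison, here is a minimal sketch of the groupby route. In the question's code, groupby(s[3]) is handed a single department string, so it groups that string's characters; passing the whole sorted list with a key function groups the records instead. The rows list and the department helper are illustrative names, not part of the original code, and the rows are assumed to be already split into fields.

```python
from itertools import groupby

# Sample rows from the question, already split into fields (illustrative data).
rows = [
    ["201601005", "Raylene", "Kampa", "Purchasing", "365", "15", "12/19/2001"],
    ["200909005", "Flo", "Bookamer", "Human Resources", "800", "28", "12/19/1957"],
    ["200512016", "Jani", "Biddy", "Human Resources", "565", "20", "8/7/1966"],
    ["199806004", "Chauncey", "Motley", "Admin", "450", "24", "3/1/2000"],
]


def department(row):
    """Return the department field (index 3) of one record."""
    return row[3]


# groupby only merges adjacent items with equal keys, so sort by the same key first.
rows_by_dept = sorted(rows, key=department)

# Group whole rows (not the characters of one string) and count each group.
dept_counts = {
    dept: sum(1 for _ in group)
    for dept, group in groupby(rows_by_dept, key=department)
}

for dept, count in dept_counts.items():
    print(f"{dept}: {count}")
```

Each department appears once as a dictionary key, so the "remove duplicates" step falls out of the grouping itself.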

Using a collections.defaultdict avoids sorting altogether:

s = """201601005,Raylene,Kampa,Purchasing,365,15,12/19/2001,; 
200909005,Flo,Bookamer,Human Resources,800,28,12/19/1957,; 
200512016,Jani,Biddy,Human Resources,565,20,8/7/1966,; 
199806004,Chauncey,Motley,Admin,450,24,3/1/2000"""


data = [ i.strip().split(",") for i in s.split(";")]

from collections import defaultdict
grpd_data = defaultdict(list)

for d in data:
    grpd_data[d[3]].append(d)


print(grpd_data)
print()

# sort by length of list, descending, and enumerate it:
for idx,(key,value) in enumerate(sorted(grpd_data.items(), key=lambda i:-len(i[1])), 1):
    print(idx,key,value,len(value))

Output (manually formatted):

defaultdict(<class 'list'>, {
    'Purchasing': [['201601005', 'Raylene', 'Kampa', 'Purchasing', '365', '15', '12/19/2001', '']], 
    'Human Resources': [['200909005', 'Flo', 'Bookamer', 'Human Resources', '800', '28', '12/19/1957', ''], 
                        ['200512016', 'Jani', 'Biddy', 'Human Resources', '565', '20', '8/7/1966', '']], 
    'Admin': [['199806004', 'Chauncey', 'Motley', 'Admin', '450', '24', '3/1/2000']]})

# with counts and sorted
1 Human Resources [['200909005', 'Flo', 'Bookamer', 'Human Resources', '800', '28', '12/19/1957', ''], 
                   ['200512016', 'Jani', 'Biddy', 'Human Resources', '565', '20', '8/7/1966', '']] 2
2 Purchasing      [['201601005', 'Raylene', 'Kampa', 'Purchasing', '365', '15', '12/19/2001', '']] 1
3 Admin           [['199806004', 'Chauncey', 'Motley', 'Admin', '450', '24', '3/1/2000']] 1
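The question asked for one "Department: count" line per department. As a rough sketch, the grouped dictionary can be reduced to that format by taking the length of each list. The names records, grouped and summary_lines are made up for this example, and the sample string is the same four records as above.

```python
from collections import defaultdict

# Same four sample records as above, separated by ";".
s = """201601005,Raylene,Kampa,Purchasing,365,15,12/19/2001,;
200909005,Flo,Bookamer,Human Resources,800,28,12/19/1957,;
200512016,Jani,Biddy,Human Resources,565,20,8/7/1966,;
199806004,Chauncey,Motley,Admin,450,24,3/1/2000"""

# Split into records, then into fields.
records = [chunk.strip().split(",") for chunk in s.split(";")]

# Group records under their department name (field index 3).
grouped = defaultdict(list)
for record in records:
    grouped[record[3]].append(record)

# One line per department, alphabetical, in the format the question asked for.
summary_lines = [f"{dept}: {len(members)}" for dept, members in sorted(grouped.items())]
print("\n".join(summary_lines))
```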

Edit - bigger data:

big = s
for _ in range(200):
    big += ";"+s

s = big 

data = [ i.strip().split(",") for i in s.split(";")]

from collections import defaultdict
gr = defaultdict(list)

for d in data:
    gr[d[3]].append(d)


for idx,(key,value) in enumerate(sorted(gr.items(), key=lambda i:-len(i[1])),1):
    print(idx, len(value))

Output:

1 402
2 201
3 201
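If only the counts are needed and not the grouped records, collections.Counter is a shorter alternative to the defaultdict shown above, and its most_common method also covers the follow-up about the department with the most employees. This is a sketch on the four sample records; records, dept_counts, top_dept and top_count are illustrative names.

```python
from collections import Counter

# Same four sample records as above, separated by ";".
s = """201601005,Raylene,Kampa,Purchasing,365,15,12/19/2001,;
200909005,Flo,Bookamer,Human Resources,800,28,12/19/1957,;
200512016,Jani,Biddy,Human Resources,565,20,8/7/1966,;
199806004,Chauncey,Motley,Admin,450,24,3/1/2000"""

records = [chunk.strip().split(",") for chunk in s.split(";")]

# Count how often each department name (field index 3) occurs.
dept_counts = Counter(record[3] for record in records)

# most_common(1) returns a one-element list holding the (name, count) pair
# with the highest count.
top_dept, top_count = dept_counts.most_common(1)[0]

for dept, count in dept_counts.most_common():
    print(f"{dept}: {count}")
print("Largest department:", top_dept, top_count)
```

Like the defaultdict approach, Counter needs no sorting before counting.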

Tags:

python

python-3.x

Gaaaaaab

1 Answers

Patrick Artner
