Evaluating and removing duplicate dicts from a Python list

Question

Business problem: I have a list of dicts that represents a given student’s academic history: the classes they have taken, when they took them, what their grade was (a blank grade indicates the class is in progress), etc. I need to find any duplicate attempts at a given class and keep only the attempt with the highest grade.

What I’ve attempted so far:

acad_hist = [{'crse_id': u'GRG 302P0', 'grade': u''}, {'crse_id': u'URB 3010', 'grade': u'B+'},
             {'crse_id': u'GRG 302P0', 'grade': u'D'}]

grade_list = ['CR', 'D-', 'D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+']

At first I tried to loop through the acad_hist list and add any not-yet-seen classes to a “seen” list. The plan was that when I came across a class already in the “seen” list, I would go back to the acad_hist list, grab the details (e.g. "grade") of that class, compare the grades, and remove the class with the lower grade from the acad_hist list. The problem is that I’m having a tough time going back and “grabbing” the earlier-seen class from the “seen” list, and even more difficulty correctly pointing to it once I know I need to delete it from the acad_hist list. The code is a mess, but here is what I have so far:
```
key = 'crse_id'
seen = []  # course ids encountered so far
for index, course in enumerate(acad_hist[:]):
    if course[key] not in seen:
        seen.append(course[key])
    else:
        logger.info('found duplicate {0} at index {1}'.format(course[key], index))
        # < not sure what to do here… >
```
OUTPUT:
```
found duplicate GRG 302P0 at index 11
```
So then I thought I might be able to use the set() function to cull the list for me, but the problem here is that I need to choose which class instance to keep and set() doesn’t seem to allow me a way to do that.
```
names = set(d['compressed_hist_crse_id'] for d in acad_hist_condensed)
logger.info('TEST names: {0}'.format(names))
```
OUTPUT:
```
TEST names: set([u'GRG 302P0', u'URB 3010'])
```

Wanting to see if I could add to #2 above, I thought I’d do some “belt-n-suspenders” looping through the output of the set() “names” and collect a grade. It’s working, but I don’t pretend to fully understand what it’s doing, nor does it really allow me to do the processing I need to do.

new_dicts = []
for name in names:
    d = dict(name=name)
    d['grade'] = max(d['grade'] for d in acad_hist if d['crse_id'] == name)
    new_dicts.append(d)
logger.info('TEST new_dicts: {0}'.format(new_dicts))

OUTPUT:

TEST new_dicts: [{'grade': u'', 'name': u'GRG 302P0'}, {'grade': u'B+', 'name': u'URB 3010'}]

Can anyone provide me with the missing pieces, or even a better way to do this?

UPDATE -- the solution I ended up with (adaptation of ideas I got from the accepted answer)

import logging


def scrub_for_duplicate_courses(acad_hist_condensed, acad_hist_list):
    """
    Looks for duplicate courses that may have been taken, and if any are found, will look for the one with the highest
    grade and keep that one, deleting the other course from the lists before returning them.
    """

    # -------------------------------------------
    # set logging params
    # -------------------------------------------
    logger = logging.getLogger(__name__)

    # -----------------------------------------------------------------------------------------------------
    # the grade_list is in order of ascending priority/value...a blank grade indicates "in-progress", and
    # will therefore replace any class instance that has a grade.
    # -----------------------------------------------------------------------------------------------------
    grade_list = ['CR', 'D-', 'D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+', '']
    # converting the grade_list into a more efficient, weighted dict
    grade_list = dict(zip(grade_list, range(len(grade_list))))

    seen_courses = {}

    for course in acad_hist_condensed[:]:
        # -----------------------------------------------------------------------------------------------------
        # one of the two keys checked for below should exist in the list, but not both
        # -----------------------------------------------------------------------------------------------------
        key = ''
        if 'compressed_hist_crse_id' in course:
            key = 'compressed_hist_crse_id'
        elif 'compressed_ovrd_crse_id' in course:
            key = 'compressed_ovrd_crse_id'

        cid = course[key]
        grade = course['grade']

        if cid not in seen_courses:
            seen_courses[cid] = grade
        else:
            # ---------------------------------------------------------------------------------------------------------
            # if we get here, a duplicate course_id has been found in the acad_hist_condensed list, so now we'll want
            # to determine which one has the lowest grade, and remove that course instance from both lists.
            # ---------------------------------------------------------------------------------------------------------
            if grade_list.get(seen_courses[cid], 0) < grade_list.get(grade, 0):
                # capture the old (lower) grade BEFORE overlaying it, so the right record gets removed
                grade_for_rec_to_remove = seen_courses[cid]
                seen_courses[cid] = grade  # this will overlay the grade for the record already in seen_courses
                crse_id_for_rec_to_remove = cid
            else:
                grade_for_rec_to_remove = grade
                crse_id_for_rec_to_remove = cid

            # -----------------------------------------------------------------------------------------------------
            # find the rec in acad_hist_condensed that needs removal
            # -----------------------------------------------------------------------------------------------------
            for rec in acad_hist_condensed:
                # .get() because a record may carry the other course-id key
                if rec.get(key) == crse_id_for_rec_to_remove and rec['grade'] == grade_for_rec_to_remove:
                    acad_hist_condensed.remove(rec)
                    break  # stop here: the list must not keep being iterated after it was modified
            for rec in acad_hist_list:
                if rec == crse_id_for_rec_to_remove:
                    acad_hist_list.remove(rec)
                    break  # just want to remove one occurrence

    return acad_hist_condensed, acad_hist_list
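
For comparison, the same "keep the best attempt" rule can be sketched as a single pass that builds a new list instead of removing items from a list while looping over it. This is only a minimal sketch under assumptions: `keep_best_attempts`, `GRADE_ORDER`, and `GRADE_RANK` are hypothetical names, every record is assumed to use the single key `'crse_id'`, and a blank grade is ranked highest, as in the function above.

```python
# Ascending priority; '' (in progress) ranks highest, mirroring the grade_list above.
GRADE_ORDER = ['CR', 'D-', 'D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+', '']
GRADE_RANK = dict(zip(GRADE_ORDER, range(len(GRADE_ORDER))))


def keep_best_attempts(records, key='crse_id'):
    """Return a new list with one record per course id (hypothetical helper)."""
    best = {}   # course id -> best record seen so far
    order = []  # course ids in first-seen order, so the output order is stable
    for rec in records:
        cid = rec[key]
        if cid not in best:
            best[cid] = rec
            order.append(cid)
        elif GRADE_RANK.get(best[cid]['grade'], 0) < GRADE_RANK.get(rec['grade'], 0):
            best[cid] = rec  # the new attempt outranks the stored one
    return [best[cid] for cid in order]


sample = [{'crse_id': 'GRG 302P0', 'grade': 'D'},
          {'crse_id': 'URB 3010', 'grade': 'B+'},
          {'crse_id': 'GRG 302P0', 'grade': ''}]
result = keep_best_attempts(sample)
# result keeps the in-progress GRG 302P0 record and the URB 3010 record
```

Because whole records are stored rather than just grades, any other fields on the winning attempt are carried along, and the input list is left untouched.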

Charles · Accepted Answer

A simple solution would be to iterate over each student's course history and calculate the max grade in each course:

acad_hist = [{'crse_id': u'GRG 302P0', 'grade': u''}, {'crse_id': u'URB 3010', 'grade': u'B+'}, {'crse_id': u'GRG 302P0', 'grade': u'D'}]

grade_list = ['CR', 'D-', 'D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+']
#let's turn grade_list into something more efficient:
grade_list = dict(zip(grade_list, range(len(grade_list)))) # 'CR' == 0, 'D-' == 1

courses = {} # keys will be crse_id, values will be grade.
for course in acad_hist:
    cid = course['crse_id']
    g = course['grade']
    if cid not in courses:
        courses[cid] = g 
    else:
        if grade_list.get(courses[cid], 0) < grade_list.get(g,0):
            courses[cid] = g

The output would be:

{u'GRG 302P0': u'D', u'URB 3010': u'B+'}

which could be rewritten back to its original form if needed.
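
Converting the result back into the original list-of-dicts form might look like the sketch below. It assumes `courses` is the dict produced by the loop above (hard-coded here so the snippet runs on its own) and that only the `'crse_id'` and `'grade'` keys are needed; `deduped_hist` is a hypothetical name.

```python
# Stand-in for the dict built by the loop above: crse_id -> best grade.
courses = {'GRG 302P0': 'D', 'URB 3010': 'B+'}

# Rebuild one dict per course with the same keys the original records used.
# sorted() is only there to give a predictable order.
deduped_hist = [{'crse_id': cid, 'grade': grade}
                for cid, grade in sorted(courses.items())]
# deduped_hist == [{'crse_id': 'GRG 302P0', 'grade': 'D'},
#                  {'crse_id': 'URB 3010', 'grade': 'B+'}]
```

Any other fields on the original records are lost this way, because `courses` stores only the grade. If those fields matter, storing the whole record as the dict value instead of the grade is one way around that.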

Peter Sutton · Answer

This can be done using iterator Lego (namely ifilter, sorted, groupby, and max)

def find_best_grades(history):
    def course(course_grade):
        return course_grade['crse_id']
    def grade(course_grade):
        return GRADES[course_grade['grade']]
    def has_grade(course_grade):
        return bool(course_grade['grade'])

    # 1) Remove course grades without grades.
    # 2) Sort the history so that grades for the same course are
    #    consecutive (this allows groupby to work).
    # 3) Group grades for the same course together.
    # 4) Use max to select the highest grade obtained for a course.

    return [max(course_grades, key=grade)
            for _, course_grades in
            groupby(sorted(ifilter(has_grade, history), key=course),
                    key=course)]

Full complete code

from itertools import groupby, ifilter


COURSE_ID = 'crse_id'
GRADE = 'grade'

ACADEMIC_HISTORY = [
    {
        COURSE_ID: 'GRG 302P0',
        GRADE    : 'B',
    },
    {
        COURSE_ID: 'GRG 302P0',
        GRADE    : '',
    },
    {
        COURSE_ID: 'URB 3010',
        GRADE    : 'B+',
    },
    {
        COURSE_ID: 'GRG 302P0',
        GRADE    : 'D',
    },
]

GRADES = [
    'CR',
    'D-',
    'D' ,
    'D+',
    'C-',
    'C' ,
    'C+',
    'B-',
    'B' ,
    'B+',
    'A-',
    'A' ,
    'A+',
]

GRADES = dict(zip(GRADES, range(len(GRADES))))


def find_best_grades(history):
    def course(course_grade):
        return course_grade['crse_id']
    def grade(course_grade):
        return GRADES[course_grade['grade']]
    def has_grade(course_grade):
        return bool(course_grade['grade'])

    # 1) Remove course grades without grades.
    # 2) Sort the history so that grades for the same course are
    #    consecutive (this allows groupby to work).
    # 3) Group grades for the same course together.
    # 4) Use max to select the highest grade obtained for a course.

    return [max(course_grades, key=grade)
            for _, course_grades in
            groupby(sorted(ifilter(has_grade, history), key=course),
                    key=course)]

best_grades = find_best_grades(ACADEMIC_HISTORY)
print best_grades
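
The code above targets Python 2 (`itertools.ifilter` and the `print` statement). On Python 3, `ifilter` no longer exists and the built-in `filter` is lazy, so the same filter, sort, group, max pipeline could be sketched roughly as follows. The names `find_best_grades_py3`, `GRADE_ORDER`, and `GRADE_RANK` are hypothetical, and the sample data simply mirrors `ACADEMIC_HISTORY`.

```python
from itertools import groupby

GRADE_ORDER = ['CR', 'D-', 'D', 'D+', 'C-', 'C', 'C+',
               'B-', 'B', 'B+', 'A-', 'A', 'A+']
GRADE_RANK = {grade: rank for rank, grade in enumerate(GRADE_ORDER)}


def find_best_grades_py3(history):
    """Python 3 sketch of the filter / sort / groupby / max pipeline."""
    def course(rec):
        return rec['crse_id']

    def rank(rec):
        return GRADE_RANK[rec['grade']]

    # 1) Drop records with a blank grade, 2) sort so equal course ids are adjacent.
    graded = sorted(filter(lambda rec: rec['grade'], history), key=course)
    # 3) Group by course id, 4) keep the record with the highest-ranked grade.
    return [max(group, key=rank) for _, group in groupby(graded, key=course)]


history = [
    {'crse_id': 'GRG 302P0', 'grade': 'B'},
    {'crse_id': 'GRG 302P0', 'grade': ''},
    {'crse_id': 'URB 3010', 'grade': 'B+'},
    {'crse_id': 'GRG 302P0', 'grade': 'D'},
]
best = find_best_grades_py3(history)
print(best)
```

As in the original, records with a blank grade are filtered out first, so a course that has only an in-progress attempt does not appear in the result at all.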

Tags:

python

KeithE
