
Avoid inner loop while iterating through nested data (performance improvement)?

I am working with a large dataset (1 million+ rows) and am tasked with counting the truthy values for each ID and generating a new dict.

The solution I came up with works, but it performs very poorly.

I have a dictionary as follows:

  "data": {
    "employees": [
      {
        "id": 274,
        "report": true
      },
      {
        "id": 274,
        "report": false
      },
      {
        "id": 276,
        "report": true
      },
      {
        "id": 276,
        "report": true
      },
      {
        "id": 278,
        "report": true
      },
      {
        "id": 278,
        "report": false
      }
    ]
  }

I am looking to create a new dictionary keyed by employee ID, where each entry holds the ID and a count of that employee's true values.

Something like this:

{274: {'id': 274, 'count': 1}, 276: {'id': 276, 'count': 2}, 278: {'id': 278, 'count': 1}}

My current code:

        final_dict = {}

        for employee in result["data"]["employees"]:
            if employee["id"] not in final_dict.keys():
                final_dict[employee["id"]] = {"id": employee["id"]}
                # Rescans the whole list for every new id (this is the slow part)
                grouped_results = [res for res in result["data"]["employees"] if
                                   employee["id"] == res['id']]
                final_dict[employee["id"]]["count"] = len(
                    [res for res in grouped_results if res["report"]]
                )

        return final_dict

This produces the right result, but it performs very poorly on the amount of data being processed.

I am looking for some advice on how to avoid the multiple loops, in order to improve performance. Any advice helps!

master_j02 asked Oct 21 '25 03:10


2 Answers

There is no need to make multiple passes. Just accumulate as you go, so the work is linear time rather than quadratic:

result = {}

for employee in input_dict["data"]["employees"]:
    _id = employee["id"]
    if _id not in result:
        # Note: the id is stored redundantly (it is already the key); maybe rethink this.
        result[_id] = dict(id=_id, count=0)
    # True adds 1 and False adds 0, since bool is a subclass of int.
    result[_id]["count"] += employee["report"]
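
If the redundant id is not actually needed downstream, one possible simplification is to map each id straight to its count. The following is only a sketch under that assumption: the sample `input_dict` is made up to mirror the structure in the question, and `counts` is an illustrative name.

```python
from collections import defaultdict

# Hypothetical sample input mirroring the structure shown in the question.
input_dict = {
    "data": {
        "employees": [
            {"id": 274, "report": True},
            {"id": 274, "report": False},
            {"id": 276, "report": True},
            {"id": 276, "report": True},
            {"id": 278, "report": True},
            {"id": 278, "report": False},
        ]
    }
}

# Map each id directly to its count; ids seen for the first time start at 0.
counts = defaultdict(int)
for employee in input_dict["data"]["employees"]:
    counts[employee["id"]] += employee["report"]  # True adds 1, False adds 0

print(dict(counts))  # {274: 1, 276: 2, 278: 1}
```

This is still a single pass over the list. The nested `{'id': ..., 'count': ...}` shape can be rebuilt from `counts` afterwards if some consumer requires it.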
juanpa.arrivillaga answered Oct 22 '25 16:10


With the dict.setdefault method, a single pass is enough:

report_counts = {}
for employee in result["data"]["employees"]:
    d = report_counts.setdefault(employee['id'], {'id': employee['id'], 'count': 0})
    d['count'] += employee['report']

print(report_counts)

{274: {'id': 274, 'count': 1}, 276: {'id': 276, 'count': 2}, 278: {'id': 278, 'count': 1}}
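
The `+=` above relies on `report` being a real boolean. Since the question talks about "truthy" values, here is a hedged variant that normalizes each value with `bool()` first. The function name `count_truthy_reports` and the non-boolean sample values are assumptions made for illustration, not part of the original data.

```python
def count_truthy_reports(employees):
    """Count truthy "report" values per id in a single pass."""
    report_counts = {}
    for employee in employees:
        d = report_counts.setdefault(
            employee["id"], {"id": employee["id"], "count": 0}
        )
        # bool(...) turns any truthy value (e.g. 5 or "yes") into True, which adds 1.
        d["count"] += bool(employee["report"])
    return report_counts


# Hypothetical rows: the non-boolean "report" values are invented for this sketch.
rows = [
    {"id": 274, "report": True},
    {"id": 274, "report": "yes"},
    {"id": 276, "report": 0},
    {"id": 276, "report": 5},
]

print(count_truthy_reports(rows))
# {274: {'id': 274, 'count': 2}, 276: {'id': 276, 'count': 1}}
```

Without the `bool()` call, a value such as `5` would add 5 to the count, and a string would raise a `TypeError`.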
RomanPerekhrest answered Oct 22 '25 17:10