Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can Lists cope with a million items?

Tags:

python

csv

I am slowly developing a data processing application in Python 2.6 (*). My test data are very small, like 5000 cases, but it is expected that there will be a million cases in the near future and I wondering if my current approach is workable under those conditions.

Structure of the problem: I have two csv files, one contains call (5000 rows, 20 columns) and another one details for a call (500 rows, 10 columns). I have to build a third csv file which will contain all cases from the "call" file where additional details where found. Behind the scenes there is some heavy lifting going on (merging and restructuring data in the details list, data comparison between lists). But I am very nervous about building the output list: at the moment the code looks like this:

def reduceOutputListToPossibleMatches(outputList, detailsList):
    reducedList = list()

    for outputItem in outputList:
        isFound = False
        for detailsItem in detailsList:
            if detailsItem[14] == outputItem[4]:
                if isfound:
                    detailsItem[30] = "1" #ambigous case
                                          # - more than one match was found
# 1 is an indicator for true - I am not using python here because spss has no support for booleans.
                isFound = True
        if isFound:
            reducedList.append(detailsItem )

    return reducedList

I think that this algorithm will take a very long time, simply because I have to loop about two large lists. So my questions boils down to: How fast are lists in Python and are there better alternatives? Additionally: The double-List are somewhat inconvenient to handle, because I have to remember the index position of each column - is there a better alternative?

*=I am calling SPSS Version 19 later on, which refuses to work with newer versions of python.

like image 718
Christian Sauer Avatar asked Sep 25 '13 12:09

Christian Sauer


2 Answers

From Elazar's answer, using a dict to avoid the inner loop:

def reduceOutputListToPossibleMatches(outputList, detailsList):
    details = {}
    for detailsItem in detailsList:
        key = detailsItem[14]
        if key in details:
            details[key][30] = "1"
        else:
            details[key] = detailsItem

    for outputItem in outputList:
        key = outputItem[4]
        if key in details:
            yield details[key]

res = reduceOutputListToPossibleMatches(outputList, detailsList)
with open('somefile', 'w') as f:
    f.writelines(res)

If you need all of the ambiguous lines:

def reduceOutputListToPossibleMatches(outputList, detailsList):
    details = {}
    for detailsItem in detailsList:
        key = detailsItem[14]
        if key in details:
            details[key].append(detailsItem)
        else:
            details[key] = [detailsItem]

    for outputItem in outputList:
        key = outputItem[4]
        if key in details:
            for item in details[key]:
                if len(details[key]) > 1:
                    item[30] = "1"
                yield item

res = reduceOutputListToPossibleMatches(outputList, detailsList)
with open('somefile', 'w') as f:
    f.writelines(res)
like image 91
Duncan Avatar answered Sep 22 '22 22:09

Duncan


I think you don't need to return a list. You can do something like this:

def reduceOutputListToPossibleMatches(outputList, detailsList):
    for outputItem in outputList:
        isFound = False
        for detailsItem in detailsList:
            if detailsItem[14] == outputItem[4]: #there was a syntax error here
                if isfound:
                    detailsItem[30] = "1"
                    break
                isFound = True
        else:
            yield detailsItem

res = reduceOutputListToPossibleMatches(outputList, detailsList)
with open('somefile', 'w') as f:
    f.writelines(res)

But it's still O(n**2) which is pretty slow. Maybe an SQL database (through Django?) will be more suitable for this task.

Small changes to @Duncan's suggestion:

from collections import defaultdict
def reduceOutputListToPossibleMatches(outputList, detailsList):
    details = defaultdict(list)
    for detailsItem in detailsList:
        key = detailsItem[14]
        details[key].append(detailsItem)

    for outputItem in outputList:
        val = details[outputItem[4]]
        if len(val) > 1:
            for item in val:
                item[30] = "1"
        yield from val
like image 43
Elazar Avatar answered Sep 24 '22 22:09

Elazar