I am slowly developing a data processing application in Python 2.6 (*). My test data are very small, around 5000 cases, but a million cases are expected in the near future, and I am wondering whether my current approach will still be workable under those conditions.
Structure of the problem: I have two csv files; one contains calls (5000 rows, 20 columns) and the other contains details for a call (500 rows, 10 columns). I have to build a third csv file containing all cases from the "call" file for which additional details were found. Behind the scenes there is some heavy lifting going on (merging and restructuring data in the details list, comparing data between the lists). But I am very nervous about building the output list; at the moment the code looks like this:
def reduceOutputListToPossibleMatches(outputList, detailsList):
    reducedList = list()
    for outputItem in outputList:
        isFound = False
        for detailsItem in detailsList:
            if detailsItem[14] == outputItem[4]:
                if isFound:
                    # ambiguous case - more than one match was found;
                    # "1" is an indicator for true - not a Python boolean, because SPSS has no support for booleans
                    detailsItem[30] = "1"
                else:
                    reducedList.append(detailsItem)
                isFound = True
    return reducedList
I think that this algorithm will take a very long time, simply because I have to loop over two large lists. So my question boils down to: how fast are lists in Python, and are there better alternatives? Additionally, the lists of lists are somewhat inconvenient to handle, because I have to remember the index position of each column - is there a better alternative?
(*) I am calling SPSS Version 19 later on, which refuses to work with newer versions of Python.
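On the side question about remembering index positions: the standard library's csv.DictReader returns each row as a dict keyed by the header line, so columns can be addressed by name instead of position. A minimal sketch, assuming a hypothetical calls.csv with a header row containing a caller_id column:

import csv

with open('calls.csv', 'rb') as f:    # Python 2: open csv files in binary mode
    for row in csv.DictReader(f):     # each row is a dict keyed by the header names
        print row['caller_id']        # by name, instead of row[4] by position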
From Elazar's answer, using a dict to avoid the inner loop - dict lookups take constant time on average, so this makes one pass over each list instead of a nested scan:
import csv

def reduceOutputListToPossibleMatches(outputList, detailsList):
    details = {}
    for detailsItem in detailsList:
        key = detailsItem[14]
        if key in details:
            details[key][30] = "1"   # key seen before: mark the stored row as ambiguous
        else:
            details[key] = detailsItem

    for outputItem in outputList:
        key = outputItem[4]
        if key in details:
            yield details[key]

res = reduceOutputListToPossibleMatches(outputList, detailsList)
with open('somefile', 'wb') as f:
    csv.writer(f).writerows(res)   # rows are lists of fields, not strings, so write them as csv
If you need all of the ambiguous lines:
import csv

def reduceOutputListToPossibleMatches(outputList, detailsList):
    details = {}
    for detailsItem in detailsList:
        key = detailsItem[14]
        if key in details:
            details[key].append(detailsItem)
        else:
            details[key] = [detailsItem]

    for outputItem in outputList:
        key = outputItem[4]
        if key in details:
            ambiguous = len(details[key]) > 1   # hoisted out of the loop below
            for item in details[key]:
                if ambiguous:
                    item[30] = "1"
                yield item

res = reduceOutputListToPossibleMatches(outputList, detailsList)
with open('somefile', 'wb') as f:
    csv.writer(f).writerows(res)
I think you don't need to return a list. You can do something like this:
import csv

def reduceOutputListToPossibleMatches(outputList, detailsList):
    for outputItem in outputList:
        matched = None
        for detailsItem in detailsList:
            if detailsItem[14] == outputItem[4]:   # there was a syntax error here in the question's code
                if matched is not None:
                    detailsItem[30] = "1"
                    break              # ambiguous - skip this output item (the for-else below is bypassed)
                matched = detailsItem
        else:                          # inner loop finished without break
            if matched is not None:
                yield matched          # exactly one detail row matched

res = reduceOutputListToPossibleMatches(outputList, detailsList)
with open('somefile', 'wb') as f:
    csv.writer(f).writerows(res)
But it's still O(n**2), which is pretty slow. Maybe an SQL database (through Django?) would be more suitable for this task.
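To make the database suggestion concrete, here is a minimal sketch using the standard library's sqlite3 module (included since Python 2.5, so it works on 2.6); the table layouts and sample rows are made up for illustration. The index on the join column is what avoids the O(n**2) scan:

import sqlite3

conn = sqlite3.connect(':memory:')   # use a file path for a persistent database

# Hypothetical, much-simplified schemas: a join key plus one payload column each.
conn.execute('CREATE TABLE calls (call_id TEXT, match_key TEXT)')
conn.execute('CREATE TABLE details (match_key TEXT, payload TEXT)')
conn.execute('CREATE INDEX idx_details_key ON details (match_key)')

# In the real program these rows would come from the two csv files.
conn.executemany('INSERT INTO calls VALUES (?, ?)',
                 [('c1', 'k1'), ('c2', 'k2'), ('c3', 'k9')])
conn.executemany('INSERT INTO details VALUES (?, ?)',
                 [('k1', 'detail A'), ('k2', 'detail B')])

# A single indexed join replaces the two nested Python loops.
for row in conn.execute('SELECT c.call_id, d.payload '
                        'FROM calls c JOIN details d '
                        'ON c.match_key = d.match_key'):
    print row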
Small changes to @Duncan's suggestion:
from collections import defaultdict

def reduceOutputListToPossibleMatches(outputList, detailsList):
    details = defaultdict(list)
    for detailsItem in detailsList:
        key = detailsItem[14]
        details[key].append(detailsItem)

    for outputItem in outputList:
        val = details[outputItem[4]]
        if len(val) > 1:
            for item in val:
                item[30] = "1"    # more than one match: flag every matching row as ambiguous
        for item in val:          # "yield from val" needs Python 3.3+, and the question is on 2.6
            yield item
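Used the same way as the versions above; since the rows are lists of fields, writing them with csv.writer keeps the output well-formed:

import csv

res = reduceOutputListToPossibleMatches(outputList, detailsList)
with open('somefile', 'wb') as f:
    csv.writer(f).writerows(res)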