I have a very large CSV of data, and for each row I need to append earlier data for the same name (the Name column), taken only from rows dated before the current row's date (the Date column). I think the easiest way to represent this problem is to provide a detailed example similar to my real data, but scaled down significantly:
Datatitle,Date,Name,Score,Parameter
data,01/09/13,george,219,dataa,text
data,01/09/13,fred,219,datab,text
data,01/09/13,tom,219,datac,text
data,02/09/13,george,229,datad,text
data,02/09/13,fred,239,datae,text
data,02/09/13,tom,219,dataf,text
data,03/09/13,george,209,datag,text
data,03/09/13,fred,217,datah,text
data,03/09/13,tom,213,datai,text
data,04/09/13,george,219,dataj,text
data,04/09/13,fred,212,datak,text
data,04/09/13,tom,222,datal,text
data,05/09/13,george,319,datam,text
data,05/09/13,fred,225,datan,text
data,05/09/13,tom,220,datao,text
data,06/09/13,george,202,datap,text
data,06/09/13,fred,226,dataq,text
data,06/09/13,tom,223,datar,text
data,06/09/13,george,219,dataae,text
So for the first three rows of this CSV there is no previous data. If we wanted to pull columns 3 & 4 (Score and Parameter, counting from zero) for the last 3 occurrences of george (row 1) on a date previous to the current one, it would yield:
data,01/09/13,george,219,dataa,text,x,y,x,y,x,y
However, when previous data starts to become available, we would hope to produce a CSV such as this:
Datatitle,Date,Name,Score,Parameter,LTscore,LTParameter,LTscore+1,LTParameter+1,LTscore+2,LTParameter+3,
data,01/09/13,george,219,dataa,text,x,y,x,y,x,y
data,01/09/13,fred,219,datab,text,x,y,x,y,x,y
data,01/09/13,tom,219,datac,text,x,y,x,y,x,y
data,02/09/13,george,229,datad,text,219,dataa,x,y,x,y
data,02/09/13,fred,239,datae,text,219,datab,x,y,x,y
data,02/09/13,tom,219,dataf,text,219,datac,x,y,x,y
data,03/09/13,george,209,datag,text,229,datad,219,dataa,x,y
data,03/09/13,fred,217,datah,text,239,datae,219,datab,x,y
data,03/09/13,tom,213,datai,text,219,dataf,219,datac,x,y
data,04/09/13,george,219,dataj,text,209,datag,229,datad,219,dataa
data,04/09/13,fred,212,datak,text,217,datah,239,datae,219,datab
data,04/09/13,tom,222,datal,text,213,datai,219,dataf,219,datac
data,05/09/13,george,319,datam,text,219,dataj,209,datag,229,datad
data,05/09/13,fred,225,datan,text,212,datak,217,datah,239,datae
data,05/09/13,tom,220,datao,text,222,datal,213,datai,219,dataf
data,06/09/13,george,202,datap,text,319,datam,219,dataj,209,datag
data,06/09/13,fred,226,dataq,text,225,datan,212,datak,217,datah
data,06/09/13,tom,223,datar,text,220,datao,222,datal,213,datai
data,06/09/13,george,219,datas,text,319,datam,219,dataj,209,datag
You will notice that for 06/09/13 george occurs twice, and both times he has the same string 319,datam,219,dataj,209,datag appended to his row. The second george row gets this same string because the george three rows above it is on the same date. (This just emphasises the "on a date previous to the current one" condition.)
As you can see from the column titles, we are collecting the last 3 scores and the associated 3 parameters and appending them to each row. Please note, this is a very simplified example. In reality each date will contain a couple of thousand rows, and there is no pattern to the names in the real data, so we wouldn't expect to see fred, tom, george next to each other in a repeating pattern. If anyone can help me work out how best (most efficiently) to achieve this I would be very grateful. If anything is unclear please let me know and I will add more detail. Any constructive comments appreciated. Thanks SMNALLY
It appears your file is in date order. If we take the last entry per name per date, and add that to a sized deque for each name while writing out each row, that should do the trick:
# Written for Python 2: note the 'rb'/'wb' file modes and dict.iteritems().
import csv
from collections import deque, defaultdict
from itertools import chain, islice, groupby
from operator import itemgetter

# defaultdict whose first access of a key will create a deque of size 3
# defaulting to [['x', 'y'], ['x', 'y'], ['x', 'y']]
# Since deques are efficient at head/tail manipulation, an insert at
# the start is efficient, and when the size is fixed it will cause extra
# elements to "fall off" the end...
names_previous = defaultdict(lambda: deque([['x', 'y']] * 3, 3))

with open('sample.csv', 'rb') as fin, open('sample_new.csv', 'wb') as fout:
    csvin = csv.reader(fin)
    csvout = csv.writer(fout)
    # Use groupby to detect changes in the date column. Since the data is always
    # ascending, the items within the same date are contiguous in the data. We use
    # this to identify the rows within the *same* date.
    # date=date we're looking at, rows=an iterable of rows that are in that date...
    for date, rows in groupby(islice(csvin, 1, None), itemgetter(1)):
        # After we've processed entries in this date, we need to know what items of data should
        # be considered for the names we've seen inside this date. Currently the data
        # is taken from the last occurring row for the name.
        to_add = {}
        for row in rows:
            # Output the row present in the file with a *flattened* version of the extra data
            # (previous items) that we wish to apply. eg:
            # [['x', 'y'], ['x', 'y'], ['x', 'y']] becomes ['x', 'y', 'x', 'y', 'x', 'y']
            # So we're easily able to store 3 pairs of data, but flatten it into one long
            # list of 6 items...
            # If the name (row[2]) doesn't exist yet, then by trying to do this, defaultdict
            # will automatically create the default key as above.
            csvout.writerow(row + list(chain.from_iterable(names_previous[row[2]])))
            # Here, we store for the name any additional data that should be included for the name
            # on the next date group. In this instance we store the information seen for the last
            # occurrence of that name in this date. eg: If we've seen it more than once, then
            # we only include data from the last occurrence.
            # NB: If you wanted to include more than one item of data for the name, then you could
            # utilise a deque again by building it within this date group
            to_add[row[2]] = row[3:5]
        for key, val in to_add.iteritems():
            # We've finished the date, so before processing the next one, update the previous data
            # for the names. In this case, we push a single item of data to the front of the deque.
            # If we were storing multiple items in the date loop, then we could .extendleft() instead
            # to insert > 1 set of data from above.
            names_previous[key].appendleft(val)
This keeps only the names and the last 3 values per name in memory during the run.
You may wish to adjust it to write corrected headers to the output, instead of just skipping the header row on input.
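As a rough illustration of that header adjustment, here is a minimal Python 3 sketch of the same groupby-plus-deque idea. It is not the answer's original code: the function names `append_history` and `extra_headers`, the `depth` parameter, and the generated header names are all assumptions made for the example, and the input header row is copied through unchanged.

```python
import csv
import io
from collections import defaultdict, deque
from itertools import chain, groupby
from operator import itemgetter


def extra_headers(depth):
    """Build hypothetical header names for the appended columns (an assumption)."""
    names = []
    for i in range(depth):
        suffix = '' if i == 0 else '+%d' % i
        names += ['LTscore' + suffix, 'LTParameter' + suffix]
    return names


def append_history(fin, fout, depth=3):
    """Copy CSV rows from fin to fout, appending each name's values from earlier dates.

    Assumes the input is sorted by date (column 1) and has one header row.
    """
    reader = csv.reader(fin)
    writer = csv.writer(fout, lineterminator='\n')
    # Write the input header plus the new column names instead of dropping it.
    writer.writerow(next(reader) + extra_headers(depth))
    # Each name starts with `depth` placeholder pairs; the deque is bounded,
    # so appendleft() on a full deque drops the oldest pair from the right.
    history = defaultdict(lambda: deque([['x', 'y']] * depth, maxlen=depth))
    for _date, rows in groupby(reader, key=itemgetter(1)):
        latest = {}
        for row in rows:
            writer.writerow(row + list(chain.from_iterable(history[row[2]])))
            # Remember only the last (score, parameter) pair seen for this date.
            latest[row[2]] = row[3:5]
        # The date group is finished, so its values now count as "previous".
        for name, pair in latest.items():
            history[name].appendleft(pair)
```

With real files, open both with `newline=''` in text mode (as the `csv` module documentation recommends for Python 3) and pass the file objects to `append_history`.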
My two cents:
- Python 2.7.5
- I used a defaultdict to hold the previous rows for each Name.
- I used bounded-length deques to hold previous rows because I liked the FIFO behavior of a full deque. It made it easy for me to think about it - just keep shoving stuff into it.
- I used operator.itemgetter() for indexing and slicing because it just reads better.
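The two building blocks named in the list above can be tried in isolation. This is only a small sketch with made-up values (the `recent` deque and the sample `row` are illustrative, not part of the solution); it shows a full bounded deque dropping its oldest item and `itemgetter` being used both for a single index and for a slice.

```python
from collections import deque
from operator import itemgetter

# A deque with maxlen=3 keeps only the three most recent items:
# appending to a full deque silently discards the item at the other end.
recent = deque(maxlen=3)
for score in [219, 229, 209, 319]:
    recent.append(score)
print(list(recent))  # [229, 209, 319]

# itemgetter builds a reusable accessor; given a slice it returns a sub-list.
row = ['data', '01/09/13', 'george', '219', 'dataa', 'text']
name = itemgetter(2)
score_and_param = itemgetter(slice(3, 5))
print(name(row))             # george
print(score_and_param(row))  # ['219', 'dataa']
```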
from collections import deque, defaultdict
import csv
from functools import partial
from operator import itemgetter

# use a 3 item deque to hold the
# previous three rows for each name
deck3 = partial(deque, maxlen = 3)
data = defaultdict(deck3)

name = itemgetter(2)
date = itemgetter(1)
sixplus = itemgetter(slice(6,None))

fields = ['Datatitle', 'Date', 'Name', 'Score', 'Parameter',
          'LTscore', 'LTParameter', 'LTscore+1', 'LTParameter+1',
          'LTscore+2', 'LTParameter+3']

with open('data.txt') as infile, open('processed.txt', 'wb') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    writer.writerow(fields)
    # comment out the next line if the data file does not have a header row
    reader.next()
    for row in reader:
        default = deque(['x', 'y', 'x', 'y', 'x', 'y'], maxlen = 6)
        try:
            previous_row = data[name(row)][-1]
            previous_date = date(previous_row)
        except IndexError:
            previous_date = None
        if previous_date == date(row):
            # use the xtra stuff from last time
            row.extend(sixplus(previous_row))
            # discard the previous row because
            # there is a new row with the same date
            data[name(row)].pop()
        else:
            # add columns 3 and 4 from each previous row
            for deck in data[name(row)]:
                # adding new items to a full deque causes
                # items to drop off the other end
                default.appendleft(deck[4])
                default.appendleft(deck[3])
            row.extend(default)
        writer.writerow(row)
        data[name(row)].append(row)
After thinking about that solution a bit over a glass of port I realized it was just way too complicated - that tends to happen when I try to be fancy. Not really sure about the protocol so I'll leave it up - it does have a possible advantage of maintaining the previous 3 rows for each name.
Here is a solution using slices and a regular dictionary. It only keeps the previously processed row. Much simpler. I kept the itemgetters, again for readability.
import csv
from operator import itemgetter

fields = ['Datatitle', 'Date', 'Name', 'Score', 'Parameter',
          'LTscore', 'LTParameter', 'LTscore+1', 'LTParameter+1',
          'LTscore+2', 'LTParameter+3']

name = itemgetter(2)
date = itemgetter(1)
cols_sixplus = itemgetter(slice(6,None))
cols34 = itemgetter(slice(3, 5))
cols6_9 = itemgetter(slice(6, 10))

data_alt = {}

with open('data.txt') as infile, open('processed_alt.txt', 'wb') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    writer.writerow(fields)
    # comment out the next line if the data file does not have a header row
    reader.next()
    for row in reader:
        try:
            previous_row = data_alt[name(row)]
        except KeyError:
            # first time this name encountered
            row.extend(['x', 'y', 'x', 'y', 'x', 'y'])
            data_alt[name(row)] = row
            writer.writerow(row)
            continue
        if date(previous_row) == date(row):
            # use the xtra stuff from last time
            row.extend(cols_sixplus(previous_row))
        else:
            row.extend(cols34(previous_row))
            row.extend(cols6_9(previous_row))
        data_alt[name(row)] = row
        writer.writerow(row)
I have found, for similar types of processing, that accumulating the rows and writing them in chunks, instead of individually, can enhance performance quite a bit. Also, if possible, reading the entire data file at once helps.
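One way to write in chunks is to collect processed rows into batches and hand each batch to `csv.writer.writerows`. The sketch below is written for Python 3; the helper name `write_in_chunks` and the `chunk_size` default are assumptions for illustration only. Whether it is actually faster depends on the data and the platform, so it is worth timing against the row-by-row version.

```python
import csv
import io
from itertools import islice


def write_in_chunks(writer, rows, chunk_size=1000):
    """Write rows through a csv writer in batches; return the number of rows written."""
    rows = iter(rows)
    total = 0
    while True:
        # Pull at most chunk_size rows from the iterator without loading the rest.
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            break
        writer.writerows(chunk)
        total += len(chunk)
    return total


# Usage with an in-memory buffer standing in for the output file.
buffer = io.StringIO()
sample_rows = [['data', '01/09/13', 'george', '219'],
               ['data', '01/09/13', 'fred', '219'],
               ['data', '01/09/13', 'tom', '219']]
written = write_in_chunks(csv.writer(buffer, lineterminator='\n'), sample_rows, chunk_size=2)
```

Because `rows` can be any iterable, a generator that yields processed rows can be passed in directly, which keeps memory use bounded by `chunk_size`.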