I have a list of lists containing [yyyy, value] items, with each sub-list ordered by increasing year. Here is a sample:
A = [
[[2008, 5], [2009, 5], [2010, 2], [2011, 5], [2013, 17]],
[[2008, 6], [2009, 3], [2011, 1], [2013, 6]], [[2013, 9]],
[[2008, 4], [2011, 1], [2013, 4]],
[[2010, 3], [2011, 3], [2013, 1]],
[[2008, 2], [2011, 4], [2013, 1]],
[[2009, 1], [2010, 1], [2011, 3], [2013, 3]],
[[2010, 1], [2011, 1], [2013, 5]],
[[2011, 1], [2013, 4]],
[[2009, 1], [2013, 4]],
[[2008, 1], [2013, 3]],
[[2009, 1], [2013, 2]],
[[2013, 2]],
[[2011, 1], [2013, 1]],
[[2013, 1]],
[[2013, 1]],
[[2011, 1]],
[[2011, 1]]
]
What I need is to insert all the missing years between min(year) and max(year) and to make sure that the order is preserved. So, for example, taking the first sub-list of A:
[2008, 5], [2009, 5], [2010, 2], [2011, 5], [2013, 17]
should look like:
[min_year, 0]...[2008, 5], [2009, 5], [2010, 2], [2011, 5], [2012, 0],[2013, 17],..[max_year, 0]
Moreover, if a sub-list contains only a single item, the same process should apply, so that the original item keeps its correct position and the remaining [year, value] items from min to max are inserted around it.
Any ideas?
Thanks.
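minyear = 2008
maxyear = 2013
new_a = []
for group in A:
    group = list(group)  # copy, so the sub-lists of A are not modified in place
    years = [point[0] for point in group]
    for year in range(minyear, maxyear + 1):
        if year not in years:
            group.append([year, 0])  # a missing year gets the value 0
    new_a.append(sorted(group))  # sorting by year restores the order
print(new_a)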
This produces:
[ [[2008, 5], [2009, 5], [2010, 2], [2011, 5], [2012, 0], [2013, 17]],
[[2008, 6], [2009, 3], [2010, 0], [2011, 1], [2012, 0], [2013, 6]],
[[2008, 0], [2009, 0], [2010, 0], [2011, 0], [2012, 0], [2013, 9]],
[[2008, 4], [2009, 0], [2010, 0], [2011, 1], [2012, 0], [2013, 4]],
[[2008, 0], [2009, 0], [2010, 3], [2011, 3], [2012, 0], [2013, 1]],
[[2008, 2], [2009, 0], [2010, 0], [2011, 4], [2012, 0], [2013, 1]],
[[2008, 0], [2009, 1], [2010, 1], [2011, 3], [2012, 0], [2013, 3]],
[[2008, 0], [2009, 0], [2010, 1], [2011, 1], [2012, 0], [2013, 5]],
[[2008, 0], [2009, 0], [2010, 0], [2011, 1], [2012, 0], [2013, 4]],
[[2008, 0], [2009, 1], [2010, 0], [2011, 0], [2012, 0], [2013, 4]],
[[2008, 1], [2009, 0], [2010, 0], [2011, 0], [2012, 0], [2013, 3]],
[[2008, 0], [2009, 1], [2010, 0], [2011, 0], [2012, 0], [2013, 2]],
[[2008, 0], [2009, 0], [2010, 0], [2011, 0], [2012, 0], [2013, 2]],
[[2008, 0], [2009, 0], [2010, 0], [2011, 1], [2012, 0], [2013, 1]],
[[2008, 0], [2009, 0], [2010, 0], [2011, 0], [2012, 0], [2013, 1]],
[[2008, 0], [2009, 0], [2010, 0], [2011, 0], [2012, 0], [2013, 1]],
[[2008, 0], [2009, 0], [2010, 0], [2011, 1], [2012, 0], [2013, 0]],
[[2008, 0], [2009, 0], [2010, 0], [2011, 1], [2012, 0], [2013, 0]]]
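The bounds above are hardcoded. If they should instead come from the data, as the question's min(year) and max(year) suggest, a minimal sketch could look like this. The name year_bounds is only illustrative, and sample is a shortened stand-in for A:

```python
def year_bounds(groups):
    """Return (smallest year, largest year) found across all sub-lists."""
    years = [point[0] for group in groups for point in group]
    return min(years), max(years)

# Shortened sample in the same shape as A
sample = [
    [[2008, 5], [2009, 5], [2013, 17]],
    [[2013, 9]],
    [[2011, 1]],
]
minyear, maxyear = year_bounds(sample)
print(minyear, maxyear)
```

Note that min() and max() raise ValueError when there are no items at all, so an empty input would need its own handling.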
Here you go, hope you like it!
min_year = 2007  # for testing purposes I used these years
max_year = 2014
final_list = []  # the corrected sub-lists get added to this list
for outer in A:  # start by iterating through each outer list in A
    active_years = {}  # dictionary that maps each year in this list to its value
    for inner in outer:  # create a dictionary entry for each [year, value] pair
        active_years[inner[0]] = inner[1]  # the key is the year (index 0), the value is index 1
    for year in range(min_year, max_year + 1):  # now give all the other years the value 0
        if year not in active_years:  # only add the years not in your dictionary already
            active_years[year] = 0
    new_outer = []  # this will be your new outer list
    for entry in sorted(active_years):  # sorted() guarantees year order; a plain dict does not
        new_outer += [[entry, active_years[entry]]]  # create your new outer list, watch the brackets carefully
    final_list += [new_outer]  # add to the final_list
print(final_list)  # presto
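The same dictionary idea can be written more compactly. This is only a sketch, not the code above: fill_years is a made-up name, and it assumes every sub-list holds [year, value] pairs with at most one entry per year (a later duplicate would overwrite an earlier one).

```python
def fill_years(groups, min_year, max_year):
    """Return new sub-lists with every year in [min_year, max_year] present."""
    filled = []
    for group in groups:
        # Map year -> value for the years this sub-list already has
        values = dict((year, value) for year, value in group)
        # Walk the full range in order; missing years default to 0
        filled.append([[year, values.get(year, 0)]
                       for year in range(min_year, max_year + 1)])
    return filled

sample = [[[2008, 6], [2009, 3], [2011, 1], [2013, 6]], [[2011, 1]]]
for row in fill_years(sample, 2008, 2013):
    print(row)
```

Because the result is built by walking the range, no sorting is needed and the input lists are left untouched. Any year outside the given range is silently dropped, which may or may not be what you want.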
How about:
import numpy as np

def np_fill(data, min_year, max_year):
    # Set up an array of [year, 0] rows, one block per sub-list
    year_range = np.arange(min_year, max_year + 1)
    unit = np.dstack((year_range, np.zeros(max_year - min_year + 1)))
    overall = np.tile(unit, (len(data), 1, 1)).astype(int)  # np.int was removed in NumPy 1.24
    # Change the list to a list of ndarrays
    data = [np.array(line) for line in data]
    for num, line in enumerate(data):
        # Find correct indices and update overall array
        index = np.searchsorted(year_range, line[:, 0])
        overall[num, index, 1] = line[:, 1]
    return overall
Run the code:
print(np_fill(A, 2008, 2013)[:2])
[[[2008    5]
  [2009    5]
  [2010    2]
  [2011    5]
  [2012    0]
  [2013   17]]

 [[2008    6]
  [2009    3]
  [2010    0]
  [2011    1]
  [2012    0]
  [2013    6]]]
print(np_fill(A, 2008, 2013).shape)
(18, 6, 2)
The second line of A contains an extra sub-list, [[2013, 9]], after the first one, so 2013 appears twice on that line; not sure if this is purposeful or not.
A few timings, because I was curious; the source code can be found here. Please let me know if you find an error.
For start year / end year- (2008,2013):
np_fill took 0.0454630851746 seconds.
tehsockz_fill took 0.00737619400024 seconds.
zeke_fill_fill took 0.0146050453186 seconds.
I was kind of expecting this: it takes a lot of time to convert to numpy arrays. To break even, it looks like the span of the years needs to be about 30:
For start year / end year- (1985,2013):
np_fill took 0.049400806427 seconds.
tehsockz_fill took 0.0425939559937 seconds.
zeke_fill_fill took 0.0748357772827 seconds.
Numpy of course does progressively better from there. If you need to return a numpy array for whatever reason, the numpy algorithm is always faster.
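The timing source is not reproduced here, so the following is only a rough sketch of how such a comparison could be run with the standard timeit module. list_fill is a stand-in written for this example, not one of the functions timed above, and both the sample data and the repeat count are arbitrary.

```python
import timeit

def list_fill(groups, min_year, max_year):
    """Stand-in fill function: zero-fill missing years in each sub-list."""
    result = []
    for group in groups:
        values = dict((year, value) for year, value in group)
        result.append([[y, values.get(y, 0)]
                       for y in range(min_year, max_year + 1)])
    return result

def time_fill(func, groups, min_year, max_year, number=100):
    """Return the total seconds taken by `number` calls of func."""
    return timeit.timeit(lambda: func(groups, min_year, max_year),
                         number=number)

# Arbitrary test data: two small sub-lists repeated 50 times
sample = [[[2008, 5], [2013, 17]], [[2011, 1]]] * 50
for span in [(2008, 2013), (1985, 2013)]:
    seconds = time_fill(list_fill, sample, span[0], span[1])
    print("For start year / end year- %s: list_fill took %s seconds."
          % (span, seconds))
```

Any other fill function with the same (data, min_year, max_year) signature can be passed to time_fill in place of list_fill. Absolute numbers will differ by machine and Python version, so only the relative ordering is worth comparing.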