Using Numpy to find average value across data sets, with some missing data

I have several (10 or so) CSV-formatted data sets. Each column of a data set represents one aspect of a running system (available RAM, CPU usage, open TCP connections, and so forth). Each row contains the values for these columns at one moment in time.

The data sets were captured during individual runs of the same test. The number of rows is not guaranteed to be the same in each data set (i.e., some tests ran longer than others).

I want to produce a new CSV file that represents the "average" value, across all data sets, for a given time offset and a given column. Ideally, values missing in one data set would be ignored. If necessary, though, missing values could be assumed to be the same as the last known value, or the average of known values for that row.

A simplified example:

+---------------+    +---------------+       +---------------+
|     Set 1     |    |     Set 2     |       |    Average    |
+---+-----+-----+    +---+-----+-----+       +---+-----+-----+
| t |  A  |  B  |    | t |  A  |  B  |       | t |  A  |  B  |
+---+-----+-----+    +---+-----+-----+       +---+-----+-----+
| 1 | 10  | 50  |    | 1 | 12  | 48  |       | 1 | 11  | 49  |   
| 2 | 13  | 58  |    | 2 |  7  | 60  |       | 2 | 10  | 59  |   
| 3 |  9  | 43  |    | 3 | 17  | 51  |  =>   | 3 | 13  | 47  |   
| 4 | 14  | 61  |    | 4 | 12  | 57  |       | 4 | 13  | 59  |   
| : |  :  |  :  |    | : |  :  |  :  |       | : |  :  |  :  |   
| 7 |  4  | 82  |    | 7 | 10  | 88  |       | 7 |  7  | 85  |   
+---+-----+-----+    | 8 | 15  | 92  |       | 8 | 15  | 92  |
                     | 9 |  6  | 63  |       | 9 |  6  | 63  |
                     +---+-----+-----+       +---+-----+-----+

I'm new to numpy, having picked it up specifically for this project. What's the best way to do this? For data sets with the same number of rows (which I've been forcing by chopping longer data sets short), I just do:

d_avg = sum(dsets) / float(len(dsets))

where "dsets" is a list of the ndarrays containing the data from each CSV file. This works well, but I don't want to discard the data from the longer runs.

I can also resize the shorter runs to the length of the longest, but then all the new fields are filled with None. Later operations fail when adding (for example) a float and a NoneType.
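Padding with np.nan instead, and averaging with something like np.nanmean, seems like it might sidestep that; a rough, untested sketch of what I mean (same dsets as above, all with the same columns):

import numpy as np

# Pad each shorter data set with rows of NaN up to the longest length,
# stack them along a third axis, and average while ignoring the NaNs.
max_len = max(d.shape[0] for d in dsets)
padded = [np.vstack([d, np.full((max_len - d.shape[0], d.shape[1]), np.nan)])
          for d in dsets]
d_avg = np.nanmean(np.dstack(padded), axis=-1)   # NaN cells are ignored per entry

But I don't know if that's the idiomatic approach.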

Any suggestions?

asked Nov 20 '09 by Lemur



2 Answers

Why not just use numpy's ma (masked array) module?

import numpy as np
from functools import reduce   # builtin in Python 2; the import is needed on Python 3

# Length of the longest data set
maxLen = reduce(lambda a, b: max(a, b.shape[0]), dSets, 0)

# Masked array with one extra trailing axis holding the data sets side by side
stacked = np.ma.zeros((maxLen,) + dSets[0].shape[1:] + (len(dSets),),
                      dtype=float)     # set the dtype to whatever
stacked.mask = True                    # start with everything masked (missing)
for i, dset in enumerate(dSets):
    stacked.mask[:len(dset), ..., i] = False   # unmask the rows this set covers
    stacked[:len(dset), ..., i] = dset

mean = stacked.mean(axis=-1)           # per-cell mean, ignoring masked entries

Of course, this only works if you can guarantee that the time in each row is the same across all arrays, i.e. that row i carries the same t-value in every data set.
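For example, running the above on two tiny hand-made arrays (the names are just for the demo):

import numpy as np
from functools import reduce

dSets = [np.array([[1., 10., 50.],
                   [2., 13., 58.]]),
         np.array([[1., 12., 48.],
                   [2.,  7., 60.],
                   [3., 17., 51.]])]

maxLen = reduce(lambda a, b: max(a, b.shape[0]), dSets, 0)
stacked = np.ma.zeros((maxLen,) + dSets[0].shape[1:] + (len(dSets),), dtype=float)
stacked.mask = True
for i, dset in enumerate(dSets):
    stacked.mask[:len(dset), ..., i] = False
    stacked[:len(dset), ..., i] = dset

print(stacked.mean(axis=-1))
# [[1.0 11.0 49.0]
#  [2.0 10.0 59.0]
#  [3.0 17.0 51.0]]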

answered by Rupert Nash

Edit: I've revised my method, abandoning scipy.nanmean in favor of masked arrays.

If it is unclear what the code is doing at any point, first try putting print statements in. If it is still unclear, feel free to ask; I'll try my best to explain. The tricky part is getting the t-values merged. (That was done with the numpy array's searchsorted method.)
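For instance, searchsorted just maps each data set's t-values to row indices in the merged, sorted t-value list:

import numpy as np

t_values = np.array([1., 2., 3., 4., 7., 8., 9.])   # merged t-values
print(t_values.searchsorted([1., 2., 3., 4., 7.]))  # set1's t-values
# [0 1 2 3 4] -- rows of set1 land at these rows of the output grid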

Playing with numpy has led me to believe that its speed advantages may not exist until the datasets get quite big (maybe you'll need at least 10,000 rows per data set). Otherwise, a pure python solution may be both easier to write and faster.
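A sketch of the pure python version I have in mind (using only the csv module and a dict keyed by t; assumes the two toy files below):

import csv
from collections import defaultdict

sums = defaultdict(lambda: [0.0, 0.0])   # t -> [sum of a, sum of b]
counts = defaultdict(int)                # t -> how many data sets contain t
for filename in ('set1', 'set2'):
    with open(filename, newline='') as f:
        for t, a, b in csv.reader(f):
            t = float(t)
            sums[t][0] += float(a)
            sums[t][1] += float(b)
            counts[t] += 1

for t in sorted(sums):
    print('%g, %g, %g' % (t, sums[t][0] / counts[t], sums[t][1] / counts[t]))
# 1, 11, 49
# 2, 10, 59
# ...and so on for t = 3, 4, 7, 8, 9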

Here are the toy datasets I used:

% cat set1
1, 10, 50
2, 13, 58
3,9,43
4,14,61
7, 4, 82

% cat set2
1, 12, 48
2, 7, 60
3,17,51
4,12,57
7,10,88
8,15,92
9,6,63

And here is the code:

#!/usr/bin/env python
import numpy as np
from functools import reduce   # needed on Python 3

filenames=('set1','set2')   # change this to list all your csv files
column_names=('t','a','b')

# slurp the csv data files into a list of numpy arrays
data=[np.loadtxt(filename, delimiter=',') for filename in filenames]

# Find the complete list of t-values
# For each elt in data, elt[a,b] is the value in the a_th row and b_th column
t_values=np.array(list(reduce(set.union,(set(elt[:,0]) for elt in data))))
t_values.sort()
# print(t_values)
# [ 1.  2.  3.  4.  7.  8.  9.]

num_rows=len(t_values)
num_columns=len(column_names)
num_datasets=len(filenames)

# For each data set, we compute the indices of the t_values that are used.
idx=[(t_values.searchsorted(data[n][:,0])) for n in range(num_datasets)]

data2 = np.ma.zeros((num_rows, num_columns, num_datasets))
for n in range(num_datasets):
    data2[idx[n], :, n] = data[n][:, :]
# Cells that were never assigned above are still 0; mask them so
# mean() ignores them. (Note: this also masks genuine zero values.)
data2 = np.ma.masked_equal(data2, 0)
averages = data2.mean(axis=-1)   # mean across data sets, masked cells ignored
print(averages)
# [[1.0 11.0 49.0]
#  [2.0 10.0 59.0]
#  [3.0 13.0 47.0]
#  [4.0 13.0 59.0]
#  [7.0 7.0 85.0]
#  [8.0 15.0 92.0]
#  [9.0 6.0 63.0]]
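One caveat: masked_equal(0) also masks any genuine zero measurement. If your data can legitimately contain zeros, build the mask by construction instead; a sketch using np.ma.masked_all (everything starts masked, and assignment unmasks):

data2 = np.ma.masked_all((num_rows, num_columns, num_datasets))
for n in range(num_datasets):
    data2[idx[n], :, n] = data[n][:, :]   # assigned cells are unmasked automatically
averages = data2.mean(axis=-1)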
answered by unutbu