
Numpy: how to convert observations to probabilities?

Tags:

python

numpy

I have a feature matrix and a corresponding array of targets, which are ones or zeros:

# raw observations
features = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])

targets = np.array([1, 0, 1, 1, 0, 0])

As you can see, the same feature row may appear with both ones and zeros as targets. I need to convert my raw observation matrix into a probability matrix, where each unique feature row maps to the probability of seeing a one as the target:

[1 1 0] -> 0.5
[0 1 0] -> 0.67
[0 0 1] -> 0
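
Each value is simply the mean of the targets over the rows that share the same features. As a minimal check of the 0.67 entry (a sketch that only assumes the arrays above; the name `mask` is introduced here for illustration):

```python
import numpy as np

features = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])
targets = np.array([1, 0, 1, 1, 0, 0])

# Boolean mask selecting the rows equal to [0, 1, 0] (hypothetical helper name)
mask = (features == np.array([0, 1, 0])).all(axis=1)

# Targets observed for those rows, and the fraction of them that are ones
matching_targets = targets[mask]
proba = matching_targets.mean()
print(matching_targets, proba)
```

Three rows match, two of them have target 1, so the probability is 2/3.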

I have come up with a fairly straightforward solution:

import numpy as np

# raw observations
features = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])

targets = np.array([1, 0, 1, 1, 0, 0])

from collections import Counter

def convert_obs_to_proba(features, targets):
    features_ = []
    targets_ = []

    # compute unique rows (idx will point to some representative)
    b = np.ascontiguousarray(features).view(np.dtype((np.void, features.dtype.itemsize * features.shape[1])))
    _, idx = np.unique(b, return_index=True)

    idx = idx[::-1]

    zeros = Counter()
    ones = Counter()

    # collect row-wise number of one and zero targets
    for i, row in enumerate(features[:]):        
        if targets[i] == 0:
            zeros[tuple(row)] += 1
        else:
            ones[tuple(row)] += 1

    # iterate over unique features and compute probabilities
    for k in idx:
        unique_row = features[k]

        zero_count = zeros[tuple(unique_row)]
        one_count = ones[tuple(unique_row)]

        proba = float(one_count) / float(zero_count + one_count)

        features_.append(unique_row)
        targets_.append(proba)

    return np.array(features_), np.array(targets_)

features_, targets_ = convert_obs_to_proba(features, targets)

print(features_)
print(targets_)

which:

  • extracts the unique feature rows;
  • counts the number of zero and one targets for each unique feature row;
  • computes the probability and constructs the result.

Could it be solved in a prettier way using some advanced numpy magic?

Update: my previous code was quite inefficient, O(n^2), so I rewrote it into the more performance-friendly version shown above. The old code, for reference:

import numpy as np

# raw observations
features = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])

targets = np.array([1, 0, 1, 1, 0, 0])

def convert_obs_to_proba(features, targets):
    features_ = []
    targets_ = []

    # compute unique rows (idx will point to some representative)
    b = np.ascontiguousarray(features).view(np.dtype((np.void, features.dtype.itemsize * features.shape[1])))
    _, idx = np.unique(b, return_index=True)

    idx = idx[::-1]

    # calculate ZERO class occurrences and ONE class occurrences
    for k in idx:
        unique_row = features[k]

        zeros = 0
        ones = 0

        for i, row in enumerate(features[:]):        
            if np.array_equal(row, unique_row):            
                if targets[i] == 0:
                    zeros += 1
                else:
                    ones += 1

        proba = float(ones) / float(zeros + ones)

        features_.append(unique_row)
        targets_.append(proba)

    return np.array(features_), np.array(targets_)

features_, targets_ = convert_obs_to_proba(features, targets)

print(features_)
print(targets_)
Denis Kulagin asked Mar 30 '17 12:03



1 Answer

It's easy using Pandas:

import pandas as pd

df = pd.DataFrame(features)
df['targets'] = targets

Now you have:

   0  1  2  targets
0  1  1  0        1
1  1  1  0        0
2  0  1  0        1
3  0  1  0        1
4  0  1  0        0
5  0  0  1        0

Now, the fancy part:

df.groupby([0,1,2]).targets.mean()

Gives you:

0  1  2
0  0  1    0.000000
   1  0    0.666667
1  1  0    0.500000
Name: targets, dtype: float64

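Pandas leaves repeated outer index labels blank when it prints a MultiIndex, which is why no 0 is shown at the left of the 0.666667 row. If you inspect the index value there, it is indeed 0. Calling reset_index() on the result turns the index levels back into ordinary columns.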
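
If you would rather stay in NumPy, the same grouping can be sketched with np.unique and np.bincount. This is a sketch rather than a definitive implementation: it assumes NumPy 1.13 or later (where np.unique accepts the axis argument), and the function name group_mean_targets is made up for illustration.

```python
import numpy as np

features = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])
targets = np.array([1, 0, 1, 1, 0, 0])


def group_mean_targets(features, targets):
    """Hypothetical helper: mean target per unique feature row."""
    # Unique rows, plus for every original row the index of its unique row
    unique_rows, inverse = np.unique(features, axis=0, return_inverse=True)
    # Flatten in case the installed NumPy returns a 2-D inverse with axis=0
    inverse = inverse.ravel()

    n_groups = len(unique_rows)
    # Number of ones per group (sum of the 0/1 targets)
    ones = np.bincount(inverse, weights=targets, minlength=n_groups)
    # Total number of observations per group
    totals = np.bincount(inverse, minlength=n_groups)

    return unique_rows, ones / totals


features_, targets_ = group_mean_targets(features, targets)
print(features_)
print(targets_)
```

The unique rows come back sorted lexicographically, the same order as the groupby output above, so the row order differs from that of the function in the question.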

John Zwinck answered Nov 01 '22 04:11
