How should I structure and access a table of data so that I can compare subsets easily in Python 3.5?

  1. Is there a faster, more Pythonic way of doing this?
  2. What is generating the warning "UserWarning: Boolean Series key will be reindexed to match DataFrame index.", and should I be concerned about it?

I have a CSV file with 3 columns: org, month, person.

| org |    month   | person |
| --- | ---------- | ------ |
|   1 | 2014-01-01 |    100 |
|   1 | 2014-01-01 |    200 |
|   1 | 2014-01-02 |    200 |
|   2 | 2014-01-01 |    300 |

I've read it into a pandas DataFrame with:

data = pd.read_csv('data_base.csv', names=['org', 'month', 'person'], skiprows=1)

The end goal is to compare the intersection of persons between two consecutive periods with the set of persons in the first period.

org: 1, month: 2014-01-01: len({100, 200} & {200}) / len({100, 200}) == 0.5
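
In plain Python sets, that metric for the sample rows above works out like this (a minimal sketch with the sample values hard-coded):

s1 = {100, 200}  # persons for org 1 in 2014-01-01
s2 = {200}       # persons for org 1 in 2014-01-02
print(len(s1 & s2) / len(s1))  # 0.5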

Edit: I got it to work with:

import pandas as pd
import sys

data = pd.read_csv('data_base.csv', names=['org', 'month', 'person'], skiprows=1)
data = data.sort_values(by=['org', 'month', 'person'])  # sort_values returns a new frame, so assign it back

results = {}
for _org in set(data.org):
    results[_org] = {}
    months = sorted(set(data[data.org == _org].month))
    for _m1, _m2 in zip(months, months[1:]):
        # the chained boolean indexing below is what raises the reindexing warning
        _s1 = set(data[data.org == _org][data.month == _m1].person)
        _s2 = set(data[data.org == _org][data.month == _m2].person)
        results[_org][_m1] = len(_s1 & _s2) / len(_s1)  # / is already true division in Python 3
        print(str(_org) + '\t' + str(_m1) + '\t' + str(_m2) + '\t' + str(round(results[_org][_m1], 2)))
        sys.stdout.flush()

Which produces output like this:

UserWarning: Boolean Series key will be reindexed to match DataFrame index.
5640    2014-01-01  2014-02-01  0.75
5640    2014-02-01  2014-03-01  0.36
5640    2014-03-01  2014-04-01  0.6
...

But it's really slow and kind of ugly... at the current rate, a back-of-the-envelope calculation puts it at about 22 hours for a two-year batch of data.
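
For what it's worth, the warning appears to come from the chained indexing in the inner loop: data[data.org == _org] returns a sliced frame, but the second mask data.month == _m1 is still indexed against the full frame, so pandas has to reindex it to match. Combining both conditions into a single mask should silence the warning and avoid one of the two scans per lookup (an untested sketch of the inner-loop selection):

# one combined mask per selection instead of two chained boolean selections
mask1 = (data.org == _org) & (data.month == _m1)
mask2 = (data.org == _org) & (data.month == _m2)
_s1 = set(data.loc[mask1, 'person'])
_s2 = set(data.loc[mask2, 'person'])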

asked Feb 22 '16 by ZSH

1 Answer

Admittedly, I have never used Pandas, so this may not be idiomatic. This just uses basic Python structures.

import collections

org_month_dict = collections.defaultdict(set)

# put the data into a simple, indexed data structure
for index, row in data.iterrows():
    org_month_dict[row['org'], row['month']].add(row['person'])

orgs = set(data.org)
months = sorted(set(data.month))
for org in orgs:
    for mindex in range(len(months) - 1):
        m1 = months[mindex]
        m2 = months[mindex + 1]
        # persons in common between consecutive months
        print(org_month_dict[org, m1] & org_month_dict[org, m2])

This creates a "cached" lookup table in org_month_dict which is indexed by organization and month, saving you from doing the expensive data[data.org == _org][data.month == _m1] lookup in your inner loop. It should run significantly faster than your original code.
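
If you want the asker's ratio rather than the raw intersection, the same lookup table gives it directly; here is a sketch mirroring the original results structure:

# retention ratio per org and consecutive month pair, from the cached sets
results = {}
for org in orgs:
    for m1, m2 in zip(months, months[1:]):
        s1 = org_month_dict[org, m1]
        if s1:  # skip months where this org has no persons
            results.setdefault(org, {})[m1] = len(s1 & org_month_dict[org, m2]) / len(s1)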

answered Sep 24 '22 by nneonneo