I'm getting this warning:

    UserWarning: Boolean Series key will be reindexed to match DataFrame index.

Should I be concerned about it?

I have a CSV file with 3 columns: org, month, person.
| org | month | person |
| --- | ---------- | ------ |
| 1 | 2014-01-01 | 100 |
| 1 | 2014-01-01 | 200 |
| 1 | 2014-01-02 | 200 |
| 2 | 2014-01-01 | 300 |
Which I've read into a `pandas.core.frame.DataFrame` with:

    data = pd.read_csv('data_base.csv', names=['month', 'org', 'person'], skiprows=1)
The end goal is to compare the intersection of persons between two consecutive periods with the set of persons in the first period. For example, for org 1, comparing month 2014-01-01 with the next month:

    count(intersection({100, 200}, {200})) / len({100, 200}) == 0.5
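In plain Python set terms, using the two months that org 1 has in the sample table (variable names here are just for illustration):

    # Persons for org 1 in the sample data above.
    s1 = {100, 200}   # 2014-01-01
    s2 = {200}        # 2014-01-02

    # Share of the first month's persons who also appear in the next month.
    ratio = len(s1 & s2) / len(s1)   # 1 / 2 == 0.5 (Python 3 division)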
Edit: I got it to work with:
    import pandas as pd
    import sys

    data = pd.read_csv('data_base.csv', names=['month', 'org', 'person'], skiprows=1)
    data = data.sort_values(by=['org', 'month', 'person'])

    results = {}
    for _org in set(data.org):
        results[_org] = {}
        months = sorted(set(data[data.org == _org].month))
        for _m1, _m2 in zip(months, months[1:]):
            # Chained boolean indexing -- this is what triggers the UserWarning.
            _s1 = set(data[data.org == _org][data.month == _m1].person)
            _s2 = set(data[data.org == _org][data.month == _m2].person)
            results[_org][_m1] = len(_s1 & _s2) / len(_s1)
            print(str(_org) + '\t' + str(_m1) + '\t' + str(_m2) + '\t' + str(round(results[_org][_m1], 2)))
            sys.stdout.flush()
Which produces output like this:
    UserWarning: Boolean Series key will be reindexed to match DataFrame index.
    5640    2014-01-01    2014-02-01    0.75
    5640    2014-02-01    2014-03-01    0.36
    5640    2014-03-01    2014-04-01    0.6
    ...
But it's really slow and kind of ugly... at the current rate, my back-of-the-envelope calculation estimates about 22 hours for a 2-year batch of data.
Admittedly, I have never used Pandas, so this may not be idiomatic. This just uses basic Python structures.
    import collections

    # Put the data into a simple, indexed data structure:
    # a set of persons for every (org, month) pair.
    org_month_dict = collections.defaultdict(set)
    for index, row in data.iterrows():
        org_month_dict[row['org'], row['month']].add(row['person'])

    orgs = set(data.org)
    months = sorted(set(data.month))

    for org in orgs:
        for mindex in range(len(months) - 1):
            m1 = months[mindex]
            m2 = months[mindex + 1]
            # Persons in common between month 1 and month 2.
            print(org_month_dict[org, m2] & org_month_dict[org, m1])
This creates a "cached" lookup table in `org_month_dict`, indexed by organization and month, saving you from doing the expensive `data[data.org == _org][data.month == _m1]` lookup in your inner loop. It should run significantly faster than your original code.
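As an aside on the warning in the question: it comes from the chained selection `data[data.org == _org][data.month == _m1]`. The inner boolean mask is built against the full `data`, but it is applied to an already-filtered frame, so pandas has to reindex the mask first, which is what the `UserWarning` is telling you. A single combined mask, `data[(data.org == _org) & (data.month == _m1)]`, avoids it. If you'd rather stay in pandas, here is a minimal sketch of the same lookup-table idea using `groupby` (assuming Python 3 and the column names from the question; `person_sets` is just an illustrative name). It builds every `(org, month)` set in one pass and then computes the overlap ratio you described:

    import pandas as pd

    data = pd.read_csv('data_base.csv', names=['month', 'org', 'person'], skiprows=1)

    # One set of persons per (org, month) pair, built in a single pass.
    person_sets = data.groupby(['org', 'month'])['person'].apply(set)

    results = {}
    for org, by_month in person_sets.groupby(level='org'):
        months = sorted(by_month.index.get_level_values('month'))
        for m1, m2 in zip(months, months[1:]):
            s1 = person_sets.loc[(org, m1)]
            s2 = person_sets.loc[(org, m2)]
            # Fraction of month m1's persons who also appear in month m2.
            results[org, m1] = len(s1 & s2) / len(s1)
            print(org, m1, m2, round(results[org, m1], 2), sep='\t')

Because the per-month sets are computed once up front, the inner loop is the same cheap dictionary-style lookup as `org_month_dict` above, just expressed in pandas.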