I'm trying to group values from a few CSV files into bins that are defined in an XML file (groups.xml). I have the following code, which works to a certain extent but doesn't give what I'm expecting:
import os, sys
import glob
import pandas as pd
import xml.etree.cElementTree as ET

def xml_parse():
    try:
        os.chdir("path/to/files")
        filename = [file1 for file1 in glob.glob("*.csv")]
        filename = [i.split('.', 1)[0] for i in filename]
        #filename = '\n'.join(filename)
        os.chdir('..')
        output = []
        doc = ET.parse("groups.xml").getroot()
        for root_ele in doc.findall('Groups'):
            tag_ele = root_ele.find('GroupID').text
            for name in filename:
                if name == tag_ele.lower():
                    for root_ele1 in root_ele.findall('groupname'):
                        displayname = root_ele1.find('Name').text
                        minval = root_ele1.find('min').text
                        mininc = root_ele1.find('minInc').text
                        maxvalue = root_ele1.find('max')
                        maxinclusive = root_ele1.find('maxInc')
                        lists = []
                        frame = pd.DataFrame()
                        fname = "path/to/files" + name + ".csv"
                        df = pd.read_csv(fname, index_col=None, header=None)
                        lists.append(df)
                        frame = pd.concat(lists)
                        if maxvalue is not None:
                            maxval = maxvalue.text
                            if maxinclusive is not None:
                                maxinc = maxinclusive.text
                            df['bin'] = pd.cut(frame[1], [float(minval), float(maxval)], right=maxinc, include_lowest=mininc)
                            out = str(pd.concat([df['bin'], frame[1]], axis=1))
                            out = out.split("\n")[2:]
                            for a in out:
                                print a
                        else:
                            df['bin'] = pd.cut(frame[1], [float(minval)], include_lowest=mininc)
                            out = str(pd.concat([df['bin'], frame[1]], axis=1))
                            out = out.split("\n")[2:]
                            for a in out:
                                print a
                    break
    except AttributeError:
        pass
Current output:
1 NaN 10.18
2 NaN 25.16
3 NaN 44.48
4 NaN 85.24
5 NaN 36.71
6 NaN 77.09
7 NaN 81.88
8 NaN 22.92
9 NaN 44.31
10 NaN 15.79
1 [10, 18] 10.18
2 NaN 25.16
3 NaN 44.48
4 NaN 85.24
5 NaN 36.71
6 NaN 77.09
7 NaN 81.88
8 NaN 22.92
9 NaN 44.31
10 [10, 18] 15.79
1 NaN 10.18
2 [18, 35] 25.16
3 NaN 44.48
4 NaN 85.24
5 NaN 36.71
6 NaN 77.09
7 NaN 81.88
8 [18, 35] 22.92
9 NaN 44.31
10 NaN 15.79
1 NaN 10.18
2 NaN 25.16
3 [35, 50] 44.48
4 NaN 85.24
5 [35, 50] 36.71
6 NaN 77.09
7 NaN 81.88
8 NaN 22.92
9 [35, 50] 44.31
10 NaN 15.79
1 NaN 10.18
2 NaN 25.16
3 NaN 44.48
4 NaN 85.24
5 NaN 36.71
6 NaN 77.09
7 NaN 81.88
8 NaN 22.92
9 NaN 44.31
10 NaN 15.79
1 NaN 10.18
2 NaN 25.16
3 NaN 44.48
4 NaN 85.24
5 NaN 36.71
6 NaN 77.09
7 NaN 81.88
8 NaN 22.92
9 NaN 44.31
10 NaN 15.79
with an error:
Traceback (most recent call last):
  File "groups.py", line 69, in <module>
    xml_parse()
  File "groups.py", line 44, in xml_parse
    df['bin'] = pd.cut(frame[1], [float(minval)], include_lowest= mininc)
  File "C:\Python27\lib\site-packages\pandas\tools\tile.py", line 113, in cut
    include_lowest=include_lowest)
  File "C:\Python27\lib\site-packages\pandas\tools\tile.py", line 203, in _bins_to_cuts
    include_lowest=include_lowest)
  File "C:\Python27\lib\site-packages\pandas\tools\tile.py", line 252, in _format_levels
    levels[0] = '[' + levels[0][1:]
IndexError: list index out of range
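From what I can tell, the failing call passes only a single bin edge ([float(minval)]) for the open-ended >= 75 group, and pd.cut needs at least two edges to form an interval. A minimal sketch of what I suspect is going on (sample values made up for illustration):

import pandas as pd

s = pd.Series([85.24, 77.09, 81.88])   # made-up sample values

# A single edge defines zero intervals; this raises (the exact traceback
# depends on the pandas version):
# pd.cut(s, [75.0])

# Giving the open-ended group an explicit upper edge works:
print(pd.cut(s, [75.0, float('inf')], right=False))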
Expected output:
1 [10, 18] 10.18
2 [18, 35] 25.16
3 [35, 50] 44.48
4 [>= 75] 85.24   # or however >= 75 can be represented
5 [35, 50] 36.71
6 [>= 75] 77.09
7 [>= 75] 81.88
8 [18, 35] 22.92
9 [35, 50] 44.31
10 [10, 18] 15.79
Starting with:
df:
val1 val2
0 NaN 10
1 10.18 1
2 25.16 1
3 44.48 1
4 85.24 1
5 36.71 1
6 77.09 1
7 81.88 1
8 22.92 1
9 44.31 1
10 15.79 1
and
xml = """
<metaGroups>
<Groups>
<GroupID>age</GroupID>
<description>age</description>
<groupname>
<Name>0 - <10</Name>
<min>0</min>
<minInc>TRUE</minInc>
<max>10</max>
<maxInc>FALSE</maxInc>
</groupname>
<groupname>
<Name>10 - <18</Name>
<min>10</min>
<minInc>TRUE</minInc>
<max>18</max>
<maxInc>FALSE</maxInc>
</groupname>
<groupname>
<Name>18 - <35</Name>
<min>18</min>
<minInc>TRUE</minInc>
<max>35</max>
<maxInc>FALSE</maxInc>
</groupname>
<groupname>
<Name>35 - <50</Name>
<min>35</min>
<minInc>TRUE</minInc>
<max>50</max>
<maxInc>FALSE</maxInc>
</groupname>
<groupname>
<Name>50 - <65</Name>
<min>50</min>
<minInc>TRUE</minInc>
<max>65</max>
<maxInc>FALSE</maxInc>
</groupname>
<groupname>
<Name>65 - <75</Name>
<min>65</min>
<minInc>TRUE</minInc>
<max>75</max>
<maxInc>FALSE</maxInc>
</groupname>
<groupname>
<Name>&ge;75</Name>
<min>75</min>
<minInc>TRUE</minInc>
</groupname>
</Groups>
</metaGroups>
"""
You can use BeautifulSoup to extract the bin parameters, construct the labels, and apply pd.cut():
from bs4 import BeautifulSoup as Soup
from itertools import chain
import numpy as np
import pandas as pd

soup = Soup(xml, 'html.parser')

bins = []
for message in soup.findAll('groupname'):
    min_val = message.find('min').text          # every group has a <min>
    try:
        max_val = message.find('max').text
        bins.append([min_val, max_val])
    except AttributeError:                      # the last group has no <max>, i.e. it is open-ended
        bins.append([min_val])
at which point we have bins:
[['0', '10'], ['10', '18'], ['18', '35'], ['35', '50'], ['50', '65'], ['65', '75'], ['75']]
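As a side note, if you would rather stay with xml.etree.ElementTree (as in the question) instead of adding a BeautifulSoup dependency, the extraction could look roughly like the sketch below. It assumes the XML is well formed; the literal < and the &ge; entity inside the Name elements of the sample above would need to be escaped for a strict XML parser:

import xml.etree.cElementTree as ET

root = ET.fromstring(xml)              # or ET.parse("groups.xml").getroot()
bins = []
for group in root.iter('groupname'):
    min_val = group.find('min').text
    max_el = group.find('max')         # None for the open-ended >= 75 group
    if max_el is not None:
        bins.append([min_val, max_el.text])
    else:
        bins.append([min_val])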
Next, we'll flatten the list of lists, get rid of duplicates and add an upper bound:
labels = bins   # keep the [min, max] string pairs for building the label text
bins = list(np.unique(np.fromiter(chain.from_iterable(bins), dtype='int')))
bins.append(int(df.val1.max() + 1))   # explicit upper edge for the open-ended >= 75 group
which yields:
[0, 10, 18, 35, 50, 65, 75, 86]
Constructing the labels (using [>= min] for the open-ended group, to match the expected output):
labels = ['[{0} - {1}]'.format(label[0], label[1]) if len(label) > 1
          else '[>= {}]'.format(label[0]) for label in labels]
and using pd.cut():
df['binned'] = pd.cut(df.val1, bins=bins, labels=labels)
produces:
val1 val2 binned
1 10.18 1 [10 - 18]
2 25.16 1 [18 - 35]
3 44.48 1 [35 - 50]
4 85.24 1 [>= 75]
5 36.71 1 [35 - 50]
6 77.09 1 [>= 75]
7 81.88 1 [>= 75]
8 22.92 1 [18 - 35]
9 44.31 1 [35 - 50]
10 15.79 1 [10 - 18]
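One detail worth flagging: every group in the XML has minInc set to TRUE and maxInc set to FALSE, i.e. intervals of the form [min, max), whereas pd.cut defaults to right-closed intervals (min, max]. If the boundary behaviour matters, you could pass right=False; a minimal sketch under that assumption:

# Close the intervals on the left instead of the right,
# matching minInc=TRUE / maxInc=FALSE from the XML
df['binned'] = pd.cut(df.val1, bins=bins, labels=labels, right=False)

With the sample data none of the values fall exactly on a bin edge, so the table above comes out the same either way.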