Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Grouping values using pandas cut

I'm trying to group values in a few csv files into bins that are in an XML file (groups.xml). I have the following code that works to a certain extent but doesn't give what I'm expecting:

import os, sys
import glob
import pandas as pd
import xml.etree.cElementTree as ET

def xml_parse():
    try:
        os.chdir("path/to/files")
        filename = [file1 for file1 in glob.glob("*.csv")]
        filename = [i.split('.', 1)[0] for i in filename]
        #filename = '\n'.join(filename)
        os.chdir('..')
        output = []
        doc = ET.parse("groups.xml").getroot()
        for root_ele in doc.findall('Groups'):
            tag_ele = root_ele.find('GroupID').text
            for name in filename:
                if name == tag_ele.lower():
                    for root_ele1 in root_ele.findall('groupname'):
                        displayname = root_ele1.find('Name').text
                        minval = root_ele1.find('min').text
                        mininc = root_ele1.find('minInc').text
                        maxvalue = root_ele1.find('max')
                        maxinclusive = root_ele1.find('maxInc')
                        lists = []
                        frame = pd.DataFrame()
                        fname = "path/to/files" + name + ".csv"
                        df = pd.read_csv(fname, index_col=None, header=None)
                        lists.append(df)
                        frame = pd.concat(lists)
                        if maxvalue is not None:
                            maxval = maxvalue.text
                            if maxinclusive is not None:
                                maxinc = maxinclusive.text
                                df['bin'] = pd.cut(frame[1], [float(minval),float(maxval)], right= maxinc, include_lowest= mininc)
                                out = str(pd.concat([df['bin'], frame[1]], axis=1))
                                out = out.split("\n")[2:]
                                for a in out:
                                    print a
                        else:
                            df['bin'] = pd.cut(frame[1], [float(minval)], include_lowest= mininc)
                            out = str(pd.concat([df['bin'], frame[1]], axis=1))  
                            out = out.split("\n")[2:]
                            for a in out:
                                print a
            break
    except AttributeError:
        pass

Current output:

1   NaN  10.18
2   NaN  25.16
3   NaN  44.48
4   NaN  85.24
5   NaN  36.71
6   NaN  77.09
7   NaN  81.88
8   NaN  22.92
9   NaN  44.31
10  NaN  15.79
1   [10, 18]  10.18
2        NaN  25.16
3        NaN  44.48
4        NaN  85.24
5        NaN  36.71
6        NaN  77.09
7        NaN  81.88
8        NaN  22.92
9        NaN  44.31
10  [10, 18]  15.79
1        NaN  10.18
2   [18, 35]  25.16
3        NaN  44.48
4        NaN  85.24
5        NaN  36.71
6        NaN  77.09
7        NaN  81.88
8   [18, 35]  22.92
9        NaN  44.31
10       NaN  15.79
1        NaN  10.18
2        NaN  25.16
3   [35, 50]  44.48
4        NaN  85.24
5   [35, 50]  36.71
6        NaN  77.09
7        NaN  81.88
8        NaN  22.92
9   [35, 50]  44.31
10       NaN  15.79
1   NaN  10.18
2   NaN  25.16
3   NaN  44.48
4   NaN  85.24
5   NaN  36.71
6   NaN  77.09
7   NaN  81.88
8   NaN  22.92
9   NaN  44.31
10  NaN  15.79
1   NaN  10.18
2   NaN  25.16
3   NaN  44.48
4   NaN  85.24
5   NaN  36.71
6   NaN  77.09
7   NaN  81.88
8   NaN  22.92
9   NaN  44.31
10  NaN  15.79

with an error:

Traceback (most recent call last):
  File "groups.py", line 69, in <module>
    xml_parse()
  File "groups.py", line 44, in xml_parse
    df['bin'] = pd.cut(frame[1], [float(minval)], include_lowest= mininc)
  File "C:\Python27\lib\site-packages\pandas\tools\tile.py", line 113, in cut
    include_lowest=include_lowest)
  File "C:\Python27\lib\site-packages\pandas\tools\tile.py", line 203, in _bins_to_cuts
    include_lowest=include_lowest)
  File "C:\Python27\lib\site-packages\pandas\tools\tile.py", line 252, in _format_levels
    levels[0] = '[' + levels[0][1:]
IndexError: list index out of range

Expected output:

1   [10, 18]  10.18
2   [18, 35]  25.16
3   [35, 50]  44.48
4   [>= 75] 85.24 #however >=75 can be represented
5   [35, 50]  36.71
6   [>= 75] 77.09
7   [>= 75] 81.88
8   [18, 35]  22.92
9   [35, 50]  44.31
10  [10, 18]  15.79
like image 593
pam Avatar asked May 13 '16 17:05

pam


People also ask

How do I group values in a column in pandas?

Groupby is a very powerful pandas method. You can group by one column and count the values of another column per this column value using value_counts. Using groupby and value_counts we can count the number of activities each person did.

What does cut do in pandas?

cut() method in Python. Pandas cut() function is used to separate the array elements into different bins . The cut function is mainly used to perform statistical analysis on scalar data.

How do you split data into bins in Python?

Use pd. cut() for binning data based on the range of possible values. Use pd. qcut() for binning data based on the actual distribution of values.

How do you do Groupby in pandas?

The Hello, World! of pandas GroupBy You call . groupby() and pass the name of the column that you want to group on, which is "state" . Then, you use ["last_name"] to specify the columns on which you want to perform the actual aggregation. You can pass a lot more than just a single column name to .


1 Answers

Starting with:

df:

     val1  val2
0     NaN    10
1   10.18     1
2   25.16     1
3   44.48     1
4   85.24     1
5   36.71     1
6   77.09     1
7   81.88     1
8   22.92     1
9   44.31     1
10  15.79     1

and

xml = """
<metaGroups>
    <Groups>
        <GroupID>age</GroupID>
        <description>age</description>
        <groupname>
            <Name>0 - &lt;10</Name>
            <min>0</min>
            <minInc>TRUE</minInc>
            <max>10</max>
            <maxInc>FALSE</maxInc>
        </groupname>
        <groupname>
            <Name>10 - &lt;18</Name>
            <min>10</min>
            <minInc>TRUE</minInc>
            <max>18</max>
            <maxInc>FALSE</maxInc>
        </groupname>
        <groupname>
            <Name>18 - &lt;35</Name>
            <min>18</min>
            <minInc>TRUE</minInc>
            <max>35</max>
            <maxInc>FALSE</maxInc>
        </groupname>
        <groupname>
            <Name>35 - &lt;50</Name>
            <min>35</min>
            <minInc>TRUE</minInc>
            <max>50</max>
            <maxInc>FALSE</maxInc>
        </groupname>
        <groupname>
            <Name>50 - &lt;65</Name>
            <min>50</min>
            <minInc>TRUE</minInc>
            <max>65</max>
            <maxInc>FALSE</maxInc>
        </groupname>
        <groupname>
            <Name>65 - &lt;75</Name>
            <min>65</min>
            <minInc>TRUE</minInc>
            <max>75</max>
            <maxInc>FALSE</maxInc>
        </groupname>
        <groupname>
            <Name>&amp;ge;75</Name>
            <min>75</min>
            <minInc>TRUE</minInc>
        </groupname>
    </Groups>
</metaGroups>
"""

You can use BeautifulSoup to extract the bin parameters, construct the labels and apply pd.cut():

from bs4 import BeautifulSoup as Soup
from itertools import chain

soup = Soup(xml, 'html.parser')

bins = []
for message in soup.findAll('groupname'):
    min = message.find('min').text
    try:
        max = message.find('max').text
        bins.append([min, max])
    except:
        bins.append([min]) # For max bin

at which point we have

bins

[['0', '10'], ['10', '18'], ['18', '35'], ['35', '50'], ['50', '65'], ['65', '75'], ['75']]

Next, we'll flatten the list of list, get rid of duplicates and add an upper bound:

labels = bins
bins = list(np.unique(np.fromiter(chain.from_iterable(bins), dtype='int')))
last = bins[-1]
bins.append(int(df.val1.max() + 1))

which yields:

[0, 10, 18, 35, 50, 65, 75, 86]

Constructing the labels:

labels = ['[{0} - {1}]'.format(label[0], label[1]) if len(label) > 1 else '[ > {} ]'.format(label[0]) for label in labels]

and using pd.cut():

df['binned'] = pd.cut(df.val1, bins=bins, labels=labels)

produces:

     val1  val2     binned
1   10.18     1  [10 - 18]
2   25.16     1  [18 - 35]
3   44.48     1  [35 - 50]
4   85.24     1    [>= 75]
5   36.71     1  [35 - 50]
6   77.09     1    [>= 75]
7   81.88     1    [>= 75]
8   22.92     1  [18 - 35]
9   44.31     1  [35 - 50]
10  15.79     1  [10 - 18]
like image 96
Stefan Avatar answered Sep 28 '22 18:09

Stefan