
Pandas converting numbers to strings - unexpected results

It seems there is no way to load input tables (from html / xls / etc. files) into DataFrame objects exactly as they are, 1-to-1, without pandas applying field conversions internally.

Assume the following HTML table is saved in a file with the .xls extension. How would we get the same representation of this table in Python memory as a DataFrame object?

The content of "test_file.xls":

<body>
    <table>
        <thead>
            <tr>
                <th class="tabHead" x:autofilter="all">Number</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td class="tDetail">1.320,00</td>
            </tr>
            <tr>
                <td class="tDetail">600,00</td>
            </tr>
        </tbody>
    </table>
</body>

(1) Straightforward reading of the file

Processing code:

import pandas

df = pandas.read_html('test_file.xls')
print(df[0])
print(df[0].dtypes)

Output:

     Number
0      1.32
1  60000.00

Number    float64
dtype: object

As we can see, the numbers were converted to float64 by some predefined logic. I think this logic involves locale settings, maybe some rules inside pandas, etc. Specifying string converters directly does not recover the original values either.
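One plausible reading of these outputs, assuming pandas.read_html ran with its documented defaults thousands="," and decimal=".": the comma is dropped as a thousands separator before the remaining text is parsed as a float. The sketch below imitates that in plain Python. `strip_thousands_then_float` is a hypothetical helper written for illustration, not a pandas function.

```python
def strip_thousands_then_float(text, thousands=",", decimal="."):
    """Hypothetical helper imitating separator handling; not pandas code."""
    # Drop every thousands separator first.
    cleaned = text.replace(thousands, "")
    # Normalise a non-default decimal mark to "." so float() accepts it.
    if decimal != ".":
        cleaned = cleaned.replace(decimal, ".")
    return cleaned, float(cleaned)


for raw in ["1.320,00", "600,00"]:
    cleaned, value = strip_thousands_then_float(raw)
    print(raw, "->", cleaned, "->", value)
```

Under that assumption "1.320,00" becomes "1.32000" and "600,00" becomes "60000", which would account for both the float values above and the strings seen in approach (2) below.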

(2) Applying the str function as a converter for each column

Processing code:

# Reuses df from step (1) to collect the column names
converters = {column_name: str for column_name in df[0].dtypes.index}
df = pandas.read_html('test_file.xls', converters=converters)
print(df[0])
print(df[0].dtypes)

Output:

    Number
0  1.32000
1    60000

Number    object
dtype: object

The output I expect is:

     Number
0  1.320,00
1    600,00

A single file may contain numbers typed in different formats (American / European / etc.). These numbers differ in their decimal mark, thousands separator, etc. So the logical way to handle such files is to extract the data "as is" as strings and then parse each row separately with regexps or other modules. Is there a way to do this in pandas? And are there other approaches to processing such files? Thanks guys!
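As a rough sketch of the "extract as strings" idea without pandas, the standard library's html.parser can collect each cell's text untouched. `RawTableParser` is a hypothetical class written for this example. It assumes one simple, non-nested table, and the HTML is inlined here only to keep the example self-contained.

```python
from html.parser import HTMLParser


class RawTableParser(HTMLParser):
    """Collects the text of every <th>/<td> cell as an unmodified string."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built
        self._cell = None  # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


html = """
<table>
  <thead><tr><th class="tabHead" x:autofilter="all">Number</th></tr></thead>
  <tbody>
    <tr><td class="tDetail">1.320,00</td></tr>
    <tr><td class="tDetail">600,00</td></tr>
  </tbody>
</table>
"""

parser = RawTableParser()
parser.feed(html)
parser.close()

header, *body = parser.rows
print(header)
print(body)
```

If a DataFrame is still wanted, `header` and `body` can be handed to the DataFrame constructor (for example `pandas.DataFrame(body, columns=header)`), which does not parse strings into numbers.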

Remarks: Specifying the "decimal" and "thousands" parameters for pandas.read_* doesn't look like a reliable solution because they are applied to all fields. Quick example: a date field in "02.2017" format can be treated as a number and converted to "022017".
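One way around the "applied to all fields" problem, sketched under the assumption that the cells are already available as raw strings: choose a parser per column and leave every other column untouched. `parse_european`, `convert_rows`, and the "Period" column are hypothetical names invented for this illustration.

```python
import re
from decimal import Decimal

# European style: "." groups thousands, "," marks decimals, e.g. 1.320,00
EUROPEAN_NUMBER = re.compile(r"(?:\d{1,3}(?:\.\d{3})+|\d+)(?:,\d+)?")


def parse_european(text):
    """Return a Decimal for European-style numbers, else the text unchanged."""
    text = text.strip()
    if EUROPEAN_NUMBER.fullmatch(text):
        return Decimal(text.replace(".", "").replace(",", "."))
    return text


def convert_rows(rows, header, parsers):
    """Apply parsers[column] to that column only; other columns stay strings."""
    return [
        [parsers.get(name, str)(cell) for name, cell in zip(header, row)]
        for row in rows
    ]


header = ["Number", "Period"]  # "Period" is an invented column for the demo
rows = [["1.320,00", "02.2017"], ["600,00", "11.2017"]]

converted = convert_rows(rows, header, {"Number": parse_european})
print(converted)
```

The regular expression acts as a second guard: a value such as "02.2017" does not fit the grouping pattern, so `parse_european` would return it unchanged even if it were applied to that column. Files mixing American and European numbers in one column remain ambiguous and would need extra rules.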

Pleeea asked Nov 16 '17 11:11



1 Answer

You must specify your thousands and decimal separators. This worked for me:

import pandas as pd

html = """
<body>
    <table>
        <thead>
            <tr>
                <th class="tabHead" x:autofilter="all">Number</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td class="tDetail">1.320,00</td>
            </tr>
            <tr>
                <td class="tDetail">600,00</td>
            </tr>
        </tbody>
    </table>
</body>
"""

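# read_html returns a list of DataFrames, one per <table> found.
# pandas 2.1+ deprecates passing literal HTML; wrap it in io.StringIO(html) there.
df = pd.read_html(html, decimal=",", thousands=".")
print(df[0])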
bravhek answered Oct 03 '22 14:10
