Pandas converting numbers to strings - unexpected results

Question

There seems to be no way to load input tables (from HTML, XLS and similar files) into DataFrame objects exactly as they are, without pandas applying field conversions internally.

Take the following HTML table, saved in a file with the .xls extension. How can we get the same representation of this table in memory as a DataFrame object?

The content of "test_file.xls":

<body>
    <table>
        <thead>
            <tr>
                <th class="tabHead" x:autofilter="all">Number</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td class="tDetail">1.320,00</td>
            </tr>
            <tr>
                <td class="tDetail">600,00</td>
            </tr>
        </tbody>
    </table>
</body>

(1) Straightforward reading of the file

Processing code:

import pandas

df = pandas.read_html('test_file.xls')
print(df[0])
print(df[0].dtypes)

Output:

     Number
0      1.32
1  60000.00

Number    float64
dtype: object

As we can see, the numbers were converted to float64 by some predefined logic. I think this logic involves locale settings, maybe some rules inside pandas, etc. Specifying string converters directly doesn't recover the initial values either.

(2) Applying the str function as a converter for each column
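
One possible explanation, assuming the documented default of thousands=',' in pandas.read_html is in effect: every comma is removed as a thousands separator before the text is converted to a number. The sketch below only imitates that step in plain Python (strip_thousands is a hypothetical helper, not a pandas function), and it reproduces the values printed above.

```python
# Assumption: the default thousands separator (",") is stripped from each
# cell before numeric conversion. This imitates that step; it is not pandas code.
def strip_thousands(cell, thousands=","):
    """Remove every occurrence of the thousands separator from a cell."""
    return cell.replace(thousands, "")


for cell in ["1.320,00", "600,00"]:
    stripped = strip_thousands(cell)
    # Show the original text, the stripped text, and the resulting float.
    print(cell, "->", stripped, "->", float(stripped))
```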

Processing code:

# Reuse the column names from the first read and force every column through str.
converters = {column_name: str for column_name in df[0].dtypes.index}
df = pandas.read_html('test_file.xls', converters=converters)
print(df[0])
print(df[0].dtypes)

Output:

    Number
0  1.32000
1    60000

Number    object
dtype: object

Obviously, the expected output of this problem is:

     Number
0  1.320,00
1    600,00

There could be cases where one file contains numbers typed in different formats (American, European, etc.). These numbers differ in decimal mark, thousands mark and so on. So the logical way to handle such files would be to extract the data as-is into strings and parse each row separately with regexps or other modules. Is there a way to do this in pandas? And are there other approaches to processing such files? Thanks guys!

Remarks: Specifying the "decimal" and "thousands" parameters of pandas.read_* doesn't look like a reliable solution because they are applied to all fields. Quick example: it can treat date fields in "02.2017" format as numbers and convert them to "022017".
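
The per-value parsing described above could be sketched roughly as follows. This is an illustration rather than a pandas feature: parse_number and both patterns are hypothetical, and the sketch assumes that every real number carries one or two decimal digits, which is what lets it leave a value such as "02.2017" alone.

```python
import re
from decimal import Decimal

# Assumption: numbers always end in a decimal mark plus one or two digits.
# European style: "." groups thousands, "," marks decimals ("1.320,00").
_EUROPEAN = re.compile(r"[+-]?(\d{1,3}(\.\d{3})+|\d+),\d{1,2}")
# American style: "," groups thousands, "." marks decimals ("1,320.00").
_AMERICAN = re.compile(r"[+-]?(\d{1,3}(,\d{3})+|\d+)\.\d{1,2}")


def parse_number(text):
    """Return a Decimal for a recognised number string, otherwise None."""
    text = text.strip()
    if _EUROPEAN.fullmatch(text):
        # Drop the thousands dots, then turn the decimal comma into a dot.
        return Decimal(text.replace(".", "").replace(",", "."))
    if _AMERICAN.fullmatch(text):
        # Drop the thousands commas; the decimal dot is already in place.
        return Decimal(text.replace(",", ""))
    return None  # not recognised, so the caller keeps the original string


for raw in ["1.320,00", "600,00", "1,320.00", "02.2017"]:
    print(repr(raw), "->", parse_number(raw))
```

Values that return None, such as "02.2017", stay as strings and can be handled by a separate rule.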

bravhek · Accepted Answer

You must specify your thousands and decimal separators. This worked for me:

import pandas as pd

html = """
<body>
    <table>
        <thead>
            <tr>
                <th class="tabHead" x:autofilter="all">Number</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td class="tDetail">1.320,00</td>
            </tr>
            <tr>
                <td class="tDetail">600,00</td>
            </tr>
        </tbody>
    </table>
</body>
"""

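# Newer pandas versions expect a file-like object rather than a literal HTML string.
from io import StringIO

# read_html returns a list of DataFrames, one per table found.
df = pd.read_html(StringIO(html), decimal=",", thousands=".")
print(df[0])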
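
Note that decimal and thousands apply to every column of the table. If the original cell text is needed instead, one alternative is to bypass read_html and collect the cells with the standard library's html.parser. A minimal sketch, assuming a simple table without nested tables, rowspan or colspan; CellTextParser and read_cells are hypothetical names:

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, row by row, unchanged."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside of a cell (for example whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def read_cells(html_text):
    """Return the table cells as a list of rows of strings."""
    parser = CellTextParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows


sample = """<table>
<thead><tr><th class="tabHead" x:autofilter="all">Number</th></tr></thead>
<tbody><tr><td class="tDetail">1.320,00</td></tr>
<tr><td class="tDetail">600,00</td></tr></tbody></table>"""

rows = read_cells(sample)
print(rows)  # [['Number'], ['1.320,00'], ['600,00']]
```

The resulting rows can then be passed to pandas.DataFrame(rows[1:], columns=rows[0]), which keeps the cells as strings.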
