It seems there is no way to load input tables (from html / xls / etc. files) into DataFrame objects exactly as they are, one-to-one, without pandas applying any field conversions internally.
Assume the following HTML table is saved in a file with the .xls extension. How can we get the same representation of this table in memory as a DataFrame object?
The content of "test_file.xls":
<body>
<table>
<thead>
<tr>
<th class="tabHead" x:autofilter="all">Number</th>
</tr>
</thead>
<tbody>
<tr>
<td class="tDetail">1.320,00</td>
</tr>
<tr>
<td class="tDetail">600,00</td>
</tr>
</tbody>
</table>
</body>
(1) Straightforward reading of the file
Processing code:
import pandas
df = pandas.read_html('test_file.xls')
print(df[0])
print(df[0].dtypes)
Output:
Number
0 1.32
1 60000.00
Number float64
dtype: object
As we can see, the numbers were converted to float64 by some predefined logic. I think this logic involves locale settings, maybe some rules inside pandas, etc. Specifying string converters directly doesn't recover the initial values either.
(2) Applying the str function as a converter for each column
Processing code:
converters = {column_name: str for column_name in df[0].dtypes.index}
df = pandas.read_html('test_file.xls', converters=converters)
print(df[0])
print(df[0].dtypes)
Output:
Number
0 1.32000
1 60000
Number object
dtype: object
Obviously, the expected output of this problem is:
Number
0 1.320,00
1 600,00
There could be cases where one file contains numbers typed in different formats (American / European / etc.). These numbers differ in their decimal mark, thousands mark, etc. So the logical way to handle such files would be to extract the data "as is" as strings and parse each row separately with regexps / other modules. Is there a way to do this in pandas? And are there any other approaches to processing such files? Thanks guys!
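To make the row-by-row idea concrete, here is a minimal sketch of parsing one raw string at a time. The function name parse_number and its heuristic are assumptions made for illustration, not something pandas provides: when both marks occur, the last one is taken as the decimal mark, and a lone mark followed by exactly three digits is treated as a thousands mark. Strings such as "1.320" are genuinely ambiguous, so the guess can be wrong.

```python
import re
from decimal import Decimal


def parse_number(text):
    """Guess the separators of a single numeric string and return a Decimal.

    Heuristic (an assumption, not a pandas feature):
    - a mark that occurs more than once is a thousands mark;
    - a lone mark followed by exactly three digits is a thousands mark;
    - otherwise the last mark in the string is the decimal mark.
    """
    s = text.strip().replace(" ", "")
    if not re.fullmatch(r"[+-]?\d[\d.,]*", s):
        raise ValueError(f"not a plain number: {text!r}")

    seps = [c for c in s if c in ".,"]
    if not seps:
        return Decimal(s)

    last = seps[-1]
    other = "," if last == "." else "."
    tail = s.rsplit(last, 1)[1]

    # No decimal part at all: every mark is a thousands mark.
    if seps.count(last) > 1 or (other not in seps and len(tail) == 3):
        return Decimal(s.replace(".", "").replace(",", ""))

    # The last mark is the decimal mark; the other one groups thousands.
    whole, frac = s.rsplit(last, 1)
    return Decimal(whole.replace(other, "") + "." + frac)


for raw in ["1.320,00", "600,00", "1,320.00", "1.000.000"]:
    print(raw, "->", parse_number(raw))
```

Returning Decimal rather than float keeps the digits exactly as typed; columns that hold dates or other non-numeric text would need to be routed to a different parser before this function is called.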
Remarks: Specifying the "decimal" and "thousands" parameters for pandas.read_* doesn't look like a reliable solution because they are applied to all fields. Quick example: it can treat date fields in "02.2017" format as numbers and convert them to "022017".
You must specify your thousands and decimal separators. This worked for me:
import pandas as pd
html = """
<body>
<table>
<thead>
<tr>
<th class="tabHead" x:autofilter="all">Number</th>
</tr>
</thead>
<tbody>
<tr>
<td class="tDetail">1.320,00</td>
</tr>
<tr>
<td class="tDetail">600,00</td>
</tr>
</tbody>
</table>
</body>
"""
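from io import StringIO  # recent pandas versions deprecate passing a raw HTML string directly
df = pd.read_html(StringIO(html), decimal=",", thousands=".")
print(df[0])  # read_html returns a list of DataFrames, one per table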
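Note that decimal and thousands apply to the whole table. If the untouched cell text is needed instead, one option that sidesteps pandas' conversion entirely is to pull the strings out with the standard library's html.parser and build the DataFrame afterwards. The sketch below is only an illustration under simple assumptions (one flat table, no nested tables, no colspan/rowspan); CellTextParser and read_cells_as_text are hypothetical names, not part of pandas.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, row by row, as plain strings."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built
        self._cell = None  # text chunks of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def read_cells_as_text(html):
    """Return all table rows of the document as lists of unconverted strings."""
    parser = CellTextParser()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = """
<table>
<thead><tr><th class="tabHead" x:autofilter="all">Number</th></tr></thead>
<tbody>
<tr><td class="tDetail">1.320,00</td></tr>
<tr><td class="tDetail">600,00</td></tr>
</tbody>
</table>
"""

rows = read_cells_as_text(sample)
header, body = rows[0], rows[1:]
print(header)
print(body)
```

Passing the result on as pandas.DataFrame(body, columns=header) gives an object column holding the original strings, which can then be parsed column by column or row by row as required.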