pandas.read_html does not support decimal comma

Question

I am reading an xlm file with pandas.read_html and it works almost perfectly. The problem is that the file uses commas as decimal separators instead of dots (the default in read_html).

I could easily replace the commas with dots in one file, but I have almost 200 files with that configuration. With pandas.read_csv you can define the decimal separator, but for some reason pandas.read_html only lets you define the thousands separator.

Any guidance on this? Is there another way to automate the comma/dot replacement before the file is opened by pandas? Thanks in advance!

sigurdb · Accepted Answer

This did not start working for me until I used both decimal=',' and thousands='.'.

Pandas version: 0.23.4

So try passing both decimal and thousands, e.g.: pd.read_html(io="http://example.com", decimal=',', thousands='.')

Before, I only passed decimal=',' and the number columns were saved as type str with the comma simply dropped (weird behaviour). For example, "0,7" became "07" and "1,9" became "19".

The values are still saved in the dataframe as type str, but at least I don't have to put in the dots manually. The numbers are displayed correctly: 0,7 -> "0.7".
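
Since the question mentions almost 200 files, here is a minimal sketch of applying the same two arguments in a loop. The function name and the "reports" folder are hypothetical, and it assumes a pandas version where read_html accepts decimal (0.19 or later, according to the docstring quoted in another answer) and that an HTML parser such as lxml is installed.

```python
from pathlib import Path

import pandas as pd


def read_decimal_comma_tables(paths):
    """Return {file name: list of DataFrames} for each HTML file in `paths`."""
    tables = {}
    for path in paths:
        # decimal=',' and thousands='.' are passed together, as described above.
        tables[Path(path).name] = pd.read_html(str(path), decimal=",", thousands=".")
    return tables


# Hypothetical usage: every *.html file in a "reports" folder.
# all_tables = read_decimal_comma_tables(sorted(Path("reports").glob("*.html")))
```

Each value is the list of tables read_html found in that file, so a file with one table still needs an index such as [0] to get the DataFrame.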

pedrovgp · Answer

I am using pandas 0.19, but it still fails to convert the numbers correctly.

For example:

a = pd.read_html(r.text, thousands='.', decimal=',')

will recognize the value "1.401,40" in a table cell as 140140 (float).

I use a solution similar to the one from 'Pablo A', just correcting for NaN values:

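import numpy as np
import pandas as pd

def to_numeric_comma(series):
    # Drop the '.' thousands separators, then turn the decimal comma into a dot.
    new = series.apply(lambda x: str(x).replace('.', '').replace(',', '.'))
    # str(NaN) is 'nan'; map it back to a real NaN before converting.
    return pd.to_numeric(new.replace('nan', np.nan))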
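
To run this kind of conversion over a whole table instead of one column at a time, something like the following might work. It is only a sketch: both function names are made up for illustration, and it assumes every text column that should become numeric uses '.' for thousands and ',' for decimals. A column is replaced only if all of its non-missing cells parse, so plain text columns are left alone.

```python
import pandas as pd


def comma_to_float(series):
    """Parse strings such as '1.401,40' as 1401.40; unparseable cells become NaN."""
    cleaned = (
        series.astype(str)
        .str.replace(".", "", regex=False)   # drop thousands separators
        .str.replace(",", ".", regex=False)  # decimal comma -> decimal point
    )
    return pd.to_numeric(cleaned, errors="coerce")


def convert_decimal_comma_columns(df):
    """Return a copy of `df` with fully parseable text columns converted to floats."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            continue  # already numeric, nothing to do
        converted = comma_to_float(out[col])
        # Keep the conversion only if no non-missing cell was lost to NaN.
        if converted.notna().sum() == out[col].notna().sum():
            out[col] = converted
    return out


example = pd.DataFrame({"name": ["a", "b", "c"], "price": ["1.401,40", "0,7", None]})
converted = convert_decimal_comma_columns(example)
print(converted.dtypes)
```

Note that errors="coerce" silently turns anything unparseable into NaN, which is why the all-cells-parsed check is there.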

zhqiat · Answer

Looking at the source code of read_html

def read_html(io, match='.+', flavor=None, header=None, index_col=None,
              skiprows=None, attrs=None, parse_dates=False,
              tupleize_cols=False, thousands=',', encoding=None,
              decimal='.', converters=None, na_values=None,
              keep_default_na=True):

The function signature shows that a decimal separator parameter is available in the function call.

Further down, the documentation indicates that this parameter was added in version 0.19 (so a bit further along, on the experimental branch). Can you upgrade your pandas?

decimal : str, default '.'
    Character to recognize as decimal point (e.g. use ',' for European
    data).

    .. versionadded:: 0.19.0
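
To find out whether an installed pandas already has the parameter, a quick check along these lines may help. It is a sketch: version_at_least is a hypothetical helper that only looks at the leading "major.minor" part of the version string, and the signature lookup is just another way of asking the same question.

```python
import inspect

import pandas as pd


def version_at_least(version, minimum=(0, 19)):
    """Compare the leading 'major.minor' of a version string with `minimum`."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= minimum


# Feature detection: does read_html expose a `decimal` argument at all?
has_decimal = "decimal" in inspect.signature(pd.read_html).parameters
print(pd.__version__, version_at_least(pd.__version__), has_decimal)
```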

Pablo · Answer

Thanks @zhqiat. I think upgrading pandas to version 0.19 would solve the problem. Unfortunately, I couldn't find an easy way to do that: the only tutorial I found for upgrading pandas is for Ubuntu, and I am a Windows XP user.

I finally chose the workaround, using the method posted here: basically converting all columns, one by one, to a numeric pandas.Series.

result[col] = result[col].apply(lambda x: x.replace(".", "").replace(",", "."))

I know this solution isn't the best, but it works. Thanks
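
The question also asked about doing the replacement before pandas opens the file. One possible approach, sketched here with only the standard re module, is to rewrite the raw text first. The pattern and function names are assumptions for illustration. It rewrites every token in the text that looks like a decimal-comma number (including ones outside table cells, or things like "3,4" in prose), so it is worth checking against a real file before trusting it.

```python
import re

# A number with an optional '.'-grouped integer part and a ',' decimal part,
# not glued to other digits, dots or commas (so "1,2,3" is left alone).
_DECIMAL_COMMA = re.compile(r"(?<![\d.,])(?:\d{1,3}(?:\.\d{3})+|\d+),\d+(?![\d.,])")


def _to_dot_notation(match):
    # "1.401,40" -> "1401.40"
    return match.group(0).replace(".", "").replace(",", ".")


def normalize_decimal_commas(text):
    """Rewrite decimal-comma numbers in `text` to plain dot notation."""
    return _DECIMAL_COMMA.sub(_to_dot_notation, text)


html = "<td>1.401,40</td><td>0,7</td><td>a, b</td>"
print(normalize_decimal_commas(html))
```

The rewritten text can then be handed to read_html with its default separators.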

Tags: python, pandas, decimal, xlm
