
pandas.read_html does not support decimal comma

I was reading an xlm file using pandas.read_html and it works almost perfectly. The problem is that the file uses commas as decimal separators instead of dots (the default in read_html).

I could easily replace the commas with dots in one file, but I have almost 200 files with that configuration. With pandas.read_csv you can define the decimal separator, but I don't know why pandas.read_html only lets you define the thousands separator.

Any guidance on this? Is there another way to automate the comma/dot replacement before the file is opened by pandas? Thanks in advance!

Pablo asked Sep 09 '16 13:09


4 Answers

This did not start working for me until I used both decimal=',' and thousands='.'

Pandas version: 0.23.4

So try passing both decimal and thousands, e.g.: pd.read_html(io="http://example.com", decimal=',', thousands='.')

Before, I only passed decimal=',' and the number columns were stored as type str with the comma simply dropped (weird behaviour). For example, "0,7" became "07" and "1,9" became "19".

The values are still stored in the dataframe as type str, but at least I don't have to put in the dots manually. The numbers are now displayed correctly: 0,7 -> "0.7"
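As a rough illustration of passing both separators, the sketch below wraps the call in a small helper and defines a tiny in-memory table to try it on. The HTML string and the name read_european_tables are made up for this example. It assumes a recent pandas with an HTML parser such as lxml installed; older releases such as 0.19 or 0.23 may behave differently, as described in this thread.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample table using '.' for thousands and ',' for decimals.
SAMPLE_HTML = """
<table>
  <tr><th>item</th><th>price</th></tr>
  <tr><td>a</td><td>1.401,40</td></tr>
  <tr><td>b</td><td>0,7</td></tr>
</table>
"""


def read_european_tables(html):
    """Parse every <table> in `html`, treating ',' as the decimal mark."""
    # Wrap the literal HTML string in a file-like object before parsing.
    return pd.read_html(StringIO(html), decimal=",", thousands=".")
```

read_html returns a list of DataFrames, one per table, so read_european_tables(SAMPLE_HTML)[0] would be the table above. Checking its dtypes is a quick way to see whether the price column was parsed as numbers or left as text on a given pandas version.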

sigurdb answered Oct 14 '22 00:10


I am using pandas 0.19 but it still fails to convert the numbers correctly.

For example:

a = pd.read_html(r.text, thousands='.', decimal=',')

reads the value "1.401,40" in a table cell as 140140 (float) instead of 1401.40.

I use a solution similar to the one from 'Pablo A', just correcting for NaN values:

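import numpy as np
import pandas as pd

def to_numeric_comma(series):
    # Drop thousands dots, then turn the decimal comma into a dot
    new = series.apply(lambda x: str(x).replace('.', '').replace(',', '.'))
    # str(NaN) is 'nan'; map it back to a real NaN before converting
    return pd.to_numeric(new.replace('nan', np.nan))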
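
A vectorized variant of the same idea might look like the sketch below. The name parse_european_numbers and the sample values are assumptions for illustration, and the regex=False argument needs pandas 0.23 or later. Note that errors="coerce" silently turns anything unparseable into NaN, which may or may not be what you want.

```python
import numpy as np
import pandas as pd


def parse_european_numbers(series):
    """Convert strings like '1.401,40' to floats; unparseable values become NaN."""
    cleaned = (
        series.astype(str)
        .str.replace(".", "", regex=False)   # drop thousands separators
        .str.replace(",", ".", regex=False)  # decimal comma -> decimal dot
    )
    # errors="coerce" maps 'nan' and any other non-numeric text to NaN.
    return pd.to_numeric(cleaned, errors="coerce")


# Hypothetical sample data, including a missing value.
raw = pd.Series(["1.401,40", "0,7", np.nan])
parsed = parse_european_numbers(raw)
```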

pedrovgp answered Oct 03 '22 12:10


Looking at the source code of read_html

def read_html(io, match='.+', flavor=None, header=None, index_col=None,
              skiprows=None, attrs=None, parse_dates=False,
              tupleize_cols=False, thousands=',', encoding=None,
              decimal='.', converters=None, na_values=None,
              keep_default_na=True):

The function signature shows that a decimal separator can be passed in the call.

Further down, the docstring indicates that this parameter was added in version 0.19 (so only in fairly recent releases). Can you upgrade your pandas?

decimal : str, default '.'
    Character to recognize as decimal point (e.g. use ',' for European data).

    .. versionadded:: 0.19.0

zhqiat answered Oct 14 '22 02:10


Thanks @zhqiat. I think upgrading pandas to version 0.19 will solve the problem. Unfortunately, I couldn't find an easy way to do that. I found a tutorial for upgrading pandas, but it was for Ubuntu (I am a WinXP user).

I finally chose the workaround, using the method posted here, basically converting all columns, one by one, to a numeric type of pandas.Series

result[col] = result[col].str.replace(".", "", regex=False).str.replace(",", ".", regex=False)

I know this solution isn't the best, but it works. Thanks
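
For the column-by-column workaround, a minimal sketch could loop over the non-numeric columns of a table and keep the conversion only where every value parses. The function name convert_decimal_commas and the sample frame are hypothetical, and the approach assumes '.' is used only as a thousands separator in the affected columns.

```python
import pandas as pd


def convert_decimal_commas(df):
    """Return a copy of `df` with European-formatted text columns made numeric."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            continue  # already parsed as numbers
        cleaned = (
            out[col].astype(str)
            .str.replace(".", "", regex=False)   # drop thousands separators
            .str.replace(",", ".", regex=False)  # decimal comma -> decimal dot
        )
        numeric = pd.to_numeric(cleaned, errors="coerce")
        # Replace the column only if every non-missing value converted,
        # so genuine text columns are left untouched.
        if numeric.notna().sum() == out[col].notna().sum():
            out[col] = numeric
    return out


# Hypothetical table as it might come back with numbers stored as text.
result = pd.DataFrame(
    {"item": ["a", "b"], "price": ["1.401,40", "0,7"], "qty": [1, 2]}
)
converted = convert_decimal_commas(result)
```

The same function could be applied to each table of each file in a loop. One caveat: text that becomes all digits once the dots are removed, such as dates written like 01.02.2016, would also be converted, so it may be safer to pass an explicit list of columns.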

Pablo answered Oct 14 '22 00:10