
Converting an HTML table into a pandas DataFrame

I am reading an HTML table with pd.read_html, but the result comes back as a list. I want to convert it into a pandas DataFrame so that I can carry out further operations on it. I am using the following script:

import pandas as pd
import html5lib  # not called directly; read_html can use it as a parser backend

data = pd.read_html('http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2', skiprows=1)

Since my results come back as a single list, I tried to convert it into a DataFrame with

data1 = pd.DataFrame(data)

and the result came out as 0:

0       0                       1     2    3    4...

Because the result is a list, I can't apply DataFrame methods such as rename, dropna, or drop.

I would appreciate any help.

Manu Sharma asked Aug 24 '16 10:08



2 Answers

I think you need to add [0] to select the first item of the list, because read_html returns a list of DataFrames.

So you can use:

import pandas as pd

data1 = pd.read_html('http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2', skiprows=1)[0]
print(data1)

     0                       1     2    3    4    5    6    7    8      9   \
0    RK                  PLAYER  TEAM   GP    G    A  PTS  +/-  PIM  PTS/G   
1     1          Jamie Benn, LW   DAL   82   35   52   87    1   64   1.06   
2     2         John Tavares, C   NYI   82   38   48   86    5   46   1.05   
3     3        Sidney Crosby, C   PIT   77   28   56   84    5   47   1.09   
4     4       Alex Ovechkin, LW   WSH   81   53   28   81   10   58   1.00   
5   NaN       Jakub Voracek, RW   PHI   82   22   59   81    1   78   0.99   
6     6    Nicklas Backstrom, C   WSH   82   18   60   78    5   40   0.95   
7     7         Tyler Seguin, C   DAL   71   37   40   77   -1   20   1.08   
8     8         Jiri Hudler, LW   CGY   78   31   45   76   17   14   0.97   
9   NaN        Daniel Sedin, LW   VAN   82   20   56   76    5   18   0.93   
10   10  Vladimir Tarasenko, RW   STL   77   37   36   73   27   31   0.95   
11  NaN                      PP    SH  NaN  NaN  NaN  NaN  NaN  NaN    NaN   
12   RK                  PLAYER  TEAM   GP    G    A  PTS  +/-  PIM  PTS/G   
13  NaN        Nick Foligno, LW   CBJ   79   31   42   73   16   50   0.92   
14  NaN        Claude Giroux, C   PHI   81   25   48   73   -3   36   0.90   
15  NaN         Henrik Sedin, C   VAN   82   18   55   73   11   22   0.89   
16   14       Steven Stamkos, C    TB   82   43   29   72    2   49   0.88   
17  NaN        Tyler Johnson, C    TB   77   29   43   72   33   24   0.94   
18   16        Ryan Johansen, C   CBJ   82   26   45   71   -6   40   0.87   
19   17         Joe Pavelski, C    SJ   82   37   33   70   12   29   0.85   
20  NaN        Evgeni Malkin, C   PIT   69   28   42   70   -2   60   1.01   
21  NaN         Ryan Getzlaf, C   ANA   77   25   45   70   15   62   0.91   
22   20           Rick Nash, LW   NYR   79   42   27   69   29   36   0.87   
23  NaN                      PP    SH  NaN  NaN  NaN  NaN  NaN  NaN    NaN   
24   RK                  PLAYER  TEAM   GP    G    A  PTS  +/-  PIM  PTS/G   
25   21      Max Pacioretty, LW   MTL   80   37   30   67   38   32   0.84   
26  NaN        Logan Couture, C    SJ   82   27   40   67   -6   12   0.82   
27   23       Jonathan Toews, C   CHI   81   28   38   66   30   36   0.81   
28  NaN        Erik Karlsson, D   OTT   82   21   45   66    7   42   0.80   
29  NaN   Henrik Zetterberg, LW   DET   77   17   49   66   -6   32   0.86   
30   26        Pavel Datsyuk, C   DET   63   26   39   65   12    8   1.03   
31  NaN         Joe Thornton, C    SJ   78   16   49   65   -4   30   0.83   
32   28     Nikita Kucherov, RW    TB   82   28   36   64   38   37   0.78   
33  NaN        Patrick Kane, RW   CHI   61   27   37   64   10   10   1.05   
34  NaN          Mark Stone, RW   OTT   80   26   38   64   21   14   0.80   
35  NaN                      PP    SH  NaN  NaN  NaN  NaN  NaN  NaN    NaN   
36   RK                  PLAYER  TEAM   GP    G    A  PTS  +/-  PIM  PTS/G   
37  NaN     Alexander Steen, LW   STL   74   24   40   64    8   33   0.86   
38  NaN          Kyle Turris, C   OTT   82   24   40   64    5   36   0.78   
39  NaN     Johnny Gaudreau, LW   CGY   80   24   40   64   11   14   0.80   
40  NaN         Anze Kopitar, C    LA   79   16   48   64   -2   10   0.81   
41   35        Radim Vrbata, RW   VAN   79   31   32   63    6   20   0.80   
42  NaN      Jaden Schwartz, LW   STL   75   28   35   63   13   16   0.84   
43  NaN       Filip Forsberg, C   NSH   82   26   37   63   15   24   0.77   
44  NaN       Jordan Eberle, RW   EDM   81   24   39   63  -16   24   0.78   
45  NaN        Ondrej Palat, LW    TB   75   16   47   63   31   24   0.84   
46   40         Zach Parise, LW   MIN   74   33   29   62   21   41   0.84   

     10    11   12   13   14   15   16  
0   SOG   PCT  GWG    G    A    G    A  
1   253  13.8    6   10   13    2    3  
2   278  13.7    8   13   18    0    1  
3   237  11.8    3   10   21    0    0  
4   395  13.4   11   25    9    0    0  
5   221  10.0    3   11   22    0    0  
6   153  11.8    3    3   30    0    0  
7   280  13.2    5   13   16    0    0  
8   158  19.6    5    6   10    0    0  
9   226   8.9    5    4   21    0    0  
10  264  14.0    6    8   10    0    0  
11  NaN   NaN  NaN  NaN  NaN  NaN  NaN  
12  SOG   PCT  GWG    G    A    G    A  
13  182  17.0    3   11   15    0    0  
14  279   9.0    4   14   23    0    0  
15  101  17.8    0    5   20    0    0  
16  268  16.0    6   13   12    0    0  
17  203  14.3    6    8    9    0    0  
18  202  12.9    0    7   19    2    0  
19  261  14.2    5   19   12    0    0  
20  212  13.2    4    9   17    0    0  
21  191  13.1    6    3   10    0    2  
22  304  13.8    8    6    6    4    1  
23  NaN   NaN  NaN  NaN  NaN  NaN  NaN  
24  SOG   PCT  GWG    G    A    G    A  
25  302  12.3   10    7    4    3    2  
26  263  10.3    4    6   18    2    0  
27  192  14.6    7    6   11    2    1  
28  292   7.2    3    6   24    0    0  
29  227   7.5    3    4   24    0    0  
30  165  15.8    5    8   16    0    0  
31  131  12.2    0    4   18    0    0  
32  190  14.7    2    2   13    0    0  
33  186  14.5    5    6   16    0    0  
34  157  16.6    6    5    8    1    0  
35  NaN   NaN  NaN  NaN  NaN  NaN  NaN  
36  SOG   PCT  GWG    G    A    G    A  
37  223  10.8    5    8   16    0    0  
38  215  11.2    6    4   12    1    0  
39  167  14.4    4    8   13    0    0  
40  134  11.9    4    6   18    0    0  
41  267  11.6    7   12   11    0    0  
42  184  15.2    4    8    8    0    2  
43  237  11.0    6    6   13    0    0  
44  183  13.1    2    6   15    0    0  
45  139  11.5    5    3    8    1    1  
46  259  12.7    3   11    5    0    0  
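
The output above still has integer column labels and the heading row (RK, PLAYER, ...) repeated inside the data. As a rough sketch of one way to tidy that up after selecting [0]: the list below is only a hand-built stand-in for what read_html returns (a few rows and columns copied from the table above, with the repeated heading row moved closer to the top), so that it runs without a network connection. The names tables and df are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the list returned by pd.read_html: one DataFrame
# with integer column labels and the heading row repeated inside the data.
tables = [pd.DataFrame([
    ["RK", "PLAYER", "TEAM", "PTS"],
    ["1", "Jamie Benn, LW", "DAL", "87"],
    ["2", "John Tavares, C", "NYI", "86"],
    ["RK", "PLAYER", "TEAM", "PTS"],
    ["3", "Sidney Crosby, C", "PIT", "84"],
])]

print(type(tables))  # a plain list, so DataFrame methods are not available on it
df = tables[0]       # the DataFrame itself

# Use the first row as column names, then drop every repeated heading row.
df.columns = list(df.iloc[0])
df = df[df["RK"] != "RK"].reset_index(drop=True)

# The remaining values are still strings; convert the numeric column explicitly.
df = df.assign(PTS=pd.to_numeric(df["PTS"]))
print(df)
```

Once df is a real DataFrame, methods such as rename, dropna, and drop work as usual.
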
jezrael answered Oct 09 '22 22:10



If your DataFrame ends up with columns labelled 0, 1, 2, etc. and the headings in the first row (as above), tell pandas that the column names are in the first row with header=0.

Without this, pandas may see a mix of data types (text in the first row and numbers in the rest) and cast the column as object rather than, say, int64.

The full line, with url holding the page address, would be:

data1 = pd.read_html(url, skiprows=1, header=0)[0]

[0] selects the first table in the list of tables that read_html returns.

There are options for handling NA values as well. Check out the documentation here: https://pandas.pydata.org/docs/reference/api/pandas.read_html.html
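
To see the dtype effect described above without fetching the page, here is a small offline sketch. It swaps in pd.read_csv for pd.read_html (an assumption made purely so the example needs no HTML parser or network access); the header argument is shown on read_csv only, using a few values copied from the table in the other answer.

```python
import io
import pandas as pd

# A few rows copied from the table above. read_csv is a stand-in for
# read_html here so that the example runs offline.
csv_text = "TEAM,GP,G,A,PTS\nDAL,82,35,52,87\nNYI,82,38,48,86\nPIT,77,28,56,84\n"

# header=None: the heading row is treated as data, so each column mixes
# text ("GP") with digits and is stored as strings, not numbers.
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# header=0: the first row becomes the column names and the numeric
# columns are inferred as integers.
with_header = pd.read_csv(io.StringIO(csv_text), header=0)

print(no_header.dtypes)
print(with_header.dtypes)
```

In the first frame the columns are labelled 0 to 4 and none of them is numeric; in the second, GP, G, A, and PTS come out as integer columns.
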

InnocentBystander answered Oct 09 '22 21:10
