Turning BeautifulSoup output into a matrix

I have scraped web data with BeautifulSoup, but I'm having trouble turning the output into a matrix/array that I can manipulate.

from bs4 import BeautifulSoup
import urllib2

headers = { 'User-Agent' : 'Mozilla/5.0' }
req = urllib2.Request('http://statsheet.com/mcb/teams/duke/game_stats', None, headers)
html = urllib2.urlopen(req).read()
soup = BeautifulSoup(html)

#statdiv = soup.find('div', attrs={'id': 'basic_stats'})  #not needed
table = soup.find('table', attrs={'class': 'sortable statsb'})
rows = table.findAll('tr')
for tr in rows:
  cols = tr.findAll('td')
  for td in cols:
    try:
      text = ''.join(td.find(text=True))
    except Exception:
      text = "000"  # placeholder for blank cells
    print text+",",
  print

Note: the try/except around ''.join(td.find(text=True)) is there to prevent the program from failing on blank cells, where td.find(text=True) returns None.

which outputs:

W, GSU, 32, 42, 74, 24-47, 51.1, 15-23, 65.2, 11-24, 45.8, 6, 25, 31, 17, 4, 6, 15, 19,
W, UK, 33, 42, 75, 26-57, 45.6, 15-22, 68.2, 8-18, 44.4, 11, 20, 31, 16, 6, 6, 8, 17,
W, FGCU, 52, 36, 88, 30-63, 47.6, 19-23, 82.6, 9-31, 29.0, 16, 21, 37, 19, 9, 4, 18, 14,
W, @MINN, 40, 49, 89, 30-55, 54.5, 21-26, 80.8, 8-10, 80.0, 10, 22, 32, 12, 12, 4, 15, 21,
W, VCU, 29, 38, 67, 20-48, 41.7, 24-27, 88.9, 3-15, 20.0, 4, 30, 34, 14, 4, 8, 8, 18,
W, Lville, 36, 40, 76, 24-55, 43.6, 23-27, 85.2, 5-20, 25.0, 8, 25, 33, 13, 8, 6, 14, 20,
W, OSU, 23, 50, 73, 24-51, 47.1, 20-27, 74.1, 5-12, 41.7, 8, 29, 37, 11, 3, 5, 8, 19,

That output is perfect, but now I cannot figure out how to get the data into a matrix so I can manipulate certain columns, add new columns, etc.

I've been playing with numpy, but every time I try I end up with something like this:

[u'W,']
[u'GSU,']
[u'32,']
[u'42,']
[u'74,']
[u'24-47,']
[u'51.1,']
[u'15-23,']
[u'65.2,']
[u'11-24,']
[u'45.8,']

What I want is to take my scraped data and be able to add columns, move columns, change text in columns, split data in one column into two columns (the hyphenated columns).

This is my second day with Python. I am assuming that putting my data into a matrix/array is the easiest way to do this. Let me know if it isn't.

L Shaw asked Jan 23 '13 19:01


1 Answer

You can use pandas. This example converts the data into a pandas DataFrame, which offers convenient methods for further processing, such as splitting columns or converting them to a different data type.


From the docs:

DataFrame is a 2-dimensional labeled data structure with columns of potentially
different types. You can think of it like a spreadsheet or SQL table, or a dict
of Series objects. It is generally the most commonly used pandas object. Like
Series, DataFrame accepts many different kinds of input.
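
As a small illustration of the "many different kinds of input" mentioned in the quote, the sketch below builds the same frame two ways: from a list of rows and from a dict of columns. The three sample rows are copied from the scraped output in the question and trimmed to four columns; the labels `Result` and `Opp` are made up here, since the real table leaves those headers blank. It assumes Python 3 and a recent pandas rather than the Python 2 used elsewhere in this thread.

```python
import pandas as pd

# Sample rows copied from the scraped output, trimmed to four columns.
# "Result" and "Opp" are invented labels; the source table has blank headers there.
columns = ["Result", "Opp", "1H", "2H"]
rows = [
    ["W", "GSU", "32", "42"],
    ["W", "UK", "33", "42"],
    ["W", "FGCU", "52", "36"],
]

# 1) From a list of rows: one inner list per table row.
df_rows = pd.DataFrame(rows, columns=columns)

# 2) From a dict of columns: zip(*rows) transposes rows into columns.
by_column = dict(zip(columns, zip(*rows)))
df_cols = pd.DataFrame(by_column, columns=columns)

print(df_rows)
print(df_cols)
```

Both calls should give the same table, so the list-of-rows form is probably the simplest if the scraper already collects one list per `tr`.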

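import pandas as pd

table  = soup.find('table', attrs={'class': 'sortable statsb'})
header = [th.text for th in table.find('thead').select('th')]
# give the first two columns distinct blank labels so they stay
# separate keys in the dict built below
header[:2] = ['', ' ']
# select only the rows that carry the onmouseover attribute
body   = [[td.text for td in row.select('td')]
          for row in table.findAll('tr', attrs={"onmouseover": "hl(this)"})]
cols   = zip(*body)                # transpose rows into columns
tbl_d  = dict(zip(header, cols))   # column label -> column values

print(pd.DataFrame(tbl_d, columns=header))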

output:

              1H  2H   T     FG   FG%     FT   FT%    3PT    3%  OR  DR REB  AS  ST  B  TO  PF
0   W     GSU  32  42  74  24-47  51.1  15-23  65.2  11-24  45.8   6  25  31  17   4  6  15  19
1   W      UK  33  42  75  26-57  45.6  15-22  68.2   8-18  44.4  11  20  31  16   6  6   8  17
2   W    FGCU  52  36  88  30-63  47.6  19-23  82.6   9-31  29.0  16  21  37  19   9  4  18  14
3   W   @MINN  40  49  89  30-55  54.5  21-26  80.8   8-10  80.0  10  22  32  12  12  4  15  21
4   W     VCU  29  38  67  20-48  41.7  24-27  88.9   3-15  20.0   4  30  34  14   4  8   8  18
5   W  Lville  36  40  76  24-55  43.6  23-27  85.2   5-20  25.0   8  25  33  13   8  6  14  20
6   W     OSU  23  50  73  24-51  47.1  20-27  74.1   5-12  41.7   8  29  37  11   3  5   8  19
7   W      UD  42  46  88  35-67  52.2  11-21  52.4   7-20  35.0  10  37  47  23   9  8  11  19
8   W  TEMPLE  46  44  90  28-59  47.5  22-29  75.9  12-20  60.0  11  27  38  19   6  2   6  15
9   W    CORN  41  47  88  34-60  56.7  13-17  76.5   7-21  33.3   4  26  30  23  11  7  10  11
10  W    ELON  35  41  76  29-67  43.3   7-16  43.8  11-19  57.9  15  28  43  16  10  2  12  13
11  W     SCU  38  52  90  31-63  49.2  21-33  63.6   7-14  50.0  14  26  40  15   6  2  11  17
12  W    @DAV  29  38  67  21-46  45.7  20-22  90.9   5-11  45.5   6  27  33   8   5  5  12  17
13  W     WFU  41  39  80  29-63  46.0  11-22  50.0  11-24  45.8  10  25  35  22   6  4   6  16
14  W    CLEM  25  43  68  27-56  48.2   6-14  42.9   8-15  53.3  13  29  42  13   8  1  13  12
15  L   @NCSU  39  37  76  30-67  44.8  10-12  83.3   6-20  30.0  13  22  35  10   6  4  12  20
16  W      GT  27  46  73  26-65  40.0  11-16  68.8  10-21  47.6  15  25  40  12  10  5  13  18
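
The question also asks about splitting the hyphenated columns and adding new ones. Here is a minimal sketch of that step, assuming Python 3 and a recent pandas, with three rows copied from the output above. The column names `Opp`, `FG_made`, `FG_att`, and `FG_pct_calc` are invented for illustration and do not come from the scraped table.

```python
import pandas as pd

# Three rows copied from the output above; every scraped cell is a string.
df = pd.DataFrame(
    {
        "Opp": ["GSU", "UK", "FGCU"],  # invented label for the opponent column
        "FG": ["24-47", "26-57", "30-63"],
        "FG%": ["51.1", "45.6", "47.6"],
    }
)

# Split "made-attempted" into two integer columns.
made_att = df["FG"].str.split("-", expand=True).astype(int)
df["FG_made"] = made_att[0]
df["FG_att"] = made_att[1]

# Convert the percentage strings to numbers before doing arithmetic.
df["FG%"] = pd.to_numeric(df["FG%"])

# Add a derived column that can be compared with the scraped percentage.
df["FG_pct_calc"] = (100 * df["FG_made"] / df["FG_att"]).round(1)

print(df)
```

Columns can then be reordered by indexing with a list of names, for example `df[["Opp", "FG_made", "FG_att", "FG%"]]`.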
root answered Nov 08 '22 18:11