I have scraped web data with BeautifulSoup, but I'm having trouble turning the output into a matrix/array that I can manipulate.
from bs4 import BeautifulSoup
import urllib2

headers = { 'User-Agent' : 'Mozilla/5.0' }
req = urllib2.Request('http://statsheet.com/mcb/teams/duke/game_stats', None, headers)
html = urllib2.urlopen(req).read()
soup = BeautifulSoup(html)
#statdiv = soup.find('div', attrs={'id': 'basic_stats'}) #not needed
table = soup.find('table', attrs={'class': 'sortable statsb'})
rows = table.findAll('tr')
for tr in rows:
    text = []
    cols = tr.findAll('td')
    for td in cols:
        try:
            text = ''.join(td.find(text=True))
        except Exception:
            text = "000"
        print text+",",
    print
Note: the try/except around ''.join(td.find(text=True)) is there to prevent the program from failing on blank cells; those cells are printed as "000".
This outputs:
W, GSU, 32, 42, 74, 24-47, 51.1, 15-23, 65.2, 11-24, 45.8, 6, 25, 31, 17, 4, 6, 15, 19,
W, UK, 33, 42, 75, 26-57, 45.6, 15-22, 68.2, 8-18, 44.4, 11, 20, 31, 16, 6, 6, 8, 17,
W, FGCU, 52, 36, 88, 30-63, 47.6, 19-23, 82.6, 9-31, 29.0, 16, 21, 37, 19, 9, 4, 18, 14,
W, @MINN, 40, 49, 89, 30-55, 54.5, 21-26, 80.8, 8-10, 80.0, 10, 22, 32, 12, 12, 4, 15, 21,
W, VCU, 29, 38, 67, 20-48, 41.7, 24-27, 88.9, 3-15, 20.0, 4, 30, 34, 14, 4, 8, 8, 18,
W, Lville, 36, 40, 76, 24-55, 43.6, 23-27, 85.2, 5-20, 25.0, 8, 25, 33, 13, 8, 6, 14, 20,
W, OSU, 23, 50, 73, 24-51, 47.1, 20-27, 74.1, 5-12, 41.7, 8, 29, 37, 11, 3, 5, 8, 19,
That is perfect, but now I cannot figure out how to get the data into a matrix so I can manipulate certain columns, add new columns, etc.
I've been playing with NumPy, but every time I try I end up with something like this:
[u'W,']
[u'GSU,']
[u'32,']
[u'42,']
[u'74,']
[u'24-47,']
[u'51.1,']
[u'15-23,']
[u'65.2,']
[u'11-24,']
[u'45.8,']
What I want is to take my scraped data and be able to add columns, move columns, change text in columns, split data in one column into two columns (the hyphenated columns).
This is my second day with Python. I am assuming that putting my data into a matrix/array is the easiest way to do this. Let me know if it isn't.
You can use pandas. This example converts the scraped table into a pandas DataFrame,
which offers convenient methods for further processing, such as splitting columns or converting them to a different data type.
From the docs:
DataFrame is a 2-dimensional labeled data structure with columns of potentially
different types. You can think of it like a spreadsheet or SQL table, or a dict
of Series objects. It is generally the most commonly used pandas object. Like
Series, DataFrame accepts many different kinds of input.
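To make the quoted description concrete before the scraping example, here is a small standalone sketch. It assumes nothing about the web page: the three rows are copied by hand from the output in the question, and the column labels "result" and "opp" are names made up for this illustration. It shows two of the input kinds a DataFrame accepts, a list of rows and a dict of columns.

```python
import pandas as pd

# Sample data copied by hand from the question's output (first five cells of three games).
rows = [
    ["W", "GSU", "32", "42", "74"],
    ["W", "UK", "33", "42", "75"],
    ["W", "FGCU", "52", "36", "88"],
]
# "result" and "opp" are labels invented for this sketch; the page leaves them blank.
names = ["result", "opp", "1H", "2H", "T"]

# Input kind 1: a list of rows plus the column labels.
df_from_rows = pd.DataFrame(rows, columns=names)

# Input kind 2: a dict mapping each column label to that column's values.
columns = zip(*rows)  # transpose the rows into columns
by_name = {name: list(col) for name, col in zip(names, columns)}
df_from_dict = pd.DataFrame(by_name, columns=names)

print(df_from_rows)
print(df_from_dict["opp"].tolist())
```

Both constructions describe the same table; the dict form is the one used in the scraping code that follows.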
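import pandas as pd

# Reuses the `soup` object built in the question's code.
table = soup.find('table', attrs={'class': 'sortable statsb'})
# Column names come from the header cells.
header = [th.text for th in table.find('thead').select('th')]
# Replace the first two labels with distinct blank names ('' and ' ')
# so they do not collide as dict keys.
header[:2] = ['', ' ']
# One list of cell texts per data row.
body = [[td.text for td in row.select('td')]
        for row in table.findAll('tr', attrs={"onmouseover": "hl(this)"})]
cols = zip(*body)  # transpose rows into columns
tbl_d = {name: col for name, col in zip(header, cols)}
print(pd.DataFrame(tbl_d, columns=header))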
Output:
               1H  2H   T     FG   FG%     FT   FT%    3PT    3%  OR  DR  REB  AS  ST  B  TO  PF
 0  W     GSU  32  42  74  24-47  51.1  15-23  65.2  11-24  45.8   6  25   31  17   4  6  15  19
 1  W      UK  33  42  75  26-57  45.6  15-22  68.2   8-18  44.4  11  20   31  16   6  6   8  17
 2  W    FGCU  52  36  88  30-63  47.6  19-23  82.6   9-31  29.0  16  21   37  19   9  4  18  14
 3  W   @MINN  40  49  89  30-55  54.5  21-26  80.8   8-10  80.0  10  22   32  12  12  4  15  21
 4  W     VCU  29  38  67  20-48  41.7  24-27  88.9   3-15  20.0   4  30   34  14   4  8   8  18
 5  W  Lville  36  40  76  24-55  43.6  23-27  85.2   5-20  25.0   8  25   33  13   8  6  14  20
 6  W     OSU  23  50  73  24-51  47.1  20-27  74.1   5-12  41.7   8  29   37  11   3  5   8  19
 7  W      UD  42  46  88  35-67  52.2  11-21  52.4   7-20  35.0  10  37   47  23   9  8  11  19
 8  W  TEMPLE  46  44  90  28-59  47.5  22-29  75.9  12-20  60.0  11  27   38  19   6  2   6  15
 9  W    CORN  41  47  88  34-60  56.7  13-17  76.5   7-21  33.3   4  26   30  23  11  7  10  11
10  W    ELON  35  41  76  29-67  43.3   7-16  43.8  11-19  57.9  15  28   43  16  10  2  12  13
11  W     SCU  38  52  90  31-63  49.2  21-33  63.6   7-14  50.0  14  26   40  15   6  2  11  17
12  W    @DAV  29  38  67  21-46  45.7  20-22  90.9   5-11  45.5   6  27   33   8   5  5  12  17
13  W     WFU  41  39  80  29-63  46.0  11-22  50.0  11-24  45.8  10  25   35  22   6  4   6  16
14  W    CLEM  25  43  68  27-56  48.2   6-14  42.9   8-15  53.3  13  29   42  13   8  1  13  12
15  L   @NCSU  39  37  76  30-67  44.8  10-12  83.3   6-20  30.0  13  22   35  10   6  4  12  20
16  W      GT  27  46  73  26-65  40.0  11-16  68.8  10-21  47.6  15  25   40  12  10  5  13  18
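The question also asks about splitting the hyphenated columns, changing text, and moving columns. Here is a minimal sketch of those steps, assuming a DataFrame shaped like the one above. To stay self-contained it uses three rows typed in by hand rather than the live scrape, and the names "opp", "FGM" and "FGA" are invented for the illustration.

```python
import pandas as pd

# Three games copied by hand from the output above; every scraped cell is a string.
df = pd.DataFrame(
    {
        "opp": ["GSU", "UK", "@MINN"],
        "T": ["74", "75", "89"],
        "FG": ["24-47", "26-57", "30-55"],
        "FG%": ["51.1", "45.6", "54.5"],
    }
)

# Convert the numeric columns so arithmetic works on them.
df["T"] = df["T"].astype(int)
df["FG%"] = df["FG%"].astype(float)

# Split "made-attempted" into two integer columns.
made_attempted = df["FG"].str.split("-", expand=True).astype(int)
df["FGM"] = made_attempted[0]
df["FGA"] = made_attempted[1]

# Change text in a column: strip the "@" that prefixes some opponent names.
df["opp"] = df["opp"].str.lstrip("@")

# Move columns: select them in the desired order, leaving out the original "FG".
df = df[["opp", "T", "FGM", "FGA", "FG%"]]

print(df)
```

The same pattern should carry over to the "FT" and "3PT" columns, since they use the same hyphenated format.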