
Parsing a table with BeautifulSoup and writing it to a text file

I need the data from a table written to a text file (output.txt) in this format: data1;data2;data3;data4;.....

Celkova podlahova plocha bytu;33m;Vytah;Ano;Nadzemne podlazie;Prizemne podlazie;.....;Forma vlastnictva;Osobne

Everything should be on one line, separated by ";" (I will export it to a CSV file later).

I'm a beginner. Any help is appreciated, thanks.

from BeautifulSoup import BeautifulSoup
import urllib2
import codecs

response = urllib2.urlopen('http://www.reality.sk/zakazka/0747-003578/predaj/1-izb-byt/kosice-mestska-cast-sever-sladkovicova-kosice-sever/art-real-1-izb-byt-sladkovicova-ul-kosice-sever')
html = response.read()
soup = BeautifulSoup(html)

tabulka = soup.find("table", {"class" : "detail-char"})

for row in tabulka.findAll('tr'):
    col = row.findAll('td')
    prvy = col[0].string.strip()
    druhy = col[1].string.strip()
    record = ([prvy], [druhy])

fl = codecs.open('output.txt', 'wb', 'utf8')
for rec in record:
    line = ''
    for val in rec:
        line += val + u';'
    fl.write(line + u'\r\n')
fl.close()

asked Feb 08 '10 20:02 by parenthesis




1 Answer

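You are not keeping each record as you read it in: the loop reassigns record on every pass, so only the last row is left when you write the file. Try this, which appends each row to a list named records: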

from BeautifulSoup import BeautifulSoup
import urllib2
import codecs

response = urllib2.urlopen('http://www.reality.sk/zakazka/0747-003578/predaj/1-izb-byt/kosice-mestska-cast-sever-sladkovicova-kosice-sever/art-real-1-izb-byt-sladkovicova-ul-kosice-sever')
html = response.read()
soup = BeautifulSoup(html)

tabulka = soup.find("table", {"class" : "detail-char"})

records = [] # store all of the records in this list
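for row in tabulka.findAll('tr'):
    col = row.findAll('td')
    if len(col) < 2:
        continue # skip rows without two data cells (e.g. header rows)
    prvy = col[0].string.strip()  # first cell: the label
    druhy = col[1].string.strip() # second cell: the value
    record = '%s;%s' % (prvy, druhy) # store the record with a ';' between prvy and druhy
    records.append(record)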

fl = codecs.open('output.txt', 'wb', 'utf8')
line = ';'.join(records)
fl.write(line + u'\r\n')
fl.close()

This could be cleaned up further, but I think it does what you want.
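
Since the question mentions a later CSV export, one possible cleanup is to let the standard csv module build the line instead of joining strings by hand. The sketch below assumes Python 3 (the code above is Python 2) and assumes the label/value pairs have already been collected; flatten_pairs, write_record and the sample list are hypothetical names made up for illustration, not part of the original code.

```python
import csv
import io


def flatten_pairs(pairs):
    """Turn [(label, value), ...] into [label, value, label, value, ...]."""
    return [cell for pair in pairs for cell in pair]


def write_record(pairs, stream):
    """Write all pairs as a single ';'-separated row to an open text stream."""
    # csv.writer quotes any cell that itself contains ';' or a quote
    # character, which a plain ';'.join() would not do.
    writer = csv.writer(stream, delimiter=';')
    writer.writerow(flatten_pairs(pairs))


# Hypothetical sample data standing in for the (prvy, druhy) values
# collected from the table rows.
sample = [
    ("Celkova podlahova plocha bytu", "33m"),
    ("Vytah", "Ano"),
    ("Forma vlastnictva", "Osobne"),
]

buffer = io.StringIO()
write_record(sample, buffer)
print(buffer.getvalue(), end="")
```

To write to a real file instead of the in-memory buffer, something like open('output.txt', 'w', newline='', encoding='utf-8') should work as the stream; newline='' is what the csv documentation recommends so the writer controls the line ending itself.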

answered Sep 28 '22 20:09 by pwdyson