from bs4 import BeautifulSoup
import codecs
import sys
import urllib.request
site_response= urllib.request.urlopen("http://site/")
html=site_response.read()
file = open ("cars.html","wb") #open file in binary mode
file.write(html)
file.close()
soup = BeautifulSoup(open("cars.html"))
output = (soup.prettify('latin'))
#print(output) #prints whole file for testing
file_output = open ("cars_out.txt","wb")
file_output.write(output)
file_output.close()
fulllist=soup.find_all("div", class_="row vehicle")
#print(fulllist) #prints each row vehicle class for debug
for item in fulllist:
    item_print=item.find("span", class_="modelYearSort").string
    item_print=item_print + "|" + item.find("span", class_="mmtSort").string
    seller_phone=item.find("span", class_="seller-phone")
    print(seller_phone)
    # item_print=item_print + "|" + item.find("span", class_="seller-phone").string
    item_print=item_print + "|" + item.find("span", class_="priceSort").string
    item_print=item_print + "|" + item.find("span", class_="milesSort").string
    print(item_print)
I have the code above. It parses some HTML and generates a pipe-delimited file. It works fine, except that a few entries are missing one of the elements (seller-phone) in the HTML. Not all entries have a seller phone number.
item.find("span", class_="seller-phone").string
I get a failure here. I am not surprised that this line fails when seller-phone is missing. The error is: AttributeError: 'NoneType' object has no attribute 'string'.
I am able to call item.find without the .string and get back the full block of HTML, but I cannot figure out how to extract the text in those cases.
find() method: find() returns the first tag that matches the given name or attributes as a bs4 Tag object, or None if nothing matches.
find_all() returns all the tags and strings that match your filters.
Beautiful Soup provides the find() and find_all() functions to pull specific data out of an HTML file by passing in the tag to look for: find() returns the first matching element, and find_all() returns all matching elements.
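A minimal sketch of that difference, assuming a made-up HTML snippet (the markup and values below are illustrative, not taken from the real page): find() gives back a single Tag or None, while find_all() gives back a list-like result that is simply empty when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page will differ.
html = """
<div class="row vehicle">
  <span class="priceSort">$5,000</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

price = soup.find("span", class_="priceSort")           # a Tag
phone = soup.find("span", class_="seller-phone")        # None: no such span
phones = soup.find_all("span", class_="seller-phone")   # empty result, not None

print(price.string)
print(phone)
print(len(phones))
```

Because find() can return None, any attribute access chained directly onto it (such as .string) needs a check first.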
You're correct, soup.find returns None if there's no element found. You can just put an if/else clause to avoid this:
for item in fulllist:
    span = item.find("span", class_="modelYearSort")
    if span:
        item_print = span.string
        item_print=item_print + "|" + item.find("span", class_="mmtSort").string
        seller_phone=item.find("span", class_="seller-phone")
        print(seller_phone)
        # item_print=item_print + "|" + item.find("span", class_="seller-phone").string
        item_print=item_print + "|" + item.find("span", class_="priceSort").string
        item_print=item_print + "|" + item.find("span", class_="milesSort").string
        print(item_print)
    else:
        continue #It's empty, go on to the next loop.
Or, if you prefer, use a try/except block:
for item in fulllist:
    try:
        item_print=item.find("span", class_="modelYearSort").string
    except AttributeError:
        continue #skip to the next loop.
    else:
        item_print=item_print + "|" + item.find("span", class_="mmtSort").string
        seller_phone=item.find("span", class_="seller-phone")
        print(seller_phone)
        # item_print=item_print + "|" + item.find("span", class_="seller-phone").string
        item_print=item_print + "|" + item.find("span", class_="priceSort").string
        item_print=item_print + "|" + item.find("span", class_="milesSort").string
        print(item_print)
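The same None check can also be applied to the seller-phone span itself, so that a row with no phone number keeps an empty field instead of being skipped. The following is only a sketch under assumptions: the HTML is invented for illustration, and span_text is a hypothetical helper, not part of Beautiful Soup. It uses get_text() rather than .string, since .string is None whenever a tag has more than one child.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the second entry has no seller-phone span.
html = """
<div class="row vehicle">
  <span class="modelYearSort">2010</span>
  <span class="mmtSort">Honda Civic</span>
  <span class="seller-phone">555-0100</span>
  <span class="priceSort">$5,000</span>
  <span class="milesSort">80,000</span>
</div>
<div class="row vehicle">
  <span class="modelYearSort">2012</span>
  <span class="mmtSort">Ford Focus</span>
  <span class="priceSort">$6,500</span>
  <span class="milesSort">60,000</span>
</div>
"""


def span_text(item, css_class, default=""):
    """Hypothetical helper: text of the first matching span, or default if absent."""
    span = item.find("span", class_=css_class)
    if span is None:
        return default
    return span.get_text(strip=True)


soup = BeautifulSoup(html, "html.parser")
fields = ["modelYearSort", "mmtSort", "seller-phone", "priceSort", "milesSort"]

rows = []
for item in soup.find_all("div", class_="row vehicle"):
    # Every field goes through the same None-safe lookup.
    rows.append("|".join(span_text(item, name) for name in fields))

for row in rows:
    print(row)
```

With this approach the missing phone shows up as two adjacent pipes, so every output line keeps the same number of columns.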
Hope this helps!