
How to extract text from an HTML page?

For example, take the web page at this link:

https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50

I need the name, address, and website of each firm. I tried the following to convert the HTML to text:

import nltk   
from urllib import urlopen

url = "https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50"
html = urlopen(url).read()    
raw = nltk.clean_html(html)  
print(raw)

But it returns the error:

ImportError: cannot import name 'urlopen'
Nique asked Dec 05 '22 02:12



1 Answer

Peter Wood has already answered the import problem (link): in Python 3, urlopen lives in urllib.request, not in urllib itself.

import urllib.request

uf = urllib.request.urlopen(url)
html = uf.read()
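
Note that read() returns bytes in Python 3, so the result usually has to be decoded before it is treated as text. A minimal sketch of that step follows; the helper name fetch_html and its timeout parameter are my own choices for illustration, not part of the linked answer, and falling back to UTF-8 when the server declares no charset is an assumption.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Download a page and return its body as a str (Python 3).

    fetch_html and timeout are illustrative names, not from the question.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes in Python 3
        # Use the charset declared in the Content-Type header, if any;
        # falling back to UTF-8 is an assumption.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


# Example call (needs network access):
# html = fetch_html("https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50")
```

The with block closes the connection even if decoding fails later on.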

But if you want to extract data (such as the firm's name, address, and website), you will need to fetch the HTML source and parse it with an HTML parser.

I'd suggest using requests to fetch the HTML source and BeautifulSoup to parse it and extract the text you need.

Here is a small snippet to give you a head start.

import requests
from bs4 import BeautifulSoup

link = "https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50"

html = requests.get(link).text

# If you would rather not use requests, fetch the page with urllib.request
# (the snippet above) instead; the parsing code below works either way.
soup = BeautifulSoup(html, "lxml")  # the "lxml" parser requires the lxml package
res = soup.find_all("article", {"class": "listingItem"})
for r in res:
    print("Company Name: " + r.find('a').text)
    print("Address: " + r.find("div", {'class': 'address'}).text)
    # The website is taken from the fourth "pageMeta-item" div in each listing.
    print("Website: " + r.find_all("div", {'class': 'pageMeta-item'})[3].text)
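
If all you need is the page's visible text, which is what the nltk.clean_html call in the question was after (recent NLTK releases no longer implement that function), the standard library's html.parser can do a rough job without extra packages. This is only a sketch, not a replacement for BeautifulSoup: the names TextExtractor and html_to_text are made up for illustration, and the sample markup is invented rather than taken from the real page.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def get_text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the text content of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()


# Invented sample markup, only to show the behaviour.
sample = """
<html><head><title>Demo</title><style>p {color: red;}</style></head>
<body><h1>Example Practice</h1><p>1 Sample Street, London</p>
<script>var x = 1;</script></body></html>
"""
print(html_to_text(sample))
```

This gives you unstructured text only. To pull out specific fields such as name, address, and website, selecting elements by tag and class as in the BeautifulSoup snippet above is the better fit.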
JRodDynamite answered Dec 24 '22 08:12
