Parsing HTML with Python requests

Question

I'm not a coder, but I need to implement a simple HTML parser.

After some basic research, I was able to put together the following from an example:

from lxml import html
import requests

page = requests.get('https://URL.COM')
tree = html.fromstring(page.content)

# This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
# This will create a list of prices:
prices = tree.xpath('//span[@class="item-price"]/text()')

print('Buyers: ', buyers)
print('Prices: ', prices)

How can I use tree.xpath to extract all strings that start with "://" and end with ".com.br"?

C.Nivs · Accepted Answer

As @nosklo pointed out here, you are looking for anchor tags and the links in their href attributes. A parse tree is organized by the HTML elements themselves, so you find text by searching for those elements specifically. For URLs, this would look like the following (using the lxml library in Python 3.6):

from lxml import etree
from io import StringIO
import requests

# Set explicit HTMLParser
parser = etree.HTMLParser()

page = requests.get('https://URL.COM')

# Decode the page content from bytes to string
html = page.content.decode("utf-8")

# Create your etree from a StringIO object, which behaves like an
# open file object
tree = etree.parse(StringIO(html), parser=parser)

# Call this function and pass in your tree
def get_links(tree):
    # This will get the anchor tags <a href...>
    refs = tree.xpath("//a")
    # Get the url from the ref
    links = [link.get('href', '') for link in refs]
    # Return only the links that end with .com.br
    return [link for link in links if link.endswith('.com.br')]


# Example call
links = get_links(tree)
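
One thing to watch for: the filter above keeps only hrefs whose entire string ends with .com.br, so a link such as https://example.com.br/contact would be dropped because of the trailing path. If the goal is really "any link that points to a .com.br site", a possible variation is to compare against the hostname instead. The following is a minimal sketch under that assumption, using only the standard library's urllib.parse; the helper names is_com_br and filter_com_br and the sample URLs are hypothetical and not part of the original answer.

```python
from urllib.parse import urlparse


def is_com_br(url):
    """Return True if the URL's hostname ends with .com.br."""
    # urlparse only fills in hostname when the URL has a scheme or
    # starts with //, so relative links such as "/about" yield None
    host = urlparse(url).hostname
    return host is not None and host.endswith('.com.br')


def filter_com_br(links):
    # Keep the original order and drop every link that is not on a .com.br host
    return [link for link in links if is_com_br(link)]


# Hypothetical hrefs, standing in for the values collected from the anchor tags
sample = [
    'https://loja.example.com.br/produto?id=1',
    'http://example.com.br',
    'https://example.com/page.com.br',
    '/relative/path',
    '',
]
result = filter_com_br(sample)
print(result)
```

With the answer's code, this could be applied to the list of href values before returning them from get_links, in place of the endswith check on the whole string.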

Tags:

python

parsing

html-parsing

Daniel Oliveira
