
Parse HTML page to get contents of <p> and <b> tags

There are lots of HTML pages that are structured as a sequence of groups like this:

<p>
   <b> Keywords/Category:</b>
   "keyword_a, keyword_b"
</p>

The addresses of these pages are like https://some.page.org/year/0001, https://some.page.org/year/0002, etc.

How can I extract the keywords separately from each of these pages? I've tried to use BeautifulSoup, but without success. So far I've only written a program that prints the titles of the groups (the text between <b> and </b>).

from bs4 import BeautifulSoup
from urllib2 import urlopen
import re
html_doc = urlopen('https://some.page.org/2018/1234').read()
soup = BeautifulSoup(html_doc)
for link in soup.find_all('a'):
    print 'https://some.page.org'+link.get('href')
for node in soup.findAll('b'):
    print ''.join(node.findAll(text=True))
Rinat Shakirov asked Dec 31 '18 20:12




2 Answers

I can't test this without knowing the actual page source, but it seems you want the text value of the <p> tags:

for node in soup.findAll('p'):
    print(node.text)
    # or: keywords = node.text.split(', ')
    # print(keywords)
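
If the pages really follow the structure shown in the question, a minimal sketch along these lines might separate each <b> label from the keywords that follow it. It is written for Python 3 and runs on an inline sample rather than a live page. The function name `extract_groups` and the sample string are made up for illustration, and the surrounding double quotes are assumed to be either literal text or an artifact of how the browser inspector displays text nodes, so they are simply stripped.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the structure in the question (not a real page).
html_doc = """
<p>
   <b> Keywords/Category:</b>
   "keyword_a, keyword_b"
</p>
"""


def extract_groups(html):
    """Map each <b> label to the list of comma-separated values in its <p>."""
    soup = BeautifulSoup(html, "html.parser")
    groups = {}
    for p in soup.find_all("p"):
        b = p.find("b")
        if b is None:
            continue  # skip paragraphs that have no bold label
        # Detach the <b> tag so only the remaining text stays in the <p>.
        label = b.extract().get_text(strip=True).rstrip(":")
        # Assumption: any surrounding double quotes are not part of the data.
        value = p.get_text(strip=True).strip('"')
        groups[label] = [item.strip() for item in value.split(",")]
    return groups


print(extract_groups(html_doc))
```

For a real page, the string returned by `urlopen(...).read()` could be passed to `extract_groups` in place of `html_doc`. Whether the result is useful depends on the actual markup matching the sample.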
Danielle M. answered Sep 21 '22 09:09


You need to split your string, which in this case is the URL, on the / character, and then you can pick the chunks you want.

For example, if the URL is https://some.page.org/year/0001, I use the split function to split it on the / sign.

That converts it to a list; I then pick the items I need and convert them back to a string with the ''.join() method. You can read about the split method in this link.
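
As a rough illustration of the splitting described above, the sketch below uses the example URL from the question. The variable names are arbitrary, and it joins with "/" rather than an empty string so that the slashes are restored when the URL is rebuilt; that choice is an assumption about what is wanted.

```python
url = "https://some.page.org/year/0001"

# Splitting on "/" yields the scheme, an empty string (from "//"),
# the host, and then the path segments.
parts = url.split("/")
print(parts)  # ['https:', '', 'some.page.org', 'year', '0001']

# Pick the chunks of interest from the end of the list.
year, page_id = parts[-2], parts[-1]

# Rebuild a URL for the next page by swapping the last chunk.
rebuilt = "/".join(parts[:-1] + ["0002"])
print(rebuilt)  # https://some.page.org/year/0002
```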

Mohammad Ansari answered Sep 18 '22 09:09