There are lots of HTML pages structured as a sequence of groups like this:
<p>
<b> Keywords/Category:</b>
"keyword_a, keyword_b"
</p>
The addresses of these pages are like https://some.page.org/year/0001, https://some.page.org/year/0002, etc.
How can I extract the keywords separately from each of these pages? I've tried BeautifulSoup, but unsuccessfully: so far I've only written a program that prints the group titles (the text between <b>
and </b>
).
from bs4 import BeautifulSoup
from urllib.request import urlopen  # urllib2 is Python 2 only

html_doc = urlopen('https://some.page.org/2018/1234').read()
soup = BeautifulSoup(html_doc, 'html.parser')

for link in soup.find_all('a'):
    print('https://some.page.org' + link.get('href'))

for node in soup.find_all('b'):
    print(''.join(node.find_all(text=True)))
I can't test this without knowing the actual source format, but it seems you want the text value of the <p>
tags:
for node in soup.find_all('p'):
    print(node.text)
    # or: keywords = node.text.split(', ')
    # print(keywords)
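If the page mixes several labeled groups, you can also pair each <b> label with the text node that follows it. A minimal sketch, assuming the <p>/<b> structure shown in the question (the HTML below is an invented sample, not the real page):

```python
from bs4 import BeautifulSoup

# Invented sample mimicking the structure from the question.
html_doc = """
<p><b>Keywords/Category:</b> "keyword_a, keyword_b"</p>
<p><b>Other label:</b> "keyword_c"</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

for b in soup.find_all('b'):
    label = b.get_text(strip=True)
    # The keyword string is the text node immediately after the <b> tag.
    value = b.next_sibling.strip().strip('"')
    keywords = [k.strip() for k in value.split(',')]
    print(label, keywords)
```

This way you get each group's title and its keyword list together, instead of titles and values in separate loops.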
You need to split your string, which in this case is the URL, on /.
Then you can pick the chunks you want.
For example, if the URL is https://some.page.org/year/0001, call split('/') to break the URL at each / sign.
That gives you a list; choose the pieces you need, and join them back into a string with str.join() if needed.
You can read about the split method in this link.
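The splitting idea above can be sketched like this, using the example URL from the question (which indices you keep depends on the real URL layout):

```python
url = 'https://some.page.org/year/0001'

parts = url.split('/')
# parts is ['https:', '', 'some.page.org', 'year', '0001']

# Pick the chunks you want, e.g. the last two path segments.
year = parts[-2]
page_id = parts[-1]
print(year, page_id)  # year 0001

# Join chosen chunks back into a string.
path = '/'.join(parts[-2:])
print(path)  # year/0001
```

Note that splitting on '/' also produces an empty string after 'https:', because of the double slash, so indexing from the end of the list is usually safer here.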