There are lots of HTML pages structured as a sequence of groups like this:
<p>
<b> Keywords/Category:</b>
"keyword_a, keyword_b"
</p>
The addresses of these pages are like https://some.page.org/year/0001, https://some.page.org/year/0002, etc.
How can I extract the keywords separately from each of these pages? I've tried BeautifulSoup, but unsuccessfully: so far I've only written a program that prints the group titles (the text between <b>
and </b>
).
from bs4 import BeautifulSoup
from urllib.request import urlopen  # urllib2 is Python 2 only

html_doc = urlopen('https://some.page.org/2018/1234').read()
soup = BeautifulSoup(html_doc, 'html.parser')

for link in soup.find_all('a'):
    print('https://some.page.org' + link.get('href'))

for node in soup.find_all('b'):
    print(''.join(node.find_all(text=True)))
I can't test this without knowing the actual source format, but it seems you want the text value of the <p>
tags:
for node in soup.find_all('p'):
    print(node.text)
    # or: keywords = node.text.split(', ')
    # print(keywords)
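If the page mixes several labeled groups, you can also pair each <b> label with the text node that follows it. A minimal sketch, assuming the <p>/<b> structure shown in the question (the HTML below is an invented sample, not the real page):

```python
from bs4 import BeautifulSoup

# Invented sample mimicking the structure from the question.
html_doc = """
<p><b>Keywords/Category:</b> "keyword_a, keyword_b"</p>
<p><b>Other label:</b> "keyword_c"</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

for b in soup.find_all('b'):
    label = b.get_text(strip=True)
    # The keyword string is the text node immediately after the <b> tag.
    value = b.next_sibling.strip().strip('"')
    keywords = [k.strip() for k in value.split(',')]
    print(label, keywords)
```

This way you get each group's title and its keyword list together, instead of titles and values in separate loops.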
You need to split your string, which in this case is the URL, on /.
Then you can pick the chunks you want.
For example, if the URL is https://some.page.org/year/0001, call split('/') to break the URL at each / sign.
That gives you a list; choose the pieces you need, and join them back into a string with str.join() if needed.
You can read about the split method in this link.
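The splitting idea above can be sketched like this, using the example URL from the question (which indices you keep depends on the real URL layout):

```python
url = 'https://some.page.org/year/0001'

parts = url.split('/')
# parts is ['https:', '', 'some.page.org', 'year', '0001']

# Pick the chunks you want, e.g. the last two path segments.
year = parts[-2]
page_id = parts[-1]
print(year, page_id)  # year 0001

# Join chosen chunks back into a string.
path = '/'.join(parts[-2:])
print(path)  # year/0001
```

Note that splitting on '/' also produces an empty string after 'https:', because of the double slash, so indexing from the end of the list is usually safer here.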