HTML Parsing using bs4

Question

I am parsing an HTMl page and am having a hard time figuring out how to pull a certain 'p' tag without a class or on id. I am trying to reach the tag of 'p' with the lat and long. Here is my current code:

 import bs4
 from urllib import urlopen as uReq #this opens the URL
 from bs4 import BeautifulSoup as soup #parses/cuts  the html

 my_url = 'http://www.fortwiki.com/Battery_Adair'
 print(my_url)
 uClient = uReq(my_url) #opens the HTML and stores it in uClients

 page_html = uClient.read() # reads the URL
 uClient.close() # closes the URL

 page_soup = soup(page_html, "html.parser") #parses/cuts the HTML
 containers = page_soup.find_all("table")
 for container in containers:
    title = container.tr.p.b.text.strip()
    history = container.tr.p.text.strip()
      lat_long = container.tr.table
       print(title)
       print(history)
       print(lat_long)

Link to website: http://www.fortwiki.com/Battery_Adair

t.m.adam · Accepted Answer

The <p> tag you're looking for is very common in the document, and it doesn't have any unique attributes, so we can't select it directly.

A possible solution would be to select the tag by index, as in bloopiebloopie's answer.
However that won't work unless you know the exact position of the tag.

Another possible solution would be to find a neighbouring tag that has distinguishing attributes/text and select our tag in relation to that.
In this case we can find the previous tag with text: "Maps & Images", and use find_next to select the next tag.

import requests
from bs4 import BeautifulSoup

url = 'http://www.fortwiki.com/Battery_Adair'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

b = soup.find('b', text='Maps & Images')
if b:
    lat_long = b.find_next().text

This method should find the coordinates data in any www.fortwiki.com page that has a map.

Keyur Potdar · Answer

You can use re to match partial text inside a tag.

import re
import requests
from bs4 import BeautifulSoup

url = 'http://www.fortwiki.com/Battery_Adair'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

lat_long = soup.find('p', text=re.compile('Lat:\s\d+\.\d+\sLong:')).text
print(lat_long)
# Lat: 24.5477038 Long: -81.8104541

HTML Parsing using bs4

Tags:

python

beautifulsoup

Vlad Bogza

2 Answers

t.m.adam

Keyur Potdar

Recent Activity

Donate For Us

HTML Parsing using bs4

Tags:

python

beautifulsoup

Vlad Bogza

2 Answers

t.m.adam

Keyur Potdar

Related questions

Recent Activity

Donate For Us