
HTML Parsing using bs4

I am parsing an HTML page and am having a hard time figuring out how to pull a certain 'p' tag that has neither a class nor an id. I am trying to reach the 'p' tag that contains the lat and long. Here is my current code:

 import bs4
 from urllib.request import urlopen as uReq  # opens the URL (on Python 2 this was `from urllib import urlopen`)
 from bs4 import BeautifulSoup as soup  # parses the HTML

 my_url = 'http://www.fortwiki.com/Battery_Adair'
 print(my_url)
 uClient = uReq(my_url)  # opens the connection and stores it in uClient

 page_html = uClient.read()  # reads the page contents
 uClient.close()  # closes the connection

 page_soup = soup(page_html, "html.parser")  # parses the HTML
 containers = page_soup.find_all("table")
 for container in containers:
     title = container.tr.p.b.text.strip()
     history = container.tr.p.text.strip()
     lat_long = container.tr.table
     print(title)
     print(history)
     print(lat_long)

Link to website: http://www.fortwiki.com/Battery_Adair

Vlad Bogza asked Dec 23 '22 08:12


2 Answers

The <p> tag you're looking for is very common in the document, and it doesn't have any unique attributes, so we can't select it directly.

A possible solution would be to select the tag by index, as in bloopiebloopie's answer.
However, that only works if you know the exact position of the tag in the document.

Another possible solution would be to find a neighbouring tag that has distinguishing attributes or text and select our tag relative to it.
In this case, we can find the preceding tag with the text "Maps & Images" and use find_next to select the tag that follows it.

import requests
from bs4 import BeautifulSoup

url = 'http://www.fortwiki.com/Battery_Adair'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

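# string= replaces the older text= argument (bs4 4.4.0 and later)
b = soup.find('b', string='Maps & Images')
lat_long = None  # stays None if the page has no "Maps & Images" label
if b:
    lat_long = b.find_next().text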

This method should find the coordinates data in any www.fortwiki.com page that has a map.
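
To try the neighbouring-tag approach without a network request, here is a minimal sketch that runs against an inline HTML snippet. The SAMPLE_HTML markup and the find_coordinates_text helper are hypothetical and only loosely modelled on the page described above; passing 'p' to find_next is likewise an assumption about the page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for illustration; the real page may differ.
SAMPLE_HTML = """
<table>
  <tr><td>
    <p><b>Battery Adair</b> was a coastal gun battery.</p>
    <p><b>Maps &amp; Images</b></p>
    <p>Lat: 24.5477038 Long: -81.8104541</p>
  </td></tr>
</table>
"""


def find_coordinates_text(html):
    """Return the text of the first <p> after the 'Maps & Images' label, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # The parser decodes &amp; to &, so we match on the decoded text.
    label = soup.find("b", string="Maps & Images")
    if label is None:
        return None
    # find_next walks forward in document order from the label.
    paragraph = label.find_next("p")
    if paragraph is None:
        return None
    return paragraph.get_text(strip=True)


print(find_coordinates_text(SAMPLE_HTML))
# Lat: 24.5477038 Long: -81.8104541
```

Returning None when the label is missing lets the caller skip pages without a map instead of hitting an AttributeError.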

t.m.adam answered Dec 31 '22 13:12


You can use the re module to match partial text inside a tag.

import re
import requests
from bs4 import BeautifulSoup

url = 'http://www.fortwiki.com/Battery_Adair'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

# Raw string so that \s and \d reach the regex engine unchanged
lat_long = soup.find('p', string=re.compile(r'Lat:\s\d+\.\d+\sLong:')).text
print(lat_long)
# Lat: 24.5477038 Long: -81.8104541
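
If the coordinates are needed as numbers rather than as one string, the extracted text can be split with a second regular expression. This is a minimal sketch that assumes the text always follows the "Lat: <number> Long: <number>" shape shown above; COORD_PATTERN and parse_lat_long are hypothetical names introduced for illustration.

```python
import re

# Assumed format: "Lat: <number> Long: <number>", with an optional minus sign.
COORD_PATTERN = re.compile(
    r"Lat:\s*(-?\d+(?:\.\d+)?)\s+Long:\s*(-?\d+(?:\.\d+)?)"
)


def parse_lat_long(text):
    """Return (latitude, longitude) as floats, or None if the text does not match."""
    match = COORD_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


print(parse_lat_long("Lat: 24.5477038 Long: -81.8104541"))
# (24.5477038, -81.8104541)
```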
Keyur Potdar answered Dec 31 '22 12:12