Extracting contents from specific meta tags that are not closed using BeautifulSoup

I'm trying to parse out the content of specific meta tags. Here's the structure of the meta tags. The first two are self-closed with a forward slash, but the rest don't have any closing tags. As soon as I reach the third meta tag, the entire contents between the <head> tags are returned. I've also tried soup.findAll(text=re.compile('keyword')), but that returns nothing because keyword is an attribute of the meta tag, not text inside it.

<meta name="csrf-param" content="authenticity_token"/>
<meta name="csrf-token" content="OrpXIt/y9zdAFHWzJXY2EccDi1zNSucxcCOu8+6Mc9c="/>
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'>
<meta content='en_US' http-equiv='Content-Language'>
<meta content='c2y_K2CiLmGeet7GUQc9e3RVGp_gCOxUC4IdJg_RBVo' name='google-site-verification'>
<meta content='initial-scale=1.0,maximum-scale=1.0,width=device-width' name='viewport'>
<meta content='notranslate' name='google'>
<meta content="Learn about Uber's product, founders, investors and team. Everyone's Private Driver - Request a car from any mobile phone—text message, iPhone and Android apps. Within minutes, a professional driver in a sleek black car will arrive curbside. Automatically charged to your credit card on file, tip included." name='description'>

Here's the code:

import csv
import re
import sys
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

req3 = Request("https://angel.co/uber", headers={'User-Agent': 'Mozilla/5.0'})
page3 = urlopen(req3).read()
soup3 = BeautifulSoup(page3)

## This returns the entire web page since the META tags are not closed
desc = soup3.findAll(attrs={"name":"description"}) 

tcash21 asked Aug 08 '13 19:08



1 Answer

Edited: Added a regex for case-insensitive matching, as suggested by @Albert Chen.

Python 3 Edit:

from bs4 import BeautifulSoup
import re
import urllib.request

page3 = urllib.request.urlopen("https://angel.co/uber").read()
soup3 = BeautifulSoup(page3)

desc = soup3.findAll(attrs={"name": re.compile(r"description", re.I)}) 
print(desc[0]['content'])
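
The case-insensitive match can be checked offline on an inline snippet. This is only a minimal sketch: it assumes the built-in "html.parser" backend (the code above does not name a parser), and SAMPLE is a made-up string in which the attribute value is capitalized as "Description".

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample: unclosed meta tags, one with a capitalized name value.
SAMPLE = """<head>
<meta content='notranslate' name='google'>
<meta content='About the company.' name='Description'>
<title>Example</title>
</head>"""

# Assumption: the stdlib-backed parser; other backends may build a different tree.
soup = BeautifulSoup(SAMPLE, "html.parser")

# A plain string is compared against the attribute value exactly (case matters).
exact = soup.find_all("meta", attrs={"name": "description"})

# A compiled pattern with re.I is searched against the value, ignoring case.
loose = soup.find_all("meta", attrs={"name": re.compile(r"description", re.I)})

print(len(exact), len(loose))
print(loose[0]["content"])
```

Passing "meta" as the tag name restricts the search to meta tags, so other elements that happen to carry a name attribute are not picked up.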

Original Python 2 version (I'm not sure it will work for every page):

from bs4 import BeautifulSoup
import re
import urllib

page3 = urllib.urlopen("https://angel.co/uber").read()
soup3 = BeautifulSoup(page3)

desc = soup3.findAll(attrs={"name": re.compile(r"description", re.I)}) 
print(desc[0]['content'].encode('utf-8'))

Yields:

Learn about Uber's product, founders, investors and team. Everyone's Private Driver - Request a car from any mobile phone—text message, iPhone and Android apps. Within minutes, a professional driver in a sleek black car will arrive curbside. Automatically charged to your credit card on file, tip included.
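
On a page that has no description tag, desc[0] raises an IndexError. One possible way to guard against that is sketched below. The helper name meta_content and the HTML string are hypothetical, and "html.parser" is again an assumed parser choice.

```python
import re
from bs4 import BeautifulSoup


def meta_content(html, name):
    """Return the content of the first <meta> whose name matches, else None."""
    # Assumption: the stdlib-backed parser is good enough for this markup.
    soup = BeautifulSoup(html, "html.parser")
    # Anchor the pattern so "description" does not also match longer names.
    pattern = re.compile(r"^%s$" % re.escape(name), re.I)
    tag = soup.find("meta", attrs={"name": pattern})
    # find() returns None when nothing matches; .get() tolerates a missing content.
    return tag.get("content") if tag is not None else None


# Hypothetical input modeled on the tags in the question (no description tag).
HTML = """<head>
<meta name="csrf-param" content="authenticity_token"/>
<meta content='en_US' http-equiv='Content-Language'>
<meta content='notranslate' name='google'>
</head>"""

print(meta_content(HTML, "google"))
print(meta_content(HTML, "description"))
```

Returning None instead of raising lets the caller decide how to treat pages that lack the tag.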

sihrc answered Nov 09 '22 07:11