
Efficient way to extract text from between tags

Suppose I have something like this:

var = '<li> <a href="/...html">Energy</a>
      <ul>
      <li> <a href="/...html">Coal</a> </li>
      <li> <a href="/...html">Oil </a> </li>
      <li> <a href="/...html">Carbon</a> </li>
      <li> <a href="/...html">Oxygen</a> </li'

What is the best (most efficient) way to extract the text between the tags? Should I use a regex for this? My current technique splits the string on the li tags and uses a for loop; I am wondering whether there is a faster way to do this.

Max Kim asked Jun 19 '13 01:06



2 Answers

The recommended way to extract information from a markup language is to use a parser; Beautiful Soup, for instance, is a good choice. Avoid regular expressions for this: they are not the right tool for the job!
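If installing a third-party package is not an option, the same parser-based idea can be sketched with Python's standard-library html.parser module. This is only a minimal sketch under a few assumptions: AnchorTextCollector and extract_link_texts are hypothetical names introduced for illustration, the sample is the question's markup with the li tags closed, and only the text inside a tags is collected.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collects the text found inside every <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._depth = 0    # greater than 0 while the parser is inside an <a> element
        self._chunks = []  # text pieces seen so far for the current <a>
        self.texts = []    # finished link texts, stripped of surrounding whitespace

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.texts.append("".join(self._chunks).strip())
                self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears between <a> and </a>.
        if self._depth:
            self._chunks.append(data)


def extract_link_texts(html):
    """Return the text of every <a> element in the given HTML string."""
    parser = AnchorTextCollector()
    parser.feed(html)
    parser.close()
    return parser.texts


# Sample markup taken from the question, with the li tags closed.
sample = '''<li> <a href="/...html">Energy</a></li>
    <ul>
    <li><a href="/...html">Coal</a></li>
    <li><a href="/...html">Oil </a></li>
    <li><a href="/...html">Carbon</a></li>
    <li><a href="/...html">Oxygen</a></li>'''

print(extract_link_texts(sample))
```

The parser walks the string once and never needs the markup to be split by hand, which is what makes it more robust than splitting on li tags or matching with a regex.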

Óscar López answered Nov 15 '22 01:11


You can use Beautiful Soup, which is very good for this kind of task. It is straightforward, easy to install, and extensively documented.

Your example has some li tags that are not closed. I have corrected them, and this is how you would get the text of every link inside the li tags:

from bs4 import BeautifulSoup

var = '''<li> <a href="/...html">Energy</a></li>
    <ul>
    <li><a href="/...html">Coal</a></li>
    <li><a href="/...html">Oil </a></li>
    <li><a href="/...html">Carbon</a></li>
    <li><a href="/...html">Oxygen</a></li>'''

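# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(var, 'html.parser')

# Print the text of every <a> tag (Python 3 print function).
for a in soup.find_all('a'):
    print(a.string)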

It will print:

Energy
Coal
Oil
Carbon
Oxygen

For documentation and more examples, see the Beautiful Soup documentation.

Davi Sampaio answered Nov 15 '22 02:11