Efficient way to extract text from between tags

Tags: python, regex, extract

Suppose I have something like this:

var = '<li> <a href="/...html">Energy</a>
      <ul>
      <li> <a href="/...html">Coal</a> </li>
      <li> <a href="/...html">Oil </a> </li>
      <li> <a href="/...html">Carbon</a> </li>
      <li> <a href="/...html">Oxygen</a> </li'

What is the best (most efficient) way to extract the text in between the tags? Should I use regex for this? My current technique relies on splitting the string on li tags and using a for loop; I'm just wondering whether there is a faster way to do this.

219

asked Jun 19 '13 01:06

Max Kim

2 Answers

The recommended way to extract information from a markup language is to use a parser; Beautiful Soup, for instance, is a good choice. Avoid using regular expressions for this: they're not the right tool for the job!
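As a rough illustration of the parser approach without any third-party dependency, here is a minimal sketch using Python 3's standard-library `html.parser` module. The names `AnchorTextExtractor` and `extract_anchor_texts` are made up for this example, and the sample markup is a shortened version of the question's string with the tags closed.

```python
from html.parser import HTMLParser


class AnchorTextExtractor(HTMLParser):
    """Collect the text inside every <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.texts = []          # one entry per <a> element found
        self._in_anchor = False  # True while between <a> and </a>
        self._buffer = []        # text pieces of the current <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_anchor:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            # Join the pieces and drop surrounding whitespace ("Oil " -> "Oil")
            self.texts.append("".join(self._buffer).strip())
            self._in_anchor = False


def extract_anchor_texts(html):
    """Return the stripped text of each <a> element in the given markup."""
    parser = AnchorTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.texts


sample = '''<li> <a href="/...html">Energy</a>
<ul>
<li> <a href="/...html">Coal</a> </li>
<li> <a href="/...html">Oil </a> </li>
</ul>
</li>'''

print(extract_anchor_texts(sample))  # ['Energy', 'Coal', 'Oil']
```

This only tracks `a` elements; a full-featured library such as Beautiful Soup is likely the more comfortable choice once the extraction rules get more involved.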

164

answered Nov 15 '22 01:11

Óscar López

You can use Beautiful Soup, which is very good for this kind of task. It is straightforward, easy to install, and well documented.

Your example leaves some li tags unclosed. I have corrected that below; this is how you would get the text of every link inside the li tags:

from bs4 import BeautifulSoup

var = '''<li> <a href="/...html">Energy</a></li>
    <ul>
    <li><a href="/...html">Coal</a></li>
    <li><a href="/...html">Oil </a></li>
    <li><a href="/...html">Carbon</a></li>
    <li><a href="/...html">Oxygen</a></li>'''

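# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(var, 'html.parser')

# Print the text of every link
for a in soup.find_all('a'):
    print(a.string)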

It will print:

Energy
Coal
Oil
Carbon
Oxygen

For documentation and more examples, see the Beautiful Soup documentation.

answered Nov 15 '22 02:11

Davi Sampaio
