
python: find html tags and replace their attributes [duplicate]

I need to do the following:

  1. take html document
  2. find every occurrence of 'img' tag
  3. take their 'src' attribute
  4. pass the found url to processing
  5. change the 'src' attribute to the new one
  6. do all this stuff with Python 2.7

P.S. I've heard about lxml and BeautifulSoup. Which of them would you recommend for this? Or would regexes be better, or something else entirely?
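The six steps above can be sketched end-to-end with nothing but the standard library, in case you want to avoid extra dependencies. This is only a sketch: Python 3 names are shown (on 2.7 the module is called `HTMLParser`), and `process` is a placeholder for whatever your real processing step is.

```python
# Minimal sketch using only the standard library.
# Python 3 shown; on Python 2.7 import from HTMLParser instead.
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src':
                    self.srcs.append(value)

def process(url):
    # Placeholder for your real processing step.
    return url + '?processed=1'

html_doc = '<p><img src="a.png"><img src="b.png"></p>'
collector = ImgSrcCollector()
collector.feed(html_doc)
for old in collector.srcs:
    # Naive textual replacement; fine for a sketch, but a parser-based
    # rewrite (as in the answers) is safer for real documents.
    html_doc = html_doc.replace('src="%s"' % old, 'src="%s"' % process(old))
print(html_doc)
```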

asked Oct 14 '13 by Victor Shytiuk

4 Answers

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; with bs4 use `from bs4 import BeautifulSoup`
soup = BeautifulSoup(html_string)
for img in soup.findAll('img'):  # <img> tags carry the src attribute
    img['src'] = 'New src'
html_string = str(soup)

I don't particularly like BeautifulSoup, but it does the job for you. Try not to over-engineer your solution if you don't have to; this is one of the simpler ways to handle a general problem like this.

That said, building for the future is equally important, but all six of your requirements can be boiled down to one: "I want to change the 'src' of all images to X."

answered Oct 22 '22 by Torxed


Using lxml,

import lxml.html as LH
root = LH.fromstring(html_string)
for el in root.iter('img'):
    el.attrib['src'] = 'new src'
print(LH.tostring(root, pretty_print=True))

Parsing HTML with regex is generally a bad idea. Using an HTML parser like BeautifulSoup or lxml.html is a better idea.

One attraction of using BeautifulSoup is that it has a familiar Python interface. There are lots of functions for navigation: find_all, find_next, find_previous, find_parent, find_next_siblings, etc.
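For instance, a few of those navigation calls in action (a small sketch assuming the third-party bs4 package is installed; the markup here is made up for illustration):

```python
# Sketch of BeautifulSoup navigation: siblings and parents.
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p>one</p><p>two</p></div>', 'html.parser')
first = soup.find('p')                     # first <p>
sibling = first.find_next_sibling('p')     # the <p> after it
parent = first.find_parent('div')          # the enclosing <div>
print(first.text, sibling.text, parent.name)
```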

Another point in favor of BeautifulSoup is that it can sometimes parse broken HTML (by inserting missing closing tags, for example) where lxml cannot. lxml's XML parser is stricter and simply raises an exception on malformed markup, though lxml.html is more forgiving than lxml.etree.

In contrast to the multitude of functions provided by the BeautifulSoup API, lxml mainly uses the XPath mini-language for navigation. Navigation using XPath tends to be more succinct than the equivalent in BeautifulSoup; the trade-off is that you have to learn XPath. lxml is also much faster than BeautifulSoup.

So if you are just starting out, BeautifulSoup may be easier to pick up, but in the end I believe lxml is more pleasant to work with.

answered Oct 22 '22 by unutbu


Just throwing this answer in there, if you wanted to use regexes:

html = """
<!doctype html>
<html lang="en-US">
<head>
    <meta charset="UTF-8">
    <title></title>
</head>
<body>
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.10.1/jquery.min.js"></script>
<script src="http://ajax.googleapis.com/ajax/libs/jquery/2.0.1/jquery.min.js"></script>
</body>
</html>
"""

import re

find = re.compile(r'src="[^"]*"')

print find.sub('src="changed"', html)
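Note that this pattern rewrites src on every tag, including the <script> tags in the sample above. A slightly safer sketch restricts the substitution to <img> tags by capturing the tag prefix, though it is still fragile compared to a real parser:

```python
import re

html = '<img src="a.png"><script src="app.js"></script>'
# Only rewrite src inside <img ...> tags; leave other tags alone.
find = re.compile(r'(<img\b[^>]*?src=")[^"]*(")')
print(find.sub(r'\1changed\2', html))
```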
answered Oct 22 '22 by Games Brainiac


Here is an lxml approach:

import lxml.html

filename = 'your_html_filename.html'
document = lxml.html.parse(filename)
tag = 'your_tag_name'
elements = document.xpath('//{}'.format(tag))

for e in elements:
    e.attrib['src'] = 'new value'

result = lxml.html.tostring(document)  # str(document) would give only the object's repr, not the HTML

For your particular problem, neither BeautifulSoup nor lxml has a decisive advantage; which one is better will depend on the wider context of your project.

answered Oct 22 '22 by Alexander Zhukov