python: find html tags and replace their attributes [duplicate]

Question

I need to do the following:

Take an HTML document
Find every occurrence of the 'img' tag
Take its 'src' attribute
Pass the URL found to a processing step
Change the 'src' attribute to the new value
Do all of this with Python 2.7

P.S. I've heard about lxml and BeautifulSoup. Which would you recommend for solving this problem? Or would it be better to use regexes, or something else?

Torxed · Accepted Answer

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html_string)
# Update the src attribute of every <img> tag
for img in soup.findAll('img'):
    img['src'] = 'New src'
html_string = str(soup)

I don't particularly like BeautifulSoup, but it does the job. Try not to over-engineer your solution if you don't have to; this is one of the simpler ways to solve a general problem.

That said, building for the future is equally important, but all six of your requirements boil down to one: "I want to change the 'src' of all images to X."
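To cover the "pass the URL to processing" step from the question as well, here is a minimal sketch. It assumes the newer bs4 package (imported as `from bs4 import BeautifulSoup`) rather than the BeautifulSoup 3 import above, and `process_url` is a hypothetical placeholder for whatever processing you actually need:

```python
from bs4 import BeautifulSoup


def process_url(url):
    # Hypothetical processing step (an assumption, not from the question):
    # serve every image from a CDN host instead.
    return 'https://cdn.example.com/' + url.lstrip('/')


def rewrite_img_srcs(html_string):
    soup = BeautifulSoup(html_string, 'html.parser')
    # src=True skips <img> tags that have no src attribute at all
    for img in soup.find_all('img', src=True):
        img['src'] = process_url(img['src'])
    return str(soup)


sample = '<p><img src="/a.png" alt="A"><img alt="no src"><a href="/x">x</a></p>'
result = rewrite_img_srcs(sample)
```

Only the first image is rewritten here; the image without a 'src' and the link's 'href' are left alone.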

unutbu · Answer

Using lxml,

import lxml.html as LH
root = LH.fromstring(html_string)
for el in root.iter('img'):
    el.attrib['src'] = 'new src'
print(LH.tostring(root, pretty_print=True))

Parsing HTML with regex is generally a bad idea. Using an HTML parser like BeautifulSoup or lxml.html is a better idea.

One attraction of using BeautifulSoup is that it has a familiar Python interface. There are lots of functions for navigation: find_all, find_next, find_previous, find_parent, find_next_siblings, etc.

Another point in favor of BeautifulSoup is that it can sometimes parse broken HTML (by inserting missing closing tags, for example) that lxml handles less gracefully. lxml.html also recovers from many errors, but it is stricter and can raise an exception on badly malformed input.

In contrast to the multitude of functions provided by the BeautifulSoup API, lxml mainly uses the XPath mini-language for navigation. Navigation using XPath tends to be more succinct than the equivalent using BeautifulSoup. The catch is that you have to learn XPath. lxml is also much faster than BeautifulSoup.
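As a rough illustration of XPath navigation for this task, the sketch below selects only the 'img' elements that have a 'src' attribute and rewrites them. The sample markup and the `process_url` helper are made up for the example:

```python
import lxml.html as LH


def process_url(url):
    # Hypothetical processing step (an assumption): upgrade http to https.
    return url.replace('http://', 'https://', 1)


root = LH.fromstring(
    '<div><img src="http://example.com/a.png"><img alt="no src"></div>'
)

# '//img[@src]' matches every <img> that carries a src attribute
for img in root.xpath('//img[@src]'):
    img.set('src', process_url(img.get('src')))

# '//img/@src' returns the attribute values themselves
new_srcs = root.xpath('//img/@src')
html_out = LH.tostring(root, encoding='unicode')
```

The predicate `[@src]` does the filtering that would otherwise need an explicit `if` inside the loop.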

So if you are just starting out, BeautifulSoup may be easier to use instantly, but in the end I believe lxml is more pleasant to work with.

Games Brainiac · Answer

Just throwing this answer in there, if you wanted to use regexes:

html = """
<!doctype html>
<html lang="en-US">
<head>
    <meta charset="UTF-8">
    <title></title>
</head>
<body>
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.10.1/jquery.min.js"></script>
<script src="http://ajax.googleapis.com/ajax/libs/jquery/2.0.1/jquery.min.js"></script>
</body>
</html>
"""

import re

# Matches every double-quoted src attribute, on any tag (not only <img>)
find = re.compile(r'src="[^"]*"')

print(find.sub('src="changed"', html))
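If you do go the regex route, a slightly narrower sketch is to match only 'img' tags and pass a function to `sub` so each URL can be processed individually. The pattern and the `process_url` helper below are illustrative assumptions; the pattern only handles double-quoted 'src' values and will still break on unusual markup, which is why a parser is usually the safer choice:

```python
import re

# Group 1: everything from '<img' up to the opening quote of src
# Group 2: the URL itself
# Group 3: the closing quote
IMG_SRC = re.compile(r'(<img\b[^>]*?\ssrc=")([^"]*)(")', re.IGNORECASE)


def process_url(url):
    # Hypothetical processing step (an assumption): append a version marker.
    return url + '?v=2'


def rewrite(html):
    # The replacement function receives each match and returns the new text
    return IMG_SRC.sub(
        lambda m: m.group(1) + process_url(m.group(2)) + m.group(3),
        html,
    )


sample = '<script src="app.js"></script><img alt="x" src="a.png">'
result = rewrite(sample)
```

Here the script's 'src' is left untouched because the pattern must start at '<img'.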

Alexander Zhukov · Answer

Here is an lxml approach:

import lxml.html

filename = 'your_html_filename.html'
document = lxml.html.parse(filename)
tag = 'your_tag_name'
elements = document.xpath('//{}'.format(tag))

for e in elements:
    e.attrib['src'] = 'new value'

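# Serialize the modified tree back to HTML (str(document) would only give the object's repr)
result = lxml.html.tostring(document)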

For your particular problem, there is no clear advantage to using either BeautifulSoup or lxml. Which one is the better fit depends on the rest of your project.

Tags: python, html, parsing, tags

Victor Shytiuk
