
Python BeautifulSoup replace img src

I'm trying to parse HTML content from a site and change the a href and img src attributes. The a href values change successfully, but the img src values don't.

The new URL ends up in my variable but not in the HTML (post_content). The output still contains the original URL:

<p><img alt="alt text" src="https://lifehacker.ru/wp-content/uploads/2016/08/15120903sa_d2__1471520915-630x523.jpg" title="Title"/></p>

instead of the expected http://site.ru... URL:

<p><img alt="alt text" src="http://site.ru/wp-content/uploads/2016/08/15120903sa_d2__1471520915-630x523.jpg" title="Title"/></p>

My code:

if "app-store" not in url:
        r = requests.get("https://lifehacker.ru/2016/08/23/kak-vybrat-trimmer/")
        soup = BeautifulSoup(r.content)

        post_content = soup.find("div", {"class", "post-content"})
        for tag in post_content():
            for attribute in ["class", "id", "style", "height", "width", "sizes"]:
                del tag[attribute]

        for a in post_content.find_all('a'):
            a['href'] = a['href'].replace("https://lifehacker.ru", "http://site.ru")

        for img in post_content.find_all('img'):
            img_urls = img['src']
            if "https:" not in img_urls:
                img_urls="http:{}".format(img_urls)
            thumb_url = img_urls.split('/')
            urllib.urlretrieve(img_urls, "/Users/kr/PycharmProjects/education_py/{}/{}".format(folder_name, thumb_url[-1]))

            file_url = "/Users/kr/PycharmProjects/education_py/{}/{}".format(folder_name, thumb_url[-1])
            data = {
                'name': '{}'.format(thumb_url[-1]),
                'type': 'image/jpeg',
            }

            with open(file_url, 'rb') as img:
                data['bits'] = xmlrpc_client.Binary(img.read())


            response = client.call(media.UploadFile(data))

            attachment_url = response['url']


            img_urls = img_urls.replace(img_urls, attachment_url)



        [s.extract() for s in post_content('script')]
        post_content_insert = bleach.clean(post_content)
        post_content_insert = post_content_insert.replace('&lt;', '<')
        post_content_insert = post_content_insert.replace('&gt;', '>')

        print post_content_insert
Konstantin Rusanov asked Oct 18 '22 02:10



1 Answer

It looks like you never assign img_urls back to img['src']. Rebinding the img_urls variable does not touch the tag, so do the assignment at the end of the loop body:

img_urls = img_urls.replace(img_urls, attachment_url)
img['src'] = img_urls
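To see why the extra assignment matters, here is a minimal sketch in Python 3. It assumes a plain dict as a stand-in for the parsed img tag (a real BeautifulSoup Tag supports the same img['src'] read and write syntax), and both URLs are placeholders rather than values from the question.

```python
# Stand-in for a parsed <img> tag. This dict is an assumption for
# illustration; a real bs4 Tag accepts the same tag['src'] syntax.
img = {"src": "https://lifehacker.ru/wp-content/uploads/pic.jpg"}
attachment_url = "http://site.ru/wp-content/uploads/pic.jpg"  # placeholder

img_urls = img["src"]        # img_urls now refers to the same string
img_urls = attachment_url    # rebinds the local name only
unchanged = img["src"]       # the tag still holds the original URL

img["src"] = attachment_url  # writes the new value into the tag itself
```

Strings are immutable, so nothing done to img_urls can change what img['src'] holds. Only the final assignment updates the tag.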

But first, change your with statement so that it uses a name other than img for the file object. Right now the file object shadows the DOM element from the for loop, so you can no longer access the tag:

        with open(file_url, 'rb') as some_file:
            data['bits'] = xmlrpc_client.Binary(some_file.read())
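The shadowing problem can be reproduced without any network access. This is a rough sketch in Python 3 under stated assumptions: the dicts stand in for BeautifulSoup img tags, io.BytesIO stands in for the opened image file, and upload() is a hypothetical placeholder for the XML-RPC call, not part of any real library.

```python
import io

# Stand-ins for bs4 <img> tags (assumption for illustration).
tags = [{"src": "a.jpg"}, {"src": "b.jpg"}]

def upload(file_obj):
    """Hypothetical placeholder for the upload; returns a fake URL."""
    file_obj.read()
    return "http://site.ru/uploaded.jpg"

# Problem: reusing the name `img` for the file object replaces the tag.
img = tags[0]
with io.BytesIO(b"fake image bytes") as img:
    upload(img)
try:
    img["src"] = "http://site.ru/uploaded.jpg"  # img is the file object now
    shadowed = False
except TypeError:
    shadowed = True  # file objects do not support item assignment

# Fix: a distinct name keeps `img` bound to the tag after the with block.
for img in tags:
    with io.BytesIO(b"fake image bytes") as image_file:
        attachment_url = upload(image_file)
    img["src"] = attachment_url
```

With the distinct name, both stand-in tags end up with the uploaded URL, which is the behaviour the question was after.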
Kevin answered Oct 21 '22 00:10
