
Struggling to scrape a telegram public channel with BeautifulSoup

I'm practicing web scraping with BeautifulSoup, but I'm struggling to print a dictionary containing the items I've scraped.

The target can be any public Telegram channel (web version). For each post I intend to collect the message text, timestamp, view count and image URL (if an image is attached) and store them in the dictionary.

I've inspected the markup for the four elements. The image URL is not wrapped in an element with its own class or span, so I ended up extracting it with a regex. The other three elements are easy to retrieve.

Let's go step by step:

Importing modules

from bs4 import BeautifulSoup
import requests
import re

Function to get the image URLs from the public channel

def pictures(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    # str() turns the list of matching <a> tags into text so a regex can be applied
    link = str(soup.find_all('a', class_='tgme_widget_message_photo_wrap'))
    # Escaped dots and a non-greedy .*? keep each match to a single URL
    image_url = re.findall(r"https://cdn4\..*?\.jpg", link)
    return image_url
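
A side note on the pattern itself: unescaped dots match any character, and a greedy `.*` extends to the last possible `jpg` on a line, so several URLs can end up fused into one match. The sketch below compares the two behaviours on made-up input (the `example.org` URLs are placeholders, not real Telegram links).

```python
import re

# Made-up input: two image URLs on the same line (placeholders, not real links).
line = (
    "url('https://cdn4.example.org/file/a.jpg') "
    "url('https://cdn4.example.org/file/b.jpg')"
)

# Greedy pattern with unescaped dots: runs from the first URL to the last "jpg".
greedy = re.findall(r"https://cdn4.*.*.jpg", line)

# Escaped dots plus a non-greedy quantifier: stops at the first ".jpg" each time.
tight = re.findall(r"https://cdn4\..*?\.jpg", line)

print(len(greedy))
print(tight)
```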

Soup to get the text message, timestamp and views

url = "https://t.me/s/computer_science_and_programming"
picture_list = pictures(url)  # url must be defined before it is passed in

channel = requests.get(url).text
soup = BeautifulSoup(channel, 'lxml')
tgpost = soup.find_all('div', class_ ='tgme_widget_message')
full_message = {}
for content in tgpost:
    full_message['views'] = content.find('span', class_ = 'tgme_widget_message_views').text
    full_message['timestamp'] = content.find('time', class_ = 'time').text
    full_message['text'] = content.find('div', class_ = 'tgme_widget_message_text').text
    print(full_message)

I would really appreciate it if someone could help me. I'm new to Python and I don't know how to:

  1. Check if the post contains an image and if so, add it to the dictionary
  2. Print the dictionary including image_url as key and the url as value for each post.

Thank you very much

arkanjie asked Oct 27 '25 14:10


1 Answer

I think you want something like this.

from bs4 import BeautifulSoup
import re
import requests

url = "https://t.me/s/computer_science_and_programming"

channel = requests.get(url).text
soup = BeautifulSoup(channel, 'lxml')
tgpost = soup.find_all('div', class_='tgme_widget_message')

for content in tgpost:
    # Start a fresh dictionary for every post so no key leaks over from the previous one
    full_message = {}
    full_message['views'] = content.find('span', class_='tgme_widget_message_views').text
    full_message['timestamp'] = content.find('time', class_='time').text
    full_message['text'] = content.find('div', class_='tgme_widget_message_text').text

    # Only posts with a photo contain this <a> element
    photo = content.find('a', class_='tgme_widget_message_photo_wrap')
    if photo is not None:
        # Escaped dots and a non-greedy .*? keep the match to a single URL
        match = re.search(r"https://cdn4\..*?\.jpg", str(photo))
        if match:
            full_message['url_image'] = match.group(0)

    print(full_message)
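
A variation on the same idea, as a minimal sketch rather than a definitive implementation: it builds one dictionary per post, returns them as a list, and tolerates posts that lack a text, views or photo element. To stay runnable offline it parses a small made-up HTML sample. The markup and the `example.org` URL are assumptions, as are the helper names `text_or_none` and `parse_posts`, and it assumes the image URL sits in the anchor's inline `style` attribute.

```python
from bs4 import BeautifulSoup
import re

# Hypothetical markup loosely modelled on the class names used above;
# the real page may be structured differently.
SAMPLE_HTML = """
<div class="tgme_widget_message">
  <a class="tgme_widget_message_photo_wrap"
     style="background-image:url('https://cdn4.example.org/file/abc123.jpg')"></a>
  <div class="tgme_widget_message_text">Post with a photo</div>
  <span class="tgme_widget_message_views">1.2K</span>
  <time class="time">10:15</time>
</div>
<div class="tgme_widget_message">
  <div class="tgme_widget_message_text">Text-only post</div>
  <span class="tgme_widget_message_views">870</span>
  <time class="time">11:40</time>
</div>
"""

# Escaped dots and a non-greedy quantifier keep each match to one URL.
IMAGE_URL = re.compile(r"https://cdn4\..*?\.jpg")


def text_or_none(parent, tag, css_class):
    """Return the stripped text of the first matching child, or None if it is absent."""
    node = parent.find(tag, class_=css_class)
    return node.get_text(strip=True) if node is not None else None


def parse_posts(html):
    """Build one dictionary per post; 'url_image' is added only when a photo exists."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for content in soup.find_all("div", class_="tgme_widget_message"):
        post = {
            "views": text_or_none(content, "span", "tgme_widget_message_views"),
            "timestamp": text_or_none(content, "time", "time"),
            "text": text_or_none(content, "div", "tgme_widget_message_text"),
        }
        photo = content.find("a", class_="tgme_widget_message_photo_wrap")
        if photo is not None:
            # Assumption: the URL is embedded in the inline style attribute.
            match = IMAGE_URL.search(photo.get("style", ""))
            if match:
                post["url_image"] = match.group(0)
        posts.append(post)
    return posts


for post in parse_posts(SAMPLE_HTML):
    print(post)
```

To try it against the live channel, passing `requests.get(url).text` to `parse_posts` should work, provided the real markup matches these assumptions.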

Arlème Johnson answered Oct 29 '25 04:10


