
Data scraping on Discord using Python

I'm currently trying to learn web scraping and decided to scrape some Discord data. Code follows:

import requests
import json

def retrieve_messages(channelid):
    num=0
    headers = {
        'authorization': 'here we enter the authorization code'
    }
    r = requests.get(
        f'https://discord.com/api/v9/channels/{channelid}/messages?limit=100',headers=headers
        )
    jsonn = json.loads(r.text)
    for value in jsonn:
        print(value['content'], '\n')
        num=num+1
    print('number of messages we collected is',num)

retrieve_messages('channel id goes here')

The problem: the limit in messages?limit=100 apparently only accepts values between 0 and 100, meaning the maximum number of messages I can get in one request is 100. I tried changing this number to 900, for example, to scrape more messages, but then I get the error TypeError: string indices must be integers.

Any ideas on how I could get, possibly, all the messages in a channel?

Thank you very much for reading!

asked Oct 26 '25 by WildDracula

1 Answer

APIs that return large collections of records are almost always capped at some number of items per request; otherwise, a request for a huge number of items could exhaust the server's memory.

For that purpose, most APIs implement pagination using limit, before and after parameters (a sample request follows the list), where:

  • limit: the maximum number of messages to return in a single request
  • before: get messages posted before this message ID
  • after: get messages posted after this message ID
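
For instance, a single page request for the 50 messages older than some already-seen message could look like this (the channel and message ids below are made-up placeholders):

    GET https://discord.com/api/v9/channels/1234567890/messages?limit=50&before=9876543210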

The Discord API is no exception, as its documentation tells us. Here's how to do it:

First, you will need to query the data multiple times. For that, you can use a while loop. Make sure to add a condition that prevents the loop from running indefinitely - here, a check for whether any messages are left:

    while True:
        # ... requests code
        jsonn = json.loads(r.text)
        if len(jsonn) == 0:
            break

        for value in jsonn:
            print(value['content'], '\n')
            num += 1

Next, define a variable that holds the id of the last message you fetched, and update it with each message you print:

def retrieve_messages(channelid):
    last_message_id = None

    while True:
        # ...
        
        for value in jsonn:
            print(value['content'], '\n')
            last_message_id = value['id']
            num += 1

On the first run, last_message_id is None; on every subsequent request, it holds the id of the last message you printed.

Use that to build your query:

    while True:
        query_parameters = f'limit={limit}'
        if last_message_id is not None:
            query_parameters += f'&before={last_message_id}'

        r = requests.get(
            f'https://discord.com/api/v9/channels/{channelid}/messages?{query_parameters}',
            headers=headers
        )

        # ...

Note: the Discord API returns the newest messages first, so you have to page backwards through history with the before parameter.
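
If you wanted to walk a channel oldest-to-newest instead, the same loop works with after. Here's a minimal sketch under the same assumptions as above (headers, channelid, limit and num already defined); it takes the maximum id on each page rather than relying on page ordering, since snowflake ids grow over time:

    last_message_id = None

    while True:
        query_parameters = f'limit={limit}'
        if last_message_id is not None:
            # page forward: only fetch messages newer than the newest one seen so far
            query_parameters += f'&after={last_message_id}'

        r = requests.get(
            f'https://discord.com/api/v9/channels/{channelid}/messages?{query_parameters}',
            headers=headers
        )
        jsonn = json.loads(r.text)
        if len(jsonn) == 0:
            break

        for value in jsonn:
            print(value['content'], '\n')
            num += 1

        # snowflake ids increase with time, so the largest id on the page
        # belongs to the newest message - use it as the cursor for the next page
        last_message_id = max(jsonn, key=lambda m: int(m['id']))['id']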

Here's a fully working example of your code:

import requests
import json

def retrieve_messages(channelid):
    num = 0
    limit = 10  # Discord caps this at 100 per request

    headers = {
        'authorization': 'auth header here'
    }

    last_message_id = None

    while True:
        # always pass limit; pass before once we have a cursor to page from
        query_parameters = f'limit={limit}'
        if last_message_id is not None:
            query_parameters += f'&before={last_message_id}'

        r = requests.get(
            f'https://discord.com/api/v9/channels/{channelid}/messages?{query_parameters}',
            headers=headers
        )
        jsonn = json.loads(r.text)

        # an empty page means we've reached the beginning of the channel
        if len(jsonn) == 0:
            break

        for value in jsonn:
            print(value['content'], '\n')
            # the API returns newest first, so the last message printed
            # is the oldest one on this page
            last_message_id = value['id']
            num += 1

    print('number of messages we collected is', num)

retrieve_messages('channel id here')
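
Two things worth guarding against in practice: paging through a big channel will eventually hit Discord's rate limiting, and a bad authorization token makes the API return an error object instead of a list of messages - looping over that object is exactly what raised the TypeError in the question. A minimal sketch of both checks, to place right after the requests.get call inside the loop (it assumes the v9 behaviour where a 429 response body carries a retry_after value in seconds, and it needs import time at the top of the file):

        if r.status_code == 429:
            # rate limited: wait as long as Discord asks, then retry the same page
            time.sleep(r.json().get('retry_after', 1))
            continue
        if r.status_code != 200:
            # e.g. an invalid token returns an error object rather than a list;
            # stop with a readable message instead of a confusing TypeError
            raise RuntimeError(f'request failed ({r.status_code}): {r.text}')
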
answered Oct 28 '25 by Zyy

