
Python - BeautifulSoup - Selecting a 'div' with 'class'-attribute shows every div in the html

I'm trying to crawl coinmarketcap.com with BeautifulSoup. (I know there is an API, but for training purposes I want to use BeautifulSoup.) Every piece of information I have crawled so far was easy to select, but now I would like to get the "Holder Statistics", which look like this:

[Image: holder stats]

My testing code for selecting the specific div containing the desired information looks like this:

import requests
from bs4 import BeautifulSoup

url = 'https://coinmarketcap.com/currencies/bitcoin/holders/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
holders = soup.select('div', class_='n0m7sa-0 kkBhMM')
print(holders)

The output of print(holders) is not the expected content of that div, but rather what looks like the whole HTML of the page. I have attached a picture because the output is too long to paste.

[Image: output code]

Does anybody know why this is the case?

MxclFhn asked Nov 18 '25 19:11


1 Answer

Use .select() when you want to pass a CSS selector. In holders = soup.select('div', class_='n0m7sa-0 kkBhMM'), the class_ argument is effectively ignored, so the call matches every <div> on the page regardless of its class. To target that particular class, either switch to .find_all() or put the class into the selector you pass to .select():

holders = soup.select('div.n0m7sa-0.kkBhMM')

or

holders = soup.find_all('div', class_='n0m7sa-0 kkBhMM')
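
To see the difference without hitting the site, here is a minimal sketch on a small inline snippet. The markup is a hypothetical stand-in, not the real coinmarketcap.com page; it only reuses the class names from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; not the real page source.
html = """
<div class="n0m7sa-0 kkBhMM">holder stats</div>
<div class="other">unrelated</div>
<div>no class at all</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A bare tag selector matches every <div>, whatever its class.
all_divs = soup.select("div")

# CSS selector: a <div> carrying both classes.
by_css = soup.select("div.n0m7sa-0.kkBhMM")

# find_all: match the exact string value of the class attribute.
by_find_all = soup.find_all("div", class_="n0m7sa-0 kkBhMM")

print(len(all_divs))                        # 3
print([d.get_text() for d in by_css])       # ['holder stats']
print([d.get_text() for d in by_find_all])  # ['holder stats']
```

Note that the .find_all() form compares against the class attribute as written, so it is sensitive to the order of the class names, while the CSS selector is not.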

Against the live page, though, both the corrected .select() and the .find_all() call return an empty list. That is because that class attribute is not in the HTML returned by the initial request. The site is dynamic: the elements carrying those classes are rendered by JavaScript after the page loads, and requests does not run JavaScript. So you either need to use Selenium to render the page first and then pull the HTML, or see if there is an API that gives you the data source directly.
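
A quick way to confirm this kind of problem is to check whether the class appears at all in the HTML you actually downloaded. The sketch below is only illustrative: class_in_static_html is a hypothetical helper, and the two strings are made-up stand-ins for response.text and for the browser-rendered DOM.

```python
from bs4 import BeautifulSoup


def class_in_static_html(html: str, class_name: str) -> bool:
    """Return True if any element in the given HTML carries class_name."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find(class_=class_name) is not None


# Made-up stand-ins: what requests might receive vs. what a browser shows
# after JavaScript has run. Neither is copied from the real site.
static_html = '<div id="__next"></div><script id="__NEXT_DATA__">{}</script>'
rendered_html = '<div id="__next"><div class="n0m7sa-0 kkBhMM">stats</div></div>'

print(class_in_static_html(static_html, "kkBhMM"))    # False
print(class_in_static_html(rendered_html, "kkBhMM"))  # True
```

In practice you would pass response.text as the first argument. A False result suggests the element is added later by JavaScript.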

There is an API that returns the data:

import requests
import pandas as pd

alpha = ['count', 'ratio']
payload = {
    'id': '1',       # id 1 is Bitcoin
    'range': '7d'}

for each in alpha:
    url = f'https://api.coinmarketcap.com/data-api/v3/cryptocurrency/detail/holders/{each}'
    jsonData = requests.get(url, params=payload).json()['data']['points']

    if each == 'count':
        # One unnamed column (0), indexed by timestamp
        count_df = pd.DataFrame.from_dict(jsonData, orient='index')
        count_df = count_df.rename(columns={0: 'Total Addresses'})
    else:
        # Ratio columns, indexed by the same timestamps
        ratio_df = pd.DataFrame.from_dict(jsonData, orient='index')
        df = count_df.merge(ratio_df, how='left', left_index=True, right_index=True)

df = df.sort_index()
        

Output:

print(df.to_string())
                      Total Addresses  topTenHolderRatio  topTwentyHolderRatio  topFiftyHolderRatio  topHundredHolderRatio
2021-11-24T00:00:00Z         39279627               5.25                  7.19                10.51                  13.26
2021-11-25T00:00:00Z         39255811               5.25                  7.19                10.49                  13.22
2021-11-26T00:00:00Z         39339840               5.25                  7.19                10.51                  13.24
2021-11-27T00:00:00Z         39391849               5.23                  7.11                10.45                  13.18
2021-11-28T00:00:00Z         39505340               5.24                  7.11                10.45                  13.18
2021-11-29T00:00:00Z         39502099               5.24                  7.11                10.43                  13.16
2021-11-30T00:00:00Z         39523000               5.24                  7.11                10.38                  13.12

Your other option: the data is also embedded in a <script> tag in JSON format, so you can pull it out of the HTML from the initial request that way too:

from bs4 import BeautifulSoup
import pandas as pd
import requests
import json
import re

url = 'https://coinmarketcap.com/currencies/bitcoin/holders/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

jsonStr = str(soup.find('script', {'id':'__NEXT_DATA__'}))
jsonStr = re.search(r"({.*})", jsonStr).groups()[0]
jsonData = json.loads(jsonStr)['props']['initialProps']['pageProps']['info']['holders']

df = pd.DataFrame(jsonData).drop('holderList', axis=1).drop_duplicates()

Output:

print(df.to_string())
   holderCount  dailyActive  topTenHolderRatio  topTwentyHolderRatio  topFiftyHolderRatio  topHundredHolderRatio
0     39523000       963625               5.24                  7.11                10.38                  13.12
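
As a variation on the extraction step above, the script's contents can also be read through the tag's .string attribute instead of running a regex over str(tag). This is a sketch on hypothetical inline HTML: the nesting mirrors the key path used in the answer, but the numbers are placeholders, not real data.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical page fragment; the values are placeholders.
html = """
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"initialProps": {"pageProps": {"info": {"holders":
  {"holderCount": 123, "topTenHolderRatio": 5.0}}}}}}
</script>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# .string returns the script's text content, which here is pure JSON.
script = soup.find("script", id="__NEXT_DATA__")
data = json.loads(script.string)

holders = data["props"]["initialProps"]["pageProps"]["info"]["holders"]
print(holders["holderCount"])        # 123
print(holders["topTenHolderRatio"])  # 5.0
```

If the page ever ships without that script tag, soup.find returns None and script.string raises AttributeError, so a None check is worth adding in real code.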

The Social Stats in the Project Info section come from a separate API endpoint:

import requests
import pandas as pd

url = 'https://api.coinmarketcap.com/data-api/v3/project-info/detail?slug=bitcoin'
jsonData = requests.get(url).json()
socialStats = jsonData['data']['socialStats']

row = {}
for k, v in socialStats.items():
    if isinstance(v, dict):
        row.update(v)
    else:
        row.update({k:v})
        
df = pd.DataFrame([row])

Output:

print(df.to_string())
   cryptoId commits contributors  stars  forks watchers              lastCommitAt  members               updatedTime
0         1   31588          836  59687  30692     3881  2021-11-30T00:09:02.000Z  3617460  2021-11-30T16:00:02.365Z
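
The flattening loop above can be tried offline on a small dictionary. In this sketch, flatten_one_level is a hypothetical helper name, and the sample keys "github" and "reddit" are assumptions about how the response might be grouped; only the one-level flattening logic is taken from the answer.

```python
def flatten_one_level(stats: dict) -> dict:
    """Lift the items of nested dicts to the top level; keep other values as-is."""
    row = {}
    for key, value in stats.items():
        if isinstance(value, dict):
            row.update(value)  # the nested dict's own key name is dropped
        else:
            row[key] = value
    return row


# Made-up sample; the grouping keys are assumptions, not the real API schema.
sample = {
    "cryptoId": 1,
    "github": {"commits": 10, "stars": 20},
    "reddit": {"members": 30},
    "updatedTime": "2021-11-30T16:00:02.365Z",
}

row = flatten_one_level(sample)
print(list(row))  # ['cryptoId', 'commits', 'stars', 'members', 'updatedTime']
```

One caveat of this approach: if two nested dicts share a key name, the later one silently overwrites the earlier one.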

chitown88 answered Nov 21 '25 09:11


