Extract text between two different tags beautiful soup

Tags: python, html, python-3.x, beautifulsoup, web-scraping

I'm trying to extract the text content of the article from this web page.

I only want the article content, not the "About the author" part.

The problem is that the article content isn't wrapped in its own tag such as a <div>; it is all in <p> tags, so I can't select it separately. When I extract all the <p> tags, I also get the "About the author" part. I have to scrape many pages from this website. Is there a way to do this using Beautiful Soup?

I'm currently trying:

p_tags = soup.find_all('p')
for row in p_tags:
    print(row)

asked Jul 01 '18 06:07

Sandeep Jaiswal

2 Answers

All the paragraphs you want are inside the <div class="td-post-content"> tag, along with the paragraphs containing the author information. However, the <p> tags you need are direct children of this <div>, while the unwanted <p> tags are not direct children (they are nested inside other <div> tags).

So you can pass recursive=False to find_all() to match only the direct children.

Code:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

r = requests.get('https://www.the-blockchain.com/2018/06/29/mcafee-labs-report-6x-increase-in-crypto-mining-malware-incidents-in-q1-2018/', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

container = soup.find('div', class_='td-post-content')
for para in container.find_all('p', recursive=False):
    print(para.text)

Output:

Cybersecurity giant McAfee released its McAfee Labs Threat Report: June 2018 on Wednesday, outlining the growth and trends of new malware and cyber threats in Q1 2018. According to the report, coin mining malware saw a 623 percent growth in the first quarter of 2018, infecting 2.9 million machines in that period. McAfee Labs counted 313 publicly disclosed security incidents in the first three months of 2018, a 41 percent increase over the previous quarter. In particular, incidents in the healthcare sector rose 57 percent, with a significant portion involving Bitcoin-based ransomware that healthcare institutions were often compelled to pay.
Chief Scientist at McAfee Raj Samani said, “There were new revelations this quarter concerning complex nation-state cyber-attack campaigns targeting users and enterprise systems worldwide. Bad actors demonstrated a remarkable level of technical agility and innovation in tools and tactics. Criminals continued to adopt cryptocurrency mining to easily monetize their criminal activity.”
Sizeable criminal organizations are responsible for many of the attacks in recent months. In January, malware dubbed Golden Dragon attacked organizations putting together the Pyeongchang Winter Olympics in South Korea, using a malicious word attachment to install a script that would encrypt and send stolen data to an attacker’s command center. The Lazarus cybercrime ring launched a highly sophisticated Bitcoin phishing campaign called HaoBao that targeted global financial organizations, sending an email attachment that would scan for Bitcoin activity, credentials and mining data.
Chief Technology Officer at McAfee Steve Grobman said, “Cybercriminals will gravitate to criminal activity that maximizes their profit. In recent quarters we have seen a shift to ransomware from data-theft,  as ransomware is a more efficient crime. With the rise in value of cryptocurrencies, the market forces are driving criminals to crypto-jacking and the theft of cryptocurrency. Cybercrime is a business, and market forces will continue to shape where adversaries focus their efforts.”
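
Since the question mentions scraping many pages, the same idea can be wrapped in a function that takes an HTML string. The following is a minimal sketch, not a definitive implementation: SAMPLE_HTML, the author-box class, and both function names are hypothetical stand-ins, and it assumes every page uses the same td-post-content container. It also shows the CSS child combinator (>) as an equivalent way to match only direct children.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page structure (not the site's real markup).
SAMPLE_HTML = """
<div class="td-post-content">
  <p>First article paragraph.</p>
  <p>Second article paragraph.</p>
  <div class="author-box">
    <p>About the author</p>
  </div>
</div>
"""


def extract_article_paragraphs(html):
    """Return the text of <p> tags that are direct children of the container."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="td-post-content")
    if container is None:
        # Assumption: a page without the container has no article text.
        return []
    # recursive=False limits the search to direct children of the container.
    return [p.get_text(strip=True)
            for p in container.find_all("p", recursive=False)]


def extract_with_selector(html):
    """Same result using the CSS child combinator instead of recursive=False."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(strip=True)
            for p in soup.select("div.td-post-content > p")]


paragraphs = extract_article_paragraphs(SAMPLE_HTML)
print(paragraphs)
```

For each page, the HTML fetched with requests (r.text in the code above) could be passed to either function; the nested "About the author" paragraph is skipped in both cases.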

answered Oct 19 '22 21:10

Keyur Potdar

You need to use Selenium. I tried this with requests and it didn't work, because the data is loaded with JavaScript. Load the page with Selenium first, then parse the result with bs4.

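import bs4
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the chromedriver path through a Service object;
# older releases accepted it as the first positional argument.
driver = webdriver.Chrome(service=Service('/usr/local/bin/chromedriver'))
website = "https://www.the-blockchain.com/2018/06/29/mcafee-labs-report-6x-increase-in-crypto-mining-malware-incidents-in-q1-2018/"
driver.get(website)
html = driver.page_source  # HTML of the page as rendered by the browser
driver.quit()  # close the browser once the HTML is captured
soup = bs4.BeautifulSoup(html, "html.parser")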

elements = soup.select('#wpautbox_latest-post > ul > li')
for elem in elements:
    print(elem.text)

Output:

McAfee Labs Report 6x Increase in Crypto Mining Malware Incidents in Q1 2018 - June 29, 2018
Facebook Updates Policy To Allow Vetted Crypto Businesses to Advertise, ICOs Still Banned - June 27, 2018
Following in Vitalik’s Footsteps? Polkadot’s Habermeier Awarded Thiel Fellowship - June 26, 2018
And many other article titles
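
Before starting a browser for every page, it may be worth checking whether the selector already matches in the HTML returned by a plain HTTP request. The following is a rough sketch under stated assumptions: selector_matches is a hypothetical helper, and the two HTML strings are made-up examples of a fully served page and an empty shell that JavaScript would fill in later.

```python
from bs4 import BeautifulSoup


def selector_matches(html, css_selector):
    """Return True if the CSS selector matches at least one element in html."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.select_one(css_selector) is not None


# Hypothetical examples: one page with the list in the static HTML,
# one where the container is empty until JavaScript runs.
STATIC_HTML = '<div id="wpautbox_latest-post"><ul><li>Post A</li></ul></div>'
EMPTY_SHELL = '<div id="wpautbox_latest-post"></div>'

SELECTOR = "#wpautbox_latest-post > ul > li"
found_in_static = selector_matches(STATIC_HTML, SELECTOR)
found_in_shell = selector_matches(EMPTY_SHELL, SELECTOR)
print(found_in_static, found_in_shell)
```

If the check returns False for the HTML fetched with requests, that suggests the elements are added by JavaScript and a browser-driven approach such as Selenium is needed.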

answered Oct 19 '22 22:10

Druta Ruslan
