Given a news article webpage (from any major news source such as times or bloomberg), I want to identify the main article content on that page and throw out the other misc elements such as ads, menus, sidebars, user comments.
What's a generic way of doing this that will work on most major news sites?
What are some good tools or libraries for data mining? (preferably python based)
Inspecting the HTML of a Website Firstly load the web page you want to scrape from. Right click on the page and select inspect. This will load the HTML of the website which shows the make-up of the website. Select the tool at the top left of the pane to highlight the code responsible for each part of the web page.
You can use the Attribute selector to scrape these hidden tags from HTML. You can write your selector manually and then enter the “content” in the attribute name option to scrape efficiently.
There are a number of ways to do it, but, none will always work. Here are the two easiest:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With