
Modify Google News to eliminate unwanted news sources

TL;DR

I want to write some code that filters which articles Google News displays, based on each article's news source.


(The long version)

I have traditionally used the "Personalize" option in Google News to limit which news sources are used (e.g., "don't show articles from FooNews"). However, the personalization options don't let you completely block a news source...the best you can do is tell it to use that source "rarely" (they don't offer a "never" option):

[Screenshot of the Google News personalization options]

Firefox is my browser-of-choice, so I finally sat down to see if I could write some code to tackle the problem, but I wanted to see what my options were and what might be the best choice. Here's what I've done/learned so far:

Option 1: Filter the incoming data

I did some Googling to see if it would be possible to intercept the response data and filter out the unwanted news sources before they are ever rendered in the browser, but couldn't find any clear advice on how to do this. Using Fiddler, I can see a fairly simple list of news sources coming over the wire from Google News to the browser. I'm assuming one of Google's scripts on the page takes that list and builds the HTML to fit the Google News page structure (though I could be wrong on this). In other words, the response stream didn't appear to contain the page HTML...only the list of news sources. If that's true, it would be easiest and cleanest to filter this list before it even reaches the in-page formatting script.

[Screenshot of the Fiddler capture showing the Google News response]

Option 2: Filter the data before the DOM is constructed

I started fiddling with Mutation Observers to see if I could filter out the unwanted news sources by catching and removing those nodes as the DOM was being constructed for the page. I started by using the mutation-summary.js library, but instead of firing an event for every news article as it got added to the DOM, I was only seeing a handful of notifications. Maybe I was doing something wrong, but I need to be notified when every news article is added to the DOM in order to have an effective filter. I was going to look at writing some plain-vanilla JS mutation observers next (skipping the library) but wanted to wait to see if there were better options first.
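A plain MutationObserver version of this idea might look like the sketch below. One possible reason for seeing only a handful of notifications: observers deliver records in batches, and when a whole subtree is inserted in one operation, only its root appears in `addedNodes`, so each added element has to be searched for articles rather than treated as an article itself. The selectors `.source-label` and `.article` are hypothetical placeholders, not real Google News class names, and `FooNews` is just the example source from above.

```javascript
// Placeholder blocklist; "FooNews" is the example name used in the question.
const BLOCKED_SOURCES = ["FooNews"];

// ASSUMPTION: hypothetical selectors. The real ones must be read off the
// live page and will need updating whenever the markup changes.
const SOURCE_SELECTOR = ".source-label";
const ARTICLE_SELECTOR = ".article";

// Case-insensitive, whitespace-tolerant comparison against the blocklist.
function isBlocked(sourceText, blocked) {
  const name = sourceText.trim().toLowerCase();
  return blocked.some((b) => b.toLowerCase() === name);
}

// Remove the article that contains `label` if its source is blocked.
function removeIfBlocked(label, blocked) {
  if (!isBlocked(label.textContent, blocked)) return false;
  const article = label.closest(ARTICLE_SELECTOR);
  if (!article) return false;
  article.remove();
  return true;
}

// Observer callback: check each added element and its whole subtree.
function onMutations(mutations) {
  for (const mutation of mutations) {
    for (const node of mutation.addedNodes) {
      if (node.nodeType !== 1) continue; // skip text and comment nodes
      if (node.matches(SOURCE_SELECTOR)) removeIfBlocked(node, BLOCKED_SOURCES);
      for (const label of node.querySelectorAll(SOURCE_SELECTOR)) {
        removeIfBlocked(label, BLOCKED_SOURCES);
      }
    }
  }
}

// Only start observing when running in a page (e.g. as a content script).
if (typeof MutationObserver !== "undefined" && typeof document !== "undefined") {
  new MutationObserver(onMutations).observe(document, {
    childList: true,
    subtree: true,
  });
}
```

For this to catch articles as they arrive, the script would have to start before the page content is built, for example as an extension content script set to run at document start.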

Option 3: Filter the data after the DOM is constructed

I've heard others suggest this approach:

  1. Use CSS to hide the entire document body
  2. Wait until the DOM is built
  3. Do the filtering by finding the unwanted DOM nodes and deleting them
  4. And finally, unhide the modified body (I guess this trick prevents the page flicker you get from modifying the DOM after it is initially built)
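The four steps above can be sketched roughly as follows. `hideUntilFiltered` is a hypothetical helper name, and the `[data-source="FooNews"]` selector in the usage part is invented purely for illustration; the real filter would have to match the actual page markup.

```javascript
// Hide the page, run `filterFn` once the DOM is parsed, then reveal it again.
// `doc` is normally the global `document`; it is a parameter only so the
// function can be exercised outside a browser.
function hideUntilFiltered(doc, filterFn) {
  // Step 1: hide the body with an injected stylesheet.
  const style = doc.createElement("style");
  style.textContent = "body { visibility: hidden !important; }";
  doc.documentElement.appendChild(style);

  const run = () => {
    try {
      filterFn(doc); // Step 3: delete the unwanted nodes.
    } finally {
      style.remove(); // Step 4: always unhide, even if filtering throws.
    }
  };

  // Step 2: wait for the DOM, unless it has already been parsed.
  if (doc.readyState === "loading") {
    doc.addEventListener("DOMContentLoaded", run, { once: true });
  } else {
    run();
  }
}

// Usage in a page. ASSUMPTION: hypothetical attribute selector.
if (typeof document !== "undefined") {
  hideUntilFiltered(document, (doc) => {
    for (const el of doc.querySelectorAll('[data-source="FooNews"]')) {
      el.remove();
    }
  });
}
```

The `finally` block is there so that a bug in the filter cannot leave the page permanently blank. If the articles are inserted by a script after `DOMContentLoaded`, this would run too early and a later trigger would be needed.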

I wrote some test code to try this approach, and though tedious, it's not too hard. After studying the structure of the DOM on an already-built Google News page, I was able to write some code to search and walk the DOM to remove the news articles I didn't want to see. However, it's both messier and a lot more work to do it this way, since it leaves "holes" in the page structure where the deleted DOM nodes were. With more work I can move the remaining news articles around to fill those "holes", but I'd rather use one of the other methods if possible, as they seem to be easier and cleaner...not to mention faster. Fiddling with the DOM after it is already built takes longer, because hiding the page with CSS means the user sees nothing until the page is fully loaded, altered, and redisplayed.

The question

My intuition says that Option 1 would be cleanest and fastest (if it is possible to do), then Option 2 if not, and finally Option 3 as a last resort.

I would eventually like to turn this into a Firefox extension, so I want the solution I choose to have the following qualities:

  • As easy as possible to maintain the code (not so much an issue with initial complexity, but want it to be easy to revisit later when changes are needed to keep the extension up-to-date). Ideally the code would be as decoupled as possible from a dependency on the specific HTML format of the Google News page so that a code update isn't required every time Google tweaks the page.
  • As performant as possible (no lagginess, page flicker, etc.). I don't want users uninstalling the extension because it feels like a piece of junk.
  • As cross-browser as possible (to enable releasing the extension for Chrome, Edge, etc. in the future)

Of all the possible technical approaches to this problem (including others I might have missed), which will best satisfy my requirements?

RSW asked Nov 09 '22 05:11


1 Answer

I think you can do Option 1 easily enough. It would be similar to the strategy used for Option 3, only you would be manually converting the responses from your screenshot into an off-DOM DOM for querying, e.g.:

// Detached container: the markup is parsed but never attached to the page
var topNode = document.createElement('div');
topNode.innerHTML = response.html;

You could create a document fragment to serve as your work space for multiple responses, if needed.

I think you already know this, but for clarity's sake, the next steps would be to query the DOM you've constructed for the source hierarchy elements (e.g. .source .source-pref for the side bar, .source-cell .al-attribution-source for the main section). Then iterate over the nodes and look for innerText matching your offending news sources. For each match, walk back up the DOM and remove the outermost element.
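A rough sketch of those steps, under some assumptions: the two source selectors are the ones quoted above, `.story` is a hypothetical placeholder for whatever the outermost element of one article turns out to be, and `removeBlocked` / `stripBlockedSources` are made-up helper names. It compares `textContent` rather than `innerText`, since for a detached, unrendered container the two should give the same text.

```javascript
// Selectors quoted in the answer; they describe the page as it was at the
// time and may no longer match.
const SOURCE_SELECTORS = [
  ".source .source-pref",                 // side bar
  ".source-cell .al-attribution-source",  // main section
];
// ASSUMPTION: hypothetical selector for the outermost element of one article.
const OUTER_SELECTOR = ".story";

// Remove the outermost element of every article whose source is blocked.
// Returns the number of elements removed.
function removeBlocked(container, blockedNames) {
  const blocked = new Set(blockedNames.map((n) => n.toLowerCase()));
  let removed = 0;
  for (const el of container.querySelectorAll(SOURCE_SELECTORS.join(", "))) {
    if (!blocked.has(el.textContent.trim().toLowerCase())) continue;
    const outer = el.closest(OUTER_SELECTOR); // walk back up the tree
    if (outer && outer.parentNode) {
      outer.remove();
      removed += 1;
    }
  }
  return removed;
}

// Parse the markup off-DOM, filter it, and serialize it again.
// Needs a browser `document`; it is only defined here, not called.
function stripBlockedSources(html, blockedNames) {
  const topNode = document.createElement("div");
  topNode.innerHTML = html;
  removeBlocked(topNode, blockedNames);
  return topNode.innerHTML;
}
```

With the `response.html` property from the snippet above, usage would be along the lines of `response.html = stripBlockedSources(response.html, ["FooNews"]);`.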

Then substitute the innerHTML of your top node back into the response.
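The steps above do not say how to get hold of the response in the first place. In a Firefox extension, one option is `browser.webRequest.filterResponseData`, which lets a background script read and rewrite a response body before the page sees it. The sketch below makes several assumptions: the URL pattern is a guess, `FooNews` is a placeholder, and `dropBlockedSources` assumes the body is a JSON array of objects with a `source` field, which the thread does not confirm. If the payload carries HTML instead, the DOM-based filtering described above would go in its place.

```javascript
// Placeholder blocklist.
const BLOCKED = new Set(["FooNews"]);

// ASSUMPTION: the body is a JSON array of objects with a `source` field.
// The real payload format is not shown in the thread.
function dropBlockedSources(bodyText, blocked) {
  const items = JSON.parse(bodyText);
  return JSON.stringify(items.filter((item) => !blocked.has(item.source)));
}

// webRequest listener: buffer the whole body, rewrite it, pass it on.
function filterResponse(details) {
  const filter = browser.webRequest.filterResponseData(details.requestId);
  const decoder = new TextDecoder("utf-8");
  const encoder = new TextEncoder();
  let body = "";

  filter.ondata = (event) => {
    body += decoder.decode(event.data, { stream: true });
  };

  filter.onstop = () => {
    body += decoder.decode(); // flush any buffered bytes
    let out = body;
    try {
      out = dropBlockedSources(body, BLOCKED);
    } catch (e) {
      // Not the expected format: pass the original body through unchanged.
    }
    filter.write(encoder.encode(out));
    filter.close();
  };

  return {};
}

// Register only when running inside an extension background script.
if (typeof browser !== "undefined" && browser.webRequest) {
  browser.webRequest.onBeforeRequest.addListener(
    filterResponse,
    { urls: ["https://news.google.com/*"] }, // ASSUMPTION: guessed URL pattern
    ["blocking"]
  );
}
```

This needs the `webRequest` and `webRequestBlocking` permissions plus a matching host permission in the manifest. As far as I know, `filterResponseData` is only available in Firefox, so it works against the cross-browser goal, and an extension aiming at Chrome or Edge would likely have to fall back to DOM-level filtering.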

Gopherkhan answered Nov 15 '22 12:11