Where to find entire HTML content in Chromium source code

I am trying to do the following: once a webpage loads, check whether its URL matches a certain pattern (say www.wikipedia.com/*). If it does, parse the HTML content of that page, as one can with BeautifulSoup, and check whether the page has a div with class foo and id boo. Where can I write this code? Specifically: where can I get access to the URL, what do I need to listen to in order to know that the page has finished loading (after which I can inspect the URL and HTML content), and where and how can I parse the HTML?

I tried going through the code in src/chrome/browser/tab_contents, but I could not find any reasonable place to do all this.

SexyBeast asked Aug 30 '18 11:08


1 Answer

Take a look at the following conceptual application layers which represent how Chromium displays web pages:

[Image: How Chromium Displays Web Pages: conceptual application layers]

Image Source: https://docs.google.com/drawings/d/1gdSTfvLxbJDbX8oiWo5LTwAmXmdMQvjoUhYEhfhj0-k/edit

The different layers are described as:

  • WebKit: Rendering engine shared between Safari, Chromium, and all other WebKit-based browsers. The Port is a part of WebKit that integrates with platform dependent system services such as resource loading and graphics.
  • Glue: Converts WebKit types to Chromium types. This is our "WebKit embedding layer." It is the basis of two browsers, Chromium, and test_shell (which allows us to test WebKit).
  • Renderer / Render host: This is Chromium's "multi-process embedding layer." It proxies notifications and commands across the process boundary.
  • WebContents: A reusable component that is the main class of the Content module. It's easily embeddable to allow multiprocess rendering of HTML into a view. See the content module pages for more information.
  • Browser: Represents the browser window, it contains multiple WebContentses.
  • Tab Helpers: Individual objects that can be attached to a WebContents (via the WebContentsUserData mixin). The Browser attaches an assortment of them to the WebContentses that it holds (one for favicons, one for infobars, etc).

Since your goal is to access and inspect the HTML content of a web page by element id and/or class, look at the renderer process, which uses Blink:

The renderers use the Blink open-source layout engine for interpreting and laying out HTML.

Blink has a WebDocument class which allows you to access the HTML content and other properties of a web page:

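// Runs in the renderer process; GetMainFrame() is assumed to return the frame's blink::WebLocalFrame*.
WebDocument document = GetMainFrame()->GetDocument();
// Look up an element by its id attribute; the result IsNull() if no such element exists.
WebElement element = document.GetElementById(WebString::FromUTF8("example"));
// document.Url() returns the URL of the document.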
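
The URL check from the question (www.wikipedia.com/*) does not depend on any Chromium API, so it can be sketched on its own. What follows is a minimal, standalone C++17 sketch, not Chromium code. StripScheme, GlobMatch, and UrlMatchesPattern are hypothetical helper names, and the sketch assumes that the URL is already available as a string, that '*' is the only wildcard, and that matching is case-sensitive.

```cpp
#include <cstddef>
#include <string_view>

// Drops a leading "scheme://" (for example "https://") if one is present.
std::string_view StripScheme(std::string_view url) {
  const std::size_t pos = url.find("://");
  return pos == std::string_view::npos ? url : url.substr(pos + 3);
}

// Returns true if `text` matches `pattern`. A '*' in the pattern matches any
// run of characters (including none); every other character matches literally.
bool GlobMatch(std::string_view pattern, std::string_view text) {
  std::size_t p = 0;                          // index into pattern
  std::size_t t = 0;                          // index into text
  std::size_t star = std::string_view::npos;  // index of the last '*' seen
  std::size_t mark = 0;                       // text index to retry from
  while (t < text.size()) {
    if (p < pattern.size() && pattern[p] == '*') {
      star = p++;  // remember the wildcard and first try matching nothing
      mark = t;
    } else if (p < pattern.size() && pattern[p] == text[t]) {
      ++p;
      ++t;
    } else if (star != std::string_view::npos) {
      p = star + 1;  // backtrack: let the last '*' absorb one more character
      t = ++mark;
    } else {
      return false;
    }
  }
  // Only trailing wildcards may remain unmatched in the pattern.
  while (p < pattern.size() && pattern[p] == '*') ++p;
  return p == pattern.size();
}

// Hypothetical entry point: compares the URL without its scheme to a pattern
// such as "www.wikipedia.com/*".
bool UrlMatchesPattern(std::string_view url, std::string_view pattern) {
  return GlobMatch(pattern, StripScheme(url));
}
```

For example, UrlMatchesPattern("https://www.wikipedia.com/wiki/Chromium", "www.wikipedia.com/*") returns true, while a URL on a different host returns false. Because the pattern requires the slash after the host, "https://www.wikipedia.com" with no trailing slash does not match. Host names are case-insensitive in practice, so a real implementation would likely normalize case first.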

Grant Miller answered Sep 17 '22 08:09