I'd like to either remove an HTML element or simply remove the first N characters of a webpage before evaluating/rendering it.
Is there any way to do that?
It depends on multiple scenarios. I will only outline the steps for each combination of the answers to the following questions.
Register an onResourceRequested listener and call request.abort() depending on the matched URL.
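A minimal sketch of that listener, assuming the unwanted script can be recognized by its URL. The pattern list and the helper name shouldAbort are hypothetical; the PhantomJS part only runs when the global phantom object exists.

```javascript
// Hypothetical patterns for the script URLs that should be blocked.
var blockedPatterns = [/unwanted\.js(\?|$)/];

// Returns true when the requested URL matches one of the patterns.
function shouldAbort(url) {
    for (var i = 0; i < blockedPatterns.length; i++) {
        if (blockedPatterns[i].test(url)) {
            return true;
        }
    }
    return false;
}

// Wire the check into PhantomJS only when running under PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.onResourceRequested = function (requestData, networkRequest) {
        if (shouldAbort(requestData.url)) {
            networkRequest.abort(); // the script is never downloaded or executed
        }
    };
    page.open('http://www.example.com', function () {
        phantom.exit();
    });
}
```

Keeping the URL check in its own function makes it easy to try the patterns against a few sample URLs before running the page.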
This can only be done when the following code blocks do not depend on the code that should not be removed (which is unlikely). This is most likely necessary for click events that are registered in the DOM.
In this case, cancel the request like in 1., download the script through an XHR, remove the unwanted code parts and add the code block to the DOM. For this to work, you need to disable web security by passing --web-security=false on the command line, because otherwise a resource cannot be requested unless it is on the same domain.
onload (ol) & can be removed completely (rc)
This is probably very error prone. You would start an interval with setInterval(function(){}, 5) from a page.onInitialized callback. Inside the interval you would need to check whether window.onload (or something else you can get your hands on) is set in the page context. You remove it if it is indeed the function that you wanted to remove, which you can check with window.onload.toString().match(/something/).
This can be done directly and completely inside the page context (inside page.evaluate).
onload (ol) & contains other code too (nr)
Begin like in 3., but instead of removing window.onload, you can do
eval("window.onload = " + window.onload.toString().replace(/something/,''))
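The idea of rebuilding a handler from its source text can be tried outside the browser. This is only a sketch: the win object stands in for the page's window, and stripFromHandler is a hypothetical helper name. Note that a function rebuilt this way loses the closure scope of the original handler, so it only works when the handler relies on globally reachable names.

```javascript
// Stand-in for the page's window object so the sketch runs without a browser.
var win = {
    log: [],
    onload: function () {
        win.log.push('keep');
        win.log.push('unwanted');
    }
};

// Serializes the handler, removes the matching source text and
// evaluates the remaining source as a new function.
function stripFromHandler(handler, pattern) {
    var source = handler.toString().replace(pattern, '');
    return eval('(' + source + ')');
}

// Replace the handler with a copy that no longer contains the unwanted call.
win.onload = stripFromHandler(win.onload, /win\.log\.push\('unwanted'\);/);
win.onload();
console.log(win.log.join(',')); // prints "keep"
```

Inside PhantomJS the same replacement would have to run in the page context, for example from page.evaluate, with window in place of win.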
You can load the page as an XHR, replace the text and apply the adjusted content to the page. This will essentially be a filled about:blank page. For this to work, you would need to disable web security, because otherwise no resource can be requested if it is not on the same domain: --web-security=false or --local-to-remote-url-access=true. This would also work for 3. and 4.
There is still one problem though. Pages don't use full URLs most of the time, so when a script or element refers to stuff.php, PhantomJS cannot request it. When page.content is set, the page URL is essentially about:blank and all requests with incomplete URLs point to file:///... . Obviously there are no such files. Those resources must be replaced with their full URL counterparts.
There are three types of such URLs:
//example.com/resource.php (variable protocol)
/resource.php (variable protocol and domain)
resource.php (variable protocol, domain and path to resource)
Complete example:
var page = require('webpage').create(),
    url = 'http://www.example.com';

page.open(url, function(status) {
    if (status !== 'success') {
        console.log('Unable to access network');
        phantom.exit(1); // exit here too, otherwise the script hangs
    } else {
        // Fetch the raw page source with a synchronous XHR in the page context.
        var content = page.evaluate(function(url){
            var xhr = new XMLHttpRequest();
            xhr.open("GET", url, false);
            xhr.send();
            return xhr.responseText;
        }, url);
        // Render the original page for comparison.
        page.render("test_example.png");
        // Apply the modified source; the page URL becomes about:blank.
        page.content = content.replace(/xample/g,"asy");
        page.render("test_easy.png");
        console.log("url "+page.url); // about:blank
        phantom.exit();
    }
});
You might want to look into proper manipulation techniques apart from the simple string replace.
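The three kinds of incomplete URLs listed above can be turned into full URLs with a small helper before the content is assigned to page.content. This is a rough sketch under simplifying assumptions: absolutize is a hypothetical name, only http and https page URLs are handled, and path segments such as ../ are not normalized.

```javascript
// Resolves a reference found in the page against the original page URL.
function absolutize(ref, pageUrl) {
    var m = /^(https?:)\/\/([^\/?#]+)([^?#]*)/.exec(pageUrl);
    if (!m) {
        throw new Error('unsupported page URL: ' + pageUrl);
    }
    var protocol = m[1], host = m[2], path = m[3] || '/';

    // Already a full URL with its own scheme: leave untouched.
    if (/^[a-z][a-z0-9+.\-]*:/i.test(ref)) {
        return ref;
    }
    // //example.com/resource.php: only the protocol is missing.
    if (ref.indexOf('//') === 0) {
        return protocol + ref;
    }
    // /resource.php: protocol and domain are missing.
    if (ref.charAt(0) === '/') {
        return protocol + '//' + host + ref;
    }
    // resource.php: protocol, domain and directory of the page are missing.
    var dir = path.substring(0, path.lastIndexOf('/') + 1);
    return protocol + '//' + host + dir + ref;
}

console.log(absolutize('stuff.php', 'http://www.example.com/a/b.html'));
// prints "http://www.example.com/a/stuff.php"
```

Finding the references in the markup (src and href attributes, URLs inside scripts and stylesheets) is a separate step that this sketch does not cover.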