Modify HTML Response (Not Headers)

Hoping someone can help me out or point me in the right direction.

I've been asked to find out how to make Akamai (or any other CDN, or NGINX) modify the actual response body.

Why?

I've been told to have the CDN rewrite every "http://" URL in the response to "https://", instead of modifying the app code to use protocol-relative "//" URLs for external resources.

Is this possible?

Anyone know?

Charlie asked Sep 12 '14 23:09


2 Answers

This appears to be possible via a number of different approaches, though that says nothing about how advisable it actually is.

It seems potentially problematic (example: what if you rewrite something that shouldn't have been rewritten?) and machine-resource-intensive (a lot of CPU cycles to parse and munge response bodies, repeatedly).

Here's what I found:

Nginx has the http_sub_module that appears to accomplish this in a fairly straightforward way, assuming what you want to replace is simple and you only need to match one pattern per page, like replacing <a href="http://example.com/... with <a href="https://example.com/..., one or more times. This kind of content-mungery seems sketchy but depending on the situation you're in (which may be one of limited control of the application) it might get you there.

There's also a third-party module, ngx_http_substitutions_filter_module (not part of the core Nginx distribution), that can do more powerful, regex-based rewriting of response bodies.

Varnish seems to have a similar capability (possibly a plugin) but HAProxy doesn't, since it only deals in headers and leaves bodies alone except when doing gzip offloading. Other reverse-proxy-capable software like Apache or Squid might also offer something useful, that you'd place in front of your application server.

My initial impression, in any event, is that simple string replacing may not quite get you there, and even regex-based replacing isn't really sufficient, without significant sophistication in the regexes, because you always run the risk of rewriting something that you shouldn't.
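To make that risk concrete, here is a minimal Python sketch. The sample markup and the function name `naive_rewrite` are invented for illustration and are not from any real site. A blanket string replacement fixes the link, but it also alters an XML namespace identifier (a fixed string, not a URL to fetch) and a URL that appears only as visible text.

```python
# Hypothetical sample markup, invented for this illustration.
SAMPLE = (
    '<a href="http://example.com/page">link</a>\n'
    '<svg xmlns="http://www.w3.org/2000/svg"></svg>\n'
    '<p>Type http://localhost:8080/ into your browser.</p>\n'
)


def naive_rewrite(body: str) -> str:
    """Blindly replace every occurrence, with no regard for context."""
    return body.replace("http://", "https://")


if __name__ == "__main__":
    # Prints the rewritten markup; note that all three lines were changed,
    # although only the first one was the intended target.
    print(naive_rewrite(SAMPLE))
```

A tighter regex (for example, anchoring on `href="`) reduces the collateral damage, but it still has to guess at context that an HTML parser knows for certain.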

What I would suggest "really needs to happen" in order to accomplish this purpose in the most correct way, would be to actually interpret the generated HTML with a DOM parsing library, traverse the tree, and modify the relevant elements in-place, before handing the revised document to the requester. This way, the document gets modified based on a contextual understanding of its contents.
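As a rough sketch of the idea, and not the proxy described in this answer, the following uses Python's standard-library `html.parser`. It is a tokenizer-level rewrite rather than a full DOM tree, but it is enough to show context-aware replacement: only known URL-bearing attributes are touched. The class name `HttpsRewriter`, the helper `rewrite`, and the `URL_ATTRS` list are assumptions made for this example. Processing instructions and CDATA sections are not handled, and attribute quoting is normalized to double quotes.

```python
from html import escape
from html.parser import HTMLParser

# Assumption for this sketch: only these (tag, attribute) pairs are rewritten.
URL_ATTRS = {("a", "href"), ("link", "href"), ("img", "src"), ("script", "src")}


class HttpsRewriter(HTMLParser):
    """Re-emit the document, upgrading http:// only in known URL attributes."""

    def __init__(self):
        # convert_charrefs=False so entities in text are reported separately
        # and can be passed through unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit_start(self, tag, attrs, closer):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # bare attribute, e.g. <input disabled>
                parts.append(name)
                continue
            if (tag, name) in URL_ATTRS and value.startswith("http://"):
                value = "https://" + value[len("http://"):]
            # The parser hands over unescaped values, so escape them again.
            parts.append(f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + closer)

    def handle_starttag(self, tag, attrs):
        self._emit_start(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_start(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)  # text content is never rewritten

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def rewrite(html_text: str) -> str:
    """Return html_text with http:// upgraded in URL_ATTRS attributes only."""
    parser = HttpsRewriter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Calling `rewrite()` on a page upgrades `<a href="http://...">` while leaving an `xmlns` namespace or a URL mentioned in body text alone, which is exactly the distinction a plain string replacement cannot make.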

It sounds complicated, in my opinion, because it is -- so I would again suggest you reconsider your planned approach unless this is outside your control.

Final thought: Curiosity got the best of me, so I took this question and retrofitted the http reverse proxy I wrote (for a different purpose) so that, based on the content-type, it could actually parse and walk the HTML structure as a proper entity, modifying it in place (as described above), before returning the response body to the requester.

This turns out, as I expected, to be fairly processor-intensive. My test content was 29K of real-world HTML from a live site, containing 56 <a href ...> and 6 <link rel ...> elements, and the rewrite operation required 128 ms on a 1 GHz Opteron 1218 and 43 ms on a 2.4 GHz Xeon E5620. These benchmarks cover strictly the additional operations, excluding the (smaller amount of) time required for the actual "proxy" functionality itself. This time cost is not insurmountable, but could add up to a lot of CPU time. It is far longer than a regular expression-based content rewrite would take, but it's far more precise and unlikely to break the pages it touches.

Michael - sqlbot answered Oct 02 '22 13:10


Nginx's HttpSubsModule worked great for me: http://wiki.nginx.org/HttpSubsModule

Changing from http to https should be as simple as this:

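    location / {
        # text/html is always filtered; list only the extra MIME types.
        sub_filter_types text/css text/xml;
        # sub_filter takes exactly two arguments (search string, replacement);
        # matching already ignores case, so no "gi" flags are needed.
        sub_filter 'http://example.com' 'https://example.com';
        # Replace every occurrence, not just the first.
        sub_filter_once off;
    }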

By default, only the first occurrence is replaced. Set sub_filter_once off; to replace all of them.

Raptor answered Oct 02 '22 12:10