Replace parts of HTML strings in multiple documents

I am saving parts of an existing Microsoft Word document as HTML and embedding this HTML dynamically in panels to give instructions to the users.

This works fine except for the images, which do not appear. Inspecting the generated HTML, I see that the markup for the image is

<img src="home_files/image001.png" />

In Visual Studio the HTML help pages are stored in a folder called Help, so I changed this line to include the help folder

<img src="help/home_files/image001.png" />

With this change the image is displayed correctly.


I have to generate over 50 help pages from Word documents, so I do not want to change all of the image locations manually, especially since some pages will be regenerated whenever the documents change.

Is there a way for the images to be displayed correctly without editing the messy documents generated by Word?

Or is there a better way to generate HTML versions of Word documents?

I didn't use PDFs, as not everyone's browser will display PDFs embedded in a web page.

Nick Le Page asked May 30 '15 08:05


2 Answers

Is there a way for the images to be displayed correctly without editing the messy documents generated by Word?

I guess you could just run some simple client-side code to change the src attribute of those <img> tags. You would get something like this:

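// "#container" is a placeholder: use the selector of the panel that holds the help HTML
var imgs = document.querySelector("#container").querySelectorAll("img");
for (var i = 0; i < imgs.length; i++) {
  // getAttribute needs the attribute name; it returns the src exactly as written in the markup
  var oldSrc = imgs[i].getAttribute("src");
  imgs[i].setAttribute("src", "help/" + oldSrc);
}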

The same can of course be done in any server-side DOM implementation. Do note that these may lack some of the features used in the snippet above, so the code might need rewriting.
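As a rough sketch of a server-side variant, the same rewrite can be applied to the HTML string before it is embedded. This version uses a regular expression rather than a real DOM implementation, so it is only a starting point; the function name prefixImageSources and the "help/" prefix are assumptions for illustration.

```javascript
// Hypothetical helper: put `prefix` in front of every relative <img src>.
// Regex-based sketch, not a full HTML parser.
function prefixImageSources(html, prefix) {
  // head = everything up to the opening quote, quote = ' or ", src = the value
  return html.replace(
    /(<img\b[^>]*?\ssrc\s*=\s*)(["'])(.*?)\2/gi,
    (match, head, quote, src) => {
      // Leave absolute URLs, data URIs and root-relative paths untouched.
      const isAbsolute = /^(?:[a-z][a-z0-9+.-]*:|\/)/i.test(src);
      // Skip values that already start with the prefix, so a second run changes nothing.
      if (isAbsolute || src.startsWith(prefix)) return match;
      return head + quote + prefix + src + quote;
    }
  );
}

console.log(prefixImageSources('<img src="home_files/image001.png" />', "help/"));
// prints: <img src="help/home_files/image001.png" />
```

Because already-prefixed and absolute sources are skipped, the function can be applied every time a page is loaded or regenerated without stacking up "help/help/" paths.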

Or is there a better way to generate HTML versions of Word documents?

To be honest, it's a pretty bad idea in general (or at least it was in the past). Word isn't meant for this kind of thing, so you might run into a lot of trouble. Years ago I worked for a company that had a special tool just to clean up HTML content copied from Word, and although I never did any maintenance on it, I do remember the code being quite complex, so I wouldn't be surprised if you ran into unexpected issues. It is far more logical to have the content written in an editor that is meant for the web in the first place. Even copy-pasting into such an editor might do wonders (if the editor is a fairly strict one).

David Mulder answered Oct 26 '22 21:10


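<?php
// Recursively walk $root and fix the image folder path in every *.html file.
function processFiles($root)
{
    // Normalise the path so it always ends with exactly one separator.
    $root = rtrim($root, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR;
    if($hDir = opendir($root))
    {
        // readdir() returns false once there are no more entries.
        while(false !== ($filename = readdir($hDir)))
        {
            // Skip the current and parent directory entries.
            if($filename == '.' || $filename == '..')
                continue;

            $file = $root . $filename;
            if(is_dir($file))
                processFiles($file); // descend into subfolders
            elseif(pathinfo($file, PATHINFO_EXTENSION) == 'html')
            {
                // Rewrite the file in place with the corrected path.
                $old = file_get_contents($file);
                $new = str_replace('home_files/', 'help/home_files/', $old);
                file_put_contents($file, $new);
            }
        }
        closedir($hDir);
    }
}

processFiles('folder/with/html-files/');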

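This will process all of your *.html files, including those in subfolders, and do a str_replace() on them to fix the wrong path. Run it only once per generated file: a second pass over the same file would turn help/home_files/ into help/help/home_files/.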
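Since some pages will be regenerated and the script may be run more than once, here is a sketch of the same batch fix in Node.js that is safe to repeat. It assumes each page keeps its images in a folder named <page>_files, as in the home_files example from the question. The names addFolderPrefix and processHtmlFiles are made up for illustration.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper: prefix src values whose FIRST path segment ends in
// "_files/". An already prefixed value ("help/home_files/...") starts with
// "help/", so it no longer matches and is left alone.
function addFolderPrefix(html, prefix) {
  return html.replace(
    /(\ssrc\s*=\s*["'])([^"'\/:]+_files\/)/gi,
    (match, head, folder) => head + prefix + folder
  );
}

// Recursively process every .htm/.html file under `root`.
// Returns the number of files that were actually changed.
function processHtmlFiles(root, prefix) {
  let changed = 0;
  for (const entry of fs.readdirSync(root, { withFileTypes: true })) {
    const full = path.join(root, entry.name);
    if (entry.isDirectory()) {
      changed += processHtmlFiles(full, prefix);
    } else if (/\.html?$/i.test(entry.name)) {
      // "latin1" maps bytes to characters one-to-one, so bytes outside the
      // replaced text are written back unchanged whatever the file's encoding.
      const oldHtml = fs.readFileSync(full, "latin1");
      const newHtml = addFolderPrefix(oldHtml, prefix);
      if (newHtml !== oldHtml) {
        fs.writeFileSync(full, newHtml, "latin1");
        changed++;
      }
    }
  }
  return changed;
}
```

It would be called like processHtmlFiles("folder/with/html-files/", "help/"). Files that need no change are not rewritten, so their timestamps stay as they were.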

Nat answered Oct 26 '22 21:10