Split innerHTML into text for translation JSON in JavaScript

Question

Currently I am working on an application that needs to extract the innerHTML of the body and then pull the text out of it into a JSON object. That JSON will be sent for translation, and the translated JSON will then be used as input to recreate the same HTML markup with the translated text. Please see the snippets below.

HTML Input

<section>Hello, <div>This is some text which I need to extract.<a class="link">It can be <strong> complicated.</strong></a></div><span>The extracted text should contain the html tag if it has any html tag in the span,p or a tag</span><p>Please see the <span>desired output below.</span></p>Thanks!</section>

Translation JSON Output

{
"text1":"Hello, ",
"text2":"This is some text which I need to extract.",
"text3":"It can be <strong> complicated.</strong>",
"text4":"The extracted text should contain the html tag if it has any html tag in the span,p or a tag",
"text5":"Please see the <span>desired output below.</span>",
"text6":"Thanks!"
}

Translated JSON Input

{
"text1":"Hello,-in spanish ",
"text2":"This is some text which I need to extract.-in spanish",
"text3":"It can be <strong> complicated.-in spanish</strong>",
"text4":"The extracted text should contain the html tag if it has any html tag in the span,p or a tag-in spanish",
"text5":"Please see the <span>desired output below.-in spanish</span>",
"text6":"Thanks!-in spanish"
}

Translated HTML Output

<section>Hello,-in spanish <div>This is some text which I need to extract.-in spanish<a class="link">It can be <strong> complicated.-in spanish</strong></a></div><span>The extracted text should contain the html tag if it has any html tag in the span,p or a tag-in spanish</span><p>Please see the <span>desired output below.-in spanish</span></p>Thanks!-in spanish</section>

I tried various regexes. Below is one of the flows I ended up with, but I am not able to achieve the desired output with it.

//encode
const bodyHTML = '<a class="test">hello world<strong> this is gonna be hard</strong></a>';
//replace the quotes with escape quotes
const htmlContent = bodyHTML.replace(/"/g, '\"');
let count = 0;
let translationObj = {};
let newHtml = htmlContent.replace(/\>(.*?)\</g, function(match) {
  //remove the special character	
  match = match.replace(/\>|\</g, '');
  count = count + 1;
  translationObj[count] = match;

  return '>~' + count + '~<';
});

const translationJSON = '{"1":"hello world in spanish","2":" this is gonna be hard in spanish","3":""}';

//decode
let translatedHtml = '';
const translatedObj = JSON.parse(translationJSON);
translatedHtml = newHtml.replace(/\~(.*?)\~/g, function(match) {
  //remove the special character
  match = match.replace(/\~|\~/g, '');

  return translatedObj[match];
});
//replace the escape quotes with quotes
translatedHtml = translatedHtml.replace(/\"/g, '"');
//console.log()
console.log("bodyHTML", bodyHTML);
console.log('translationObj', translationObj);
console.log("translationJSON", translationJSON);
console.log('newHtml', newHtml);
console.log("translatedHtml", translatedHtml);

I am looking for a working regex or any other approach in the JS world that would get the desired result. I want to get all the text inside the HTML in the form of JSON. Another condition is not to split the text if it contains inner HTML tags, so that we don't lose the context of the sentence. For example, <p>Click <a>here</a></p> should be considered as one text: "Click <a>here</a>". I hope that clarifies everything.

Thanks in advance!

T.J. Crowder · Accepted Answer

By far, the best way to do this is by using an HTML parser, then looping through the text nodes in the tree. You cannot correctly handle a non-regular markup language like HTML with just simple JavaScript regular expressions¹ (many have wasted a lot of time trying), and that's not even taking into account all of HTML's specific peculiarities.
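To make the nesting problem concrete, here is a small sketch of how both a lazy and a greedy pattern end up pairing the wrong tags. The sample strings are made up for illustration and are not from the question:

```javascript
// Illustrative input only: a div nested inside another div.
const nested = '<div>outer <div>inner</div> tail</div>';

// A lazy match stops at the FIRST closing tag, which belongs to the inner div,
// so the captured "content" of the outer div is cut short.
const lazy = nested.match(/<div>(.*?)<\/div>/)[1];
console.log(lazy); // outer <div>inner

// A greedy match runs to the LAST closing tag, so two sibling divs
// are swallowed as if they were one element.
const siblings = '<div>a</div><div>b</div>';
const greedy = siblings.match(/<div>(.*)<\/div>/)[1];
console.log(greedy); // a</div><div>b
```

Neither pattern can know which closing tag belongs to which opening tag, because that requires counting nesting depth.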

There are at least a couple, probably several, well-tested, actively-supported DOM parser modules available on npm.

So the basic structure would be:

1. Parse the HTML into a DOM.
2. Walk the DOM in a defined order (typically depth-first traversal), building up your object or array of text strings to translate from the text nodes you encounter.
3. Convert that object/array to JSON if necessary, send it off for translation, get the result back, and parse it from JSON into an object/array again if necessary.
4. Walk the DOM in the same order, applying the results from the object/array.
5. Serialize the DOM to HTML.
6. Send the result.
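The steps above can be sketched without any parser at all, which may help to see the flow in isolation. This is only a rough illustration under stated assumptions: the tree is built by hand from plain objects that mimic the nodeType / nodeValue / childNodes shape of DOM nodes, the helpers el, text, walk and serialize are hypothetical, the serializer ignores attributes and escaping, and upper-casing stands in for the translation service. The text1, text2, ... keys follow the format in the question.

```javascript
// Hypothetical stand-in for a parsed DOM: plain objects shaped like DOM nodes.
const ELEMENT = 1;
const TEXT = 3;
const el = (tag, ...childNodes) => ({ nodeType: ELEMENT, tag, childNodes });
const text = (nodeValue) => ({ nodeType: TEXT, nodeValue });

// Depth-first walk; both passes use it, so the visiting order is identical.
function walk(node, callback) {
  callback(node);
  for (const child of node.childNodes || []) {
    walk(child, callback);
  }
}

// Simplified serializer: no attributes, no escaping.
function serialize(node) {
  if (node.nodeType === TEXT) return node.nodeValue;
  const inner = node.childNodes.map(serialize).join('');
  return `<${node.tag}>${inner}</${node.tag}>`;
}

// "Parse": a real parser would produce this tree from an HTML string.
const tree = el(
  'section',
  text('Hello, '),
  el('p', text('Please see '), el('span', text('below.'))),
  text('Thanks!')
);

// First walk: collect text nodes under keys text1, text2, ...
const toTranslate = {};
let count = 0;
walk(tree, (node) => {
  if (node.nodeType === TEXT) toTranslate[`text${++count}`] = node.nodeValue;
});

// JSON round trip; the upper-casing is a placeholder for the real translation.
const sent = JSON.stringify(toTranslate);
const received = JSON.stringify(
  Object.fromEntries(
    Object.entries(JSON.parse(sent)).map(([key, value]) => [key, value.toUpperCase()])
  )
);
const translated = JSON.parse(received);

// Second walk: same order, so the counter lines up with the keys again.
count = 0;
walk(tree, (node) => {
  if (node.nodeType === TEXT) node.nodeValue = translated[`text${++count}`];
});

// Serialize the modified tree back to markup.
const output = serialize(tree);
console.log(output);
// <section>HELLO, <p>PLEASE SEE <span>BELOW.</span></p>THANKS!</section>
```

The important design point is that the strings never carry position information themselves; the second walk recovers it purely from visiting the nodes in the same order as the first.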

Here's an example — naturally here I'm using the HTML parser built into the browser rather than an npm module, and the API to whatever module you're using may be slightly different, but the concept is the same:

var html = '<section>Hello, <div>This is some text which I need to extract.<a class="link">It can be <strong> complicated.</strong></a></div><span>The extracted text should contain the html tag if it has any html tag in the span,p or a tag</span><p>Please see the <span>desired output below.</span></p>Thanks!</section>';
var dom = parseHTML(html);
var strings = [];
walk(dom, function(node) {
  if (node.nodeType === 3) { // text node
    strings.push(node.nodeValue);
  }
});
console.log("strings = ", strings);
var translation = translate(strings);
console.log("translation = ", translation);
var n = 0;
walk(dom, function(node) {
  if (node.nodeType === 3) { // text node
    node.nodeValue = translation[n++];
  }
});
var newHTML = serialize(dom);
document.getElementById("before").innerHTML = html;
document.getElementById("after").innerHTML = newHTML;


function translate(strings) {
  return strings.map(str => str.toUpperCase());
}

function walk(node, callback) {
  var child;
  callback(node);
  switch (node.nodeType) {
    case 1: // Element
      for (child = node.firstChild; child; child = child.nextSibling) {
        walk(child, callback);
      }
  }
}

// Placeholder for module function
function parseHTML(html) {
  var div = document.createElement("div");
  div.innerHTML = html;
  return div;
}

// Placeholder for module function
function serialize(dom) {
  return dom.innerHTML;
}

<strong>Before:</strong>
<div id="before"></div>
<strong>After:</strong>
<div id="after"></div>
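The question also asks that text inside a span, p, or a tag stay in one piece, inner tags included. One possible way to layer that onto the same kind of walk, as a rough sketch: treat those elements as single units and read or write their innerHTML instead of descending into them. The KEEP_WHOLE set and the functions eachUnit, collectUnits and applyUnits are names invented here for illustration, and the sketch assumes the translation service returns embedded tags intact.

```javascript
// Elements whose contents are handed over as one string, inner tags included.
// The set mirrors the "span, p or a" wording in the question; adjust as needed.
// tagName is upper-case for HTML elements in an HTML document.
const KEEP_WHOLE = new Set(['SPAN', 'P', 'A']);

// Visit translation units in document order. A unit is either a kept-whole
// element (via innerHTML) or a text node outside any such element (via nodeValue).
function eachUnit(node, visit) {
  if (node.nodeType === 3) {            // text node
    visit(node, 'nodeValue');
  } else if (node.nodeType === 1) {     // element
    if (KEEP_WHOLE.has(node.tagName)) {
      visit(node, 'innerHTML');
      return;                           // children travel with the unit
    }
    for (let child = node.firstChild; child; child = child.nextSibling) {
      eachUnit(child, visit);
    }
  }
}

// Gather the strings to translate.
function collectUnits(root) {
  const units = [];
  eachUnit(root, (node, prop) => units.push(node[prop]));
  return units;
}

// Write translated strings back in the same order they were collected.
function applyUnits(root, translated) {
  let n = 0;
  eachUnit(root, (node, prop) => {
    node[prop] = translated[n++];
  });
}
```

In a browser, passing the element returned by the parseHTML placeholder above to collectUnits should yield the six strings listed in the question for its sample input, and applyUnits followed by serialize would then produce the translated markup.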

¹ Some "regex" libs (or regex features of other languages) are really regex+more features that can help you do something similar, but they're not just regex, and JavaScript's built-in ones don't have those features.

Tags: json, javascript, html, regex, parsing

dk111989
