I am currently working on an application that needs to extract the innerHTML of the body and pull the text out of it into JSON. That JSON will be sent for translation, and the translated JSON will then be used as input to rebuild the same HTML markup with the translated text. The snippets below show, in order, the input HTML, the JSON I want to extract, the translated JSON, and the HTML I want to end up with.
<section>Hello, <div>This is some text which I need to extract.<a class="link">It can be <strong> complicated.</strong></a></div><span>The extracted text should contain the html tag if it has any html tag in the span,p or a tag</span><p>Please see the <span>desired output below.</span></p>Thanks!</section>
{
"text1":"Hello, ",
"text2":"This is some text which I need to extract.",
"text3":"It can be <strong> complicated.</strong>",
"text4":"The extracted text should contain the html tag if it has any html tag in the span,p or a tag",
"text5":"Please see the <span>desired output below.</span>",
"text6":"Thanks!"
}
{
"text1":"Hello,-in spanish ",
"text2":"This is some text which I need to extract.-in spanish",
"text3":"It can be <strong> complicated.-in spanish</strong>",
"text4":"The extracted text should contain the html tag if it has any html tag in the span,p or a tag-in spanish",
"text5":"Please see the <span>desired output below.-in spanish</span>",
"text6":"Thanks!-in spanish"
}
<section>Hello,-in spanish <div>This is some text which I need to extract.-in spanish<a class="link">It can be <strong> complicated.-in spanish</strong></a></div><span>The extracted text should contain the html tag if it has any html tag in the span,p or a tag-in spanish</span><p>Please see the <span>desired output below.-in spanish</span></p>Thanks!-in spanish</section>
I tried various regexes. Below is one of the flows I ended up with, but it does not produce the desired output.
//encode
const bodyHTML = '<a class="test">hello world<strong> this is gonna be hard</strong></a>';
//replace the quotes with escape quotes
const htmlContent = bodyHTML.replace(/"/g, '\\"');
let count = 0;
let translationObj = {};
let newHtml = htmlContent.replace(/\>(.*?)\</g, function(match) {
  //remove the special character
  match = match.replace(/\>|\</g, '');
  count = count + 1;
  translationObj[count] = match;
  return '>~' + count + '~<';
});
const translationJSON = '{"1":"hello world in spanish","2":" this is gonna be hard in spanish","3":""}';
//decode
let translatedHtml = '';
const translatedObj = JSON.parse(translationJSON);
translatedHtml = newHtml.replace(/\~(.*?)\~/g, function(match) {
  //remove the special character
  match = match.replace(/\~|\~/g, '');
  return translatedObj[match];
});
//replace the escape quotes with quotes
translatedHtml = translatedHtml.replace(/\\"/g, '"');
//console.log()
console.log("bodyHTML", bodyHTML);
console.log('translationObj', translationObj);
console.log("translationJSON", translationJSON);
console.log('newHtml', newHtml);
console.log("translatedHtml", translatedHtml);
I am looking for a working regex, or any other approach in the JS world, that would get the desired result. I want to get all the text inside the HTML in the form of JSON. Another condition is that the text must not be split if it contains inner HTML tags, so that we don't lose the context of the sentence. For example, given
<p>Click <a>here</a></p>
it should be treated as a single text: "Click <a>here</a>". I hope that clears up any doubts.
Thanks in advance!
By far, the best way to do this is by using an HTML parser, then looping through the text nodes in the tree. You cannot correctly handle a non-regular markup language like HTML with just simple JavaScript regular expressions¹ (many have wasted a lot of time trying), and that's not even taking into account all of HTML's specific peculiarities.
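To make the limitation concrete, here is a small sketch that runs the same tag-boundary pattern used in the question against its `<p>Click <a>here</a></p>` example. The helper name `extractWithRegex` is made up for this illustration.

```javascript
// Hypothetical helper: captures whatever sits between a '>' and the next '<',
// which is the idea behind the pattern in the question's attempt.
function extractWithRegex(html) {
  return Array.from(html.matchAll(/>(.*?)</g), (m) => m[1]);
}

const pieces = extractWithRegex('<p>Click <a>here</a></p>');
console.log(pieces); // three entries: 'Click ', 'here' and an empty string
```

The sentence comes back as two separate fragments plus an empty string from the gap between `</a>` and `</p>`, because the pattern has no notion of which tags enclose which. A parser gives you that nesting for free.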
There are at least a couple, probably several, well-tested, actively-supported DOM parser modules available on npm.
So the basic structure would be:
1. Parse the HTML into a DOM.
2. Walk the DOM in a defined order (typically depth-first traversal), building up your object or array of text strings to translate from the text nodes you encounter.
3. Convert that object/array to JSON if necessary, send it off for translation, get the result back, and parse it from JSON into an object/array again if necessary.
4. Walk the DOM in the same order, applying the results from the object/array.
5. Serialize the DOM to HTML.
6. Send the result.
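As a parser-agnostic sketch of the extract, translate, and write-back round trip described above, the snippet below uses a tiny hand-built tree of plain objects in place of a real DOM, so it runs without a browser or an npm module. The helpers `el`, `text`, `extractStrings`, `applyStrings` and `serialize` are assumptions made for illustration, and upper-casing stands in for the translation service.

```javascript
// Node-type constants matching the DOM's numeric values.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical builders for a stand-in tree; a real parser would hand you
// nodes with a similar shape (nodeType, childNodes, nodeValue).
const el = (tag, ...childNodes) => ({ nodeType: ELEMENT_NODE, tag, childNodes });
const text = (nodeValue) => ({ nodeType: TEXT_NODE, nodeValue });

// Depth-first traversal in document order.
function walk(node, callback) {
  callback(node);
  if (node.nodeType === ELEMENT_NODE) {
    node.childNodes.forEach((child) => walk(child, callback));
  }
}

// Collect the text node values in traversal order.
function extractStrings(root) {
  const strings = [];
  walk(root, (node) => {
    if (node.nodeType === TEXT_NODE) strings.push(node.nodeValue);
  });
  return strings;
}

// Write translated strings back, relying on the identical traversal order.
function applyStrings(root, translated) {
  let i = 0;
  walk(root, (node) => {
    if (node.nodeType === TEXT_NODE) node.nodeValue = translated[i++];
  });
}

// Minimal serializer for the stand-in tree (no attributes, no escaping).
function serialize(node) {
  if (node.nodeType === TEXT_NODE) return node.nodeValue;
  return `<${node.tag}>${node.childNodes.map(serialize).join('')}</${node.tag}>`;
}

// Stand-in for <p>Click <a>here</a></p>.
const tree = el('p', text('Click '), el('a', text('here')));
const json = JSON.stringify(extractStrings(tree));

// Fake "translation": upper-case every string, going through JSON both ways.
const translatedJson = JSON.stringify(JSON.parse(json).map((s) => s.toUpperCase()));
applyStrings(tree, JSON.parse(translatedJson));

const result = serialize(tree);
console.log(result); // <p>CLICK <a>HERE</a></p>
```

The write-back only works because both passes visit the text nodes in the same order, so the tree must not be modified between extraction and application.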
Here's an example. Naturally, I'm using the HTML parser built into the browser here rather than an npm module, and the API of whatever module you're using may be slightly different, but the concept is the same:
var html = '<section>Hello, <div>This is some text which I need to extract.<a class="link">It can be <strong> complicated.</strong></a></div><span>The extracted text should contain the html tag if it has any html tag in the span,p or a tag</span><p>Please see the <span>desired output below.</span></p>Thanks!</section>';
var dom = parseHTML(html);
var strings = [];
walk(dom, function(node) {
  if (node.nodeType === 3) { // text node
    strings.push(node.nodeValue);
  }
});
console.log("strings = ", strings);
var translation = translate(strings);
console.log("translation = ", translation);
var n = 0;
walk(dom, function(node) {
  if (node.nodeType === 3) { // text node
    node.nodeValue = translation[n++];
  }
});
var newHTML = serialize(dom);
document.getElementById("before").innerHTML = html;
document.getElementById("after").innerHTML = newHTML;

function translate(strings) {
  return strings.map(str => str.toUpperCase());
}

function walk(node, callback) {
  var child;
  callback(node);
  switch (node.nodeType) {
    case 1: // Element
      for (child = node.firstChild; child; child = child.nextSibling) {
        walk(child, callback);
      }
  }
}

// Placeholder for module function
function parseHTML(html) {
  var div = document.createElement("div");
  div.innerHTML = html;
  return div;
}

// Placeholder for module function
function serialize(dom) {
  return dom.innerHTML;
}
<strong>Before:</strong>
<div id="before"></div>
<strong>After:</strong>
<div id="after"></div>
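The example above translates each text node on its own. For the question's requirement that text with inline markup stays together (as in "Click <a>here</a>"), one possible variation is sketched below. It assumes the rule is "treat the whole inner markup of `span`, `p` and `a` elements as one unit, and every other text node as its own unit", which is my reading of the question rather than something the example above implements. `UNIT_TAGS`, `walkUnits`, `extractUnits` and `applyUnits` are hypothetical names. The code only touches the standard `nodeType`, `tagName`, `childNodes`, `nodeValue` and `innerHTML` properties, so the demo can use a hand-built object in place of a real element.

```javascript
// Tags whose whole inner markup is one translatable unit (assumption taken
// from the question: span, p and a).
const UNIT_TAGS = new Set(['SPAN', 'P', 'A']);

// Visit translatable units in document order. Each unit is handed over as a
// get/set pair so the same traversal serves extraction and write-back.
function walkUnits(node, visit) {
  if (node.nodeType === 3) { // text node
    visit({ get: () => node.nodeValue, set: (v) => { node.nodeValue = v; } });
  } else if (node.nodeType === 1) { // element
    if (UNIT_TAGS.has(node.tagName)) {
      // Do not descend: the inner markup travels as a single string.
      visit({ get: () => node.innerHTML, set: (v) => { node.innerHTML = v; } });
    } else {
      Array.from(node.childNodes).forEach((child) => walkUnits(child, visit));
    }
  }
}

// Build the { text1: ..., text2: ... } object shown in the question.
function extractUnits(root) {
  const units = {};
  let n = 0;
  walkUnits(root, (unit) => { units['text' + (++n)] = unit.get(); });
  return units;
}

// Apply a translated object with the same keys, in the same order.
function applyUnits(root, translated) {
  let n = 0;
  walkUnits(root, (unit) => unit.set(translated['text' + (++n)]));
}

// Hand-built stand-in for <div>Hi <p>Click <a>here</a></p></div>;
// in a browser you would pass a real element instead.
const root = {
  nodeType: 1,
  tagName: 'DIV',
  childNodes: [
    { nodeType: 3, nodeValue: 'Hi ' },
    { nodeType: 1, tagName: 'P', innerHTML: 'Click <a>here</a>', childNodes: [] },
  ],
};

const units = extractUnits(root);
console.log(units); // text1 is 'Hi ', text2 is 'Click <a>here</a>'

applyUnits(root, { text1: 'HI ', text2: 'CLICK <a>HERE</a>' });
```

The trade-off is that the translation step now receives markup, so it has to return the tags intact. If it cannot be trusted to do that, translating individual text nodes as in the example above is the safer choice.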
¹ Some "regex" libs (or regex features of other languages) are really regex+more features that can help you do something similar, but they're not just regex, and JavaScript's built-in ones don't have those features.