
How to use PHP to annotate a string with HTML (i.e. how to insert HTML tags into a string by offsets while maintaining valid HTML)?

I'm trying to add HTML tags between words inside a string (that is, wrap words in HTML tags, i.e. HTML annotations). The positions where the HTML tags should be written are given by an array of offsets, for example:

// array(start offset, end offset), in characters
// Note: an annotation starts at the start offset and ends before the end offset
$annotationCharactersPositions= array(
   0=>array(0,3),
   1=>array(2,6),
   2=>array(8,10)
);

I want to annotate the following HTML text ($source) with the following HTML tag ($tag), wrapping the characters delimited by the $annotationCharactersPositions array (the offsets are counted without the HTML tags of the source):

$source="<div>This is</div> only a test for stackoverflow";
$tag="<span class='annotation n-$cont'>";

The result should be the following (https://jsfiddle.net/cotg2pn1/):

charPos   =--------------------------------- 01---------------------------- 2-------------------------------------------3------------------------------------------45-------67-----------------------------89-------10,11,12,13......
$output = "<div><span class='annotation n-1'>Th<span class='annotation n-2'>i</span></span><span class='annotation n-2'>s</span><span class='annotation n-2'> i</span>s</div> <span class='annotation n-3'>on</span>ly a test for stackoverflow"

How can I implement the addHTMLtoString function used in the following loop?

    $cont=0;
    $myAnnotationClass="placesOfTheWorld";
    foreach ($annotationCharactersPositions as $position) {
         $tag="<span class='annotation $myAnnotationClass'>";
         $source=addHTMLtoString($source,$tag,$position);
         $cont++;
    }

Note that the HTML tags of the input string must not be counted when resolving the character offsets described in the $annotationCharactersPositions array, and each annotation ($tag) already inserted into the $source text must be taken into account when wrapping the annotations that follow.

The idea of the whole process is that, given an input text (that may or may not contain HTML tags), groups of characters (belonging to one or several words) are annotated: the selected characters, defined by an array that states where each annotation begins and ends, are wrapped in an HTML tag that can vary (a, span, mark) and can carry a variable number of HTML attributes (name, class, id, data-*). In addition, the result must be well-formed, valid HTML, so if an annotation overlaps other annotations or existing elements, the tags must be split and nested accordingly in the output.

Do you know of any library or solution to do this? Maybe the PHP DOMDocument functionality can be useful, but how can the offsets be applied to the DOMDocument functions? Any idea or help is welcome.

Note 1: The input text is raw UTF-8 text with any number (0-n) of embedded HTML entities of any type.

Note 2: The input tag could be any HTML tag with a variable number of attributes (0-n).

Note 3: The start offset is inclusive and the end offset is exclusive. For example, annotation 1 (offsets 2 to 6) starts at the character at offset 2 (including that 'i') and ends before the character at offset 6 (excluding that 's').
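
To make the inclusive/exclusive convention of Note 3 concrete, here is a minimal sketch that only extracts the characters each offset pair selects from the visible text; it does not insert any tags. The helper name sliceByOffsets is hypothetical, the sketch assumes the mbstring extension is available, and it assumes that a decoded HTML entity counts as one character, which the question does not state.

```php
<?php
// Hypothetical helper (not part of the question): returns the characters in
// the half-open range [$start, $end) of a UTF-8 string.
function sliceByOffsets(string $text, int $start, int $end): string
{
    // mb_substr counts characters (code points), not bytes
    return mb_substr($text, $start, $end - $start, 'UTF-8');
}

$source = '<div>This is</div> only a test for stackoverflow';

// Offsets refer to the visible text, so drop the tags and decode entities.
// Assumption: a decoded entity counts as a single character.
$plain = html_entity_decode(strip_tags($source), ENT_QUOTES | ENT_HTML5, 'UTF-8');

foreach ([[0, 3], [2, 6], [8, 10]] as [$start, $end]) {
    echo "[$start, $end) => '" . sliceByOffsets($plain, $start, $end) . "'\n";
}
```

With the sample input, the three pairs select 'Thi', 'is i' and 'on', which are the character runs the expected output wraps.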

Martin asked May 27 '19 15:05



1 Answer

After loading the HTML into a DOM document, you can fetch every text node descendant of an element node with an XPath expression (.//text()) as an iterable list. Iterating that list lets you keep track of the number of characters that precede the current text node. For each text node, check whether its text content (or a part of it) has to be wrapped in the annotation tag. If so, split it and create a fragment with up to 3 nodes (text before, annotation, text after), then replace the text node with the fragment.
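
The offset bookkeeping described above can be looked at in isolation. The following sketch only lists each text node together with the character range it covers and does not modify the document. The helper name textNodeOffsets is hypothetical, and it uses mb_strlen (mbstring extension, counts code points) as an assumption; grapheme_strlen from the intl extension would count grapheme clusters instead.

```php
<?php
// Hypothetical helper for illustration: returns one [text, start, end) entry
// per text node below $container, in document order.
function textNodeOffsets(DOMElement $container): array
{
    $xpath = new DOMXPath($container->ownerDocument);
    $offset = 0;
    $ranges = [];
    foreach ($xpath->evaluate('.//text()', $container) as $textNode) {
        // length in characters (code points), not bytes
        $length = mb_strlen($textNode->textContent, 'UTF-8');
        $ranges[] = [$textNode->textContent, $offset, $offset + $length];
        // the next text node starts where this one ends
        $offset += $length;
    }
    return $ranges;
}

$document = new DOMDocument();
$document->loadHTML(
    '<div><div>This is</div> only a test</div>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

foreach (textNodeOffsets($document->documentElement) as [$text, $start, $end]) {
    echo "[$start, $end) ", json_encode($text), "\n";
}
```

An annotation range can then be compared against each node's range to decide whether the node lies before it, after it, or overlaps it.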

function annotate(
  \DOMElement $container, int $start, int $end, string $name
) {
  $document = $container->ownerDocument;
  $xpath = new \DOMXPath($document);
  $currentOffset = 0;
  // fetch and iterate all text node descendants
  $textNodes = $xpath->evaluate('.//text()', $container);
  foreach ($textNodes as $textNode) {
    $text = $textNode->textContent;
    // the grapheme_* functions require the intl extension
    $nodeLength = grapheme_strlen($text);
    $nextOffset = $currentOffset + $nodeLength;
    if ($currentOffset >= $end) {
      // at or after the (exclusive) end of the annotation: break
      break;
    }
    if ($start >= $nextOffset) {
      // before annotation: continue
      $currentOffset = $nextOffset;
      continue;
    }
    // make string offsets relative to node start
    $relativeStart = $start - $currentOffset;
    $relativeLength = $end - $start;
    if ($relativeStart < 0) {
      // annotation started in a previous node: reduce the length
      // by the characters already wrapped there
      $relativeLength += $relativeStart;
      $relativeStart = 0;
    }
    $relativeEnd = $relativeStart + $relativeLength;
    // create a fragment for the annotation nodes
    $fragment = $document->createDocumentFragment();
    if ($relativeStart > 0) {
      // append string before annotation as text node
      $fragment->appendChild(
        $document->createTextNode(grapheme_substr($text, 0, $relativeStart))
      );
    }
    // create annotation node, configure and append
    $span = $document->createElement('span');
    $span->setAttribute('class', 'annotation '.$name);
    $span->textContent = grapheme_substr($text, $relativeStart, $relativeLength);
    $fragment->appendChild($span);
    if ($relativeEnd < $nodeLength) {
      // append string after annotation as text node
      $fragment->appendChild(
        $document->createTextNode(grapheme_substr($text, $relativeEnd))
      );
    }
    // replace current text node with new fragment
    $textNode->parentNode->replaceChild($fragment, $textNode);
    $currentOffset = $nextOffset;
  }
}

$html = <<<'HTML'
<div><div>This is</div> only a test for stackoverflow</div>
HTML;

$annotations = [
  0 => [0, 3],
  1 => [2, 6],
  2 => [8, 10]
];

$document = new DOMDocument();
$document->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

foreach ($annotations as $index => $offsets) {
  annotate($document->documentElement, $offsets[0], $offsets[1], 'n-'.$index);
}

echo $document->saveHTML();

Output:

<div><div><span class="annotation n-0">Th<span class="annotation n-1">i</span></span><span class="annotation n-1">s i</span>s</div> <span class="annotation n-2">on</span>ly a test for stackoverflow</div>
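
The annotate() function hard-codes a span with a class attribute, while Note 2 of the question asks for any tag with any number of attributes. As a possible extension, the element creation could be moved into a small factory like the sketch below. The function name createAnnotationElement and its parameters are hypothetical and not part of the answer; it only uses DOMDocument::createElement, DOMElement::setAttribute and DOMDocument::createTextNode.

```php
<?php
// Hypothetical factory (an assumption, not part of the answer): builds the
// wrapper element from a tag name and a name => value attribute map.
function createAnnotationElement(
    DOMDocument $document, string $tagName, array $attributes, string $text
): DOMElement {
    $element = $document->createElement($tagName);
    foreach ($attributes as $name => $value) {
        $element->setAttribute($name, $value);
    }
    // a text node keeps special characters as text; they are escaped on output
    $element->appendChild($document->createTextNode($text));
    return $element;
}

$document = new DOMDocument();
$mark = createAnnotationElement(
    $document,
    'mark',
    ['class' => 'annotation place', 'data-id' => '7'],
    'Paris & Rome'
);

// serialize only the new element
echo $document->saveHTML($mark), "\n";
```

Inside annotate(), the returned element would take the place of $span, with the tag name and attribute map passed in as additional parameters.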
ThW answered Sep 20 '22 21:09