
Break sentences in XML using PHP

Tags: php, split, xml

I am new to PHP. I have an XML file and I want to extract its sentences into an array using PHP, breaking each sentence into parts of three words each.
The XML below is from the XML file.

<?xml version="1.0" encoding="utf-8" ?>
<document>
    <content>
        <segment>
            <sentence>
                <word>Hi</word>
                <word>there</word>
                <word>people</word>
                <word>I</word>
                <word>want</word>
                <word>to</word>
                <word>introduce</word>
                <word>you</word>
                <word>to</word>
                <word>my</word>
                <word>world</word>
            </sentence>
            <sentence>
                <word>Hi</word>
                <word>there</word>
                <word>people</word>
                <word>I</word>
                <word>want</word>
                <word>to</word>
                <word>introduce</word>
                <word>you</word>
                <word>to</word>
                <word>my</word>
                <word>world</word>
            </sentence>
        </segment>
    </content>
</document>

The expected output is:

Hi there people
I want to 
introduce you to
my world
Hi there people
I want to 
introduce you to
my world

I have created a function to process the XML transcript.

function loadTranscript($xml) {
    $getfile = file_get_contents($xml);
    $arr = simplexml_load_string($getfile); 
    foreach ($arr->content->segment->sentence as $sent) {
        $count = str_word_count($sent,1);
        $a=array_chunk($count,3);
        foreach ($a as $a){
            echo implode(' ',$a);
            echo PHP_EOL;   
        }
    }
}

But it does not produce the expected output. Is $sent considered an array? I want to break up the sentences at the XML level.

kkbum asked Mar 30 '26 17:03


2 Answers

I'm not sure why everyone is so scared of SimpleXML, and I think it's definitely the right tool for this job.

$sent is not an array, but an object representing the <sentence> element and all its children; it has some array-like properties, but not ones that array_chunk can work with.
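To see this for yourself, here is a minimal sketch that inspects $sent and pulls the words out as plain strings. The shortened XML string and the variable names are made up for illustration; they are not taken from the question's file.

```php
<?php
// Illustrative XML: a shortened stand-in for the question's document.
$xml = '<document><content><segment><sentence>'
     . '<word>Hi</word><word>there</word><word>people</word><word>I</word>'
     . '</sentence></segment></content></document>';

$doc  = simplexml_load_string($xml);
$sent = $doc->content->segment->sentence;

var_dump(is_array($sent));                    // bool(false)
var_dump($sent instanceof SimpleXMLElement);  // bool(true)

// Each <word> child is itself a SimpleXMLElement; cast it to get plain text.
$words = [];
foreach ($sent->word as $word) {
    $words[] = (string) $word;
}

echo count($words), PHP_EOL;  // 4
echo $words[0], PHP_EOL;      // Hi
```

Once the words are in a real array like $words, the ordinary array functions apply to them.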

You can actually use array_chunk, but you need to do three things to make your current code work:

  • cast the <word> children from object to array with (array)$sent->word, which limits the result to the elements called <word> in case there is a mixture (casting the whole element with (array)$sent is not enough here, because it groups the children by tag name and nests the words under a single 'word' key)
  • pass in that array to array_chunk, not $count (which you don't need)
  • don't use the same variable twice with conflicting meanings (foreach( $a as $a ))

So:

$chunks = array_chunk((array)$sent->word, 3);
foreach ($chunks as $a_chunk) {
    echo implode(' ', $a_chunk);
    echo PHP_EOL;   
}
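If you want the lines in an array rather than echoed, as the question asks, one possible sketch follows. The function name chunkTranscript and its $size parameter are my own invention, and it builds the word list with an explicit loop and string casts instead of the (array) cast, which is a more verbose but more explicit route to the same array.

```php
<?php
/**
 * Hypothetical helper: returns each group of words as one string
 * instead of echoing it. Assumes the document/content/segment/sentence/word
 * layout shown in the question.
 */
function chunkTranscript(string $xml, int $size = 3): array
{
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        return []; // the string could not be parsed as XML
    }

    $lines = [];
    foreach ($doc->content->segment->sentence as $sent) {
        // Collect the text of every <word> as a plain string.
        $words = [];
        foreach ($sent->word as $word) {
            $words[] = (string) $word;
        }
        // Split into groups of $size and join each group with spaces.
        foreach (array_chunk($words, $size) as $chunk) {
            $lines[] = implode(' ', $chunk);
        }
    }
    return $lines;
}

// Usage with a shortened, made-up sentence:
$xml = '<document><content><segment><sentence>'
     . '<word>Hi</word><word>there</word><word>people</word>'
     . '<word>I</word><word>want</word>'
     . '</sentence></segment></content></document>';
print_r(chunkTranscript($xml)); // two entries: "Hi there people" and "I want"
```

Because chunking happens per sentence, a group never spans two sentences; the last group of a sentence is simply shorter.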

Alternatively, you can do without array_chunk easily enough by just displaying a newline every third word:

$counter = 0;
foreach ( $words as $word ) {
    $counter++;
    echo $word;
    if ( $counter % 3 == 0 ) {
         echo PHP_EOL;
    } else {
         echo ' ';
    }
}

Then all you need to do is nest that loop inside your existing one:

foreach ($arr->content->segment->sentence as $sent) {
    $counter = 0;
    foreach ( $sent->word as $word ) {
        $counter++;
        echo $word;
        if ( $counter % 3 == 0 ) {
             echo PHP_EOL;
        } else {
             echo ' ';
        }
    }
    echo PHP_EOL;
}

Up to you which you think is cleaner, but it's good to understand both so you can adapt them to future needs.
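One thing to watch with the counter version: it prints a separator after every word, so a sentence whose word count is an exact multiple of three ends with two newlines, and any other sentence ends with a trailing space. If that matters, a possible variation is to emit the separator before each word instead. The helper name formatSentence is hypothetical, and the sketch assumes a plain list of strings with keys 0, 1, 2, and so on.

```php
<?php
/**
 * Hypothetical helper: joins words with spaces, starting a new line
 * before every word that begins a new group of $perLine words.
 * Assumes $words is a list (sequential integer keys starting at 0).
 */
function formatSentence(array $words, int $perLine = 3): string
{
    $out = '';
    foreach ($words as $i => $word) {
        if ($i > 0) {
            // Separator goes BEFORE the word, so nothing trails the last one.
            $out .= ($i % $perLine === 0) ? PHP_EOL : ' ';
        }
        $out .= $word;
    }
    return $out;
}

echo formatSentence(['Hi', 'there', 'people', 'I', 'want', 'to']), PHP_EOL;
// Prints two lines: "Hi there people" and "I want to"
```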

IMSoP answered Apr 01 '26 08:04


Is $xml a string or a file path? I'm assuming it is a string for this answer.

Use DOMDocument to do it:

function loadTranscript($xml) {
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $words = $doc->getElementsByTagName('word');
    $i = 0;
    foreach ($words as $word) {
        if ($i >= 3) {
            echo "\n";//it works on console. For browsers you should use echo "<br>";
            $i = 0;
        }
        echo $word->nodeValue.' ';
        $i++;
    }
}

I used an extra $i counter to avoid nesting one foreach inside another, but you can adapt the code to your needs.

As suggested by @CD001 in the comments, here is a new version that handles more than one <sentence> tag.

function loadTranscript($xml) {
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $sentences = $doc->getElementsByTagName('sentence');
    foreach($sentences as $sentence) {
      $words = $sentence->getElementsByTagName('word');
      $i = 0;
      foreach ($words as $word) {
          if ($i >= 3) {
              echo "\n";
              $i = 0;
          }
          echo $word->nodeValue.' ';
          $i++;
      }
      echo "\n";
    }
}

To read the XML from a file, replace $doc->loadXML($xml); with $doc->load('file/path/string.xml');
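As a rough sketch of the file-based variant, the function below loads from a path, checks whether loading succeeded, and returns the lines as an array instead of echoing them. The name loadTranscriptFile is hypothetical, the temporary file only stands in for a real transcript, and I have swapped the $i counter for array_chunk to keep it short.

```php
<?php
/**
 * Hypothetical file-based variant: returns the three-word lines as an array.
 */
function loadTranscriptFile(string $path): array
{
    $doc = new DOMDocument();
    // load() returns false if the file is missing or is not well-formed XML;
    // @ silences the warning so the caller just gets an empty array.
    if (!@$doc->load($path)) {
        return [];
    }

    $lines = [];
    foreach ($doc->getElementsByTagName('sentence') as $sentence) {
        $words = [];
        foreach ($sentence->getElementsByTagName('word') as $word) {
            $words[] = trim($word->nodeValue);
        }
        foreach (array_chunk($words, 3) as $chunk) {
            $lines[] = implode(' ', $chunk);
        }
    }
    return $lines;
}

// Usage: write a small made-up document to a temporary file, then read it.
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents(
    $path,
    '<document><sentence><word>Hi</word><word>there</word>'
    . '<word>people</word><word>I</word></sentence></document>'
);
print_r(loadTranscriptFile($path)); // two entries: "Hi there people" and "I"
unlink($path);
```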

James answered Apr 01 '26 08:04


