I am new to PHP. I have an XML file and I want to extract its sentences into an array using PHP, breaking each sentence into parts of three words each.
The XML below is from the file.
<?xml version="1.0" encoding="utf-8" ?>
<document>
    <content>
        <segment>
            <sentence>
                <word>Hi</word>
                <word>there</word>
                <word>people</word>
                <word>I</word>
                <word>want</word>
                <word>to</word>
                <word>introduce</word>
                <word>you</word>
                <word>to</word>
                <word>my</word>
                <word>world</word>
            </sentence>
            <sentence>
                <word>Hi</word>
                <word>there</word>
                <word>people</word>
                <word>I</word>
                <word>want</word>
                <word>to</word>
                <word>introduce</word>
                <word>you</word>
                <word>to</word>
                <word>my</word>
                <word>world</word>
            </sentence>
        </segment>
    </content>
</document>
The expected output is:
Hi there people
I want to
introduce you to
my world
Hi there people
I want to
introduce you to
my world
I have created a function to process the XML transcript.
function loadTranscript($xml) {
    $getfile = file_get_contents($xml);
    $arr = simplexml_load_string($getfile);
    foreach ($arr->content->segment->sentence as $sent) {
        $count = str_word_count($sent,1);
        $a=array_chunk($count,3);
        foreach ($a as $a){
            echo implode(' ',$a);
            echo PHP_EOL;
        }
    }
}
But I was unable to produce the output. Is $sent considered an array? I want to break up the sentences at the XML level.
I'm not sure why everyone is so scared of SimpleXML, and I think it's definitely the right tool for this job.
$sent is not an array, but an object representing the <sentence> element and all its children; it has some array-like properties, but not ones that array_chunk can work with.
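As a rough illustration of that point, here is a minimal sketch using a cut-down, hypothetical version of the question's XML. It shows which array-like operations work on the object and one safe way to copy the words into a real array of strings (the variable names are just for this example):

```php
<?php
// Hypothetical sample data: a shortened version of the XML in the question.
$xml = '<document><content><segment><sentence>'
     . '<word>Hi</word><word>there</word><word>people</word><word>I</word>'
     . '</sentence></segment></content></document>';

$doc  = simplexml_load_string($xml);
$sent = $doc->content->segment->sentence;

var_dump(is_array($sent));                   // bool(false)
var_dump($sent instanceof SimpleXMLElement); // bool(true)

// Array-like behaviour that does work: counting, indexing, and looping.
$wordCount = count($sent->word);
$firstWord = (string) $sent->word[0];

// Copy the <word> children into a plain PHP array of strings.
$words = [];
foreach ($sent->word as $word) {
    $words[] = (string) $word;
}
print_r($words);
```

The explicit (string) cast matters: without it each item is still a SimpleXMLElement object rather than plain text.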
You can actually use array_chunk, but you need to do three things to make your current code work:
1. Cast $sent from object to array with (array)$sent (which will give an array of all children of the <sentence> node) or (array)$sent->word (which will limit it to those called <word>, in case there was a mixture).
2. Pass that array to array_chunk, not $count (which you don't need).
3. Use a different variable name in the inner loop (not foreach( $a as $a )).
So:
$chunks = array_chunk((array)$sent->word, 3);
foreach ($chunks as $a_chunk) {
    echo implode(' ', $a_chunk);
    echo PHP_EOL;
}
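Since the question asks for the result as an array, here is a sketch of how the array_chunk approach could be wrapped into a complete function that returns the lines instead of echoing them. The name chunkSentences and the $size parameter are my own inventions for illustration, and I build the word list with an explicit loop rather than an (array) cast purely to make the string conversion obvious:

```php
<?php
/**
 * Split every <sentence> into lines of at most $size words.
 * chunkSentences() is a hypothetical helper, not part of the question's code.
 */
function chunkSentences(string $xml, int $size = 3): array
{
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        throw new InvalidArgumentException('Could not parse XML');
    }

    $lines = [];
    // Same path as in the question: sentences of the first <segment>.
    foreach ($doc->content->segment->sentence as $sent) {
        // Build a plain array of strings from the <word> children.
        $words = [];
        foreach ($sent->word as $word) {
            $words[] = (string) $word;
        }
        foreach (array_chunk($words, $size) as $chunk) {
            $lines[] = implode(' ', $chunk);
        }
    }
    return $lines;
}

// Hypothetical sample input, shorter than the XML in the question.
$xml = <<<XML
<document><content><segment>
<sentence><word>Hi</word><word>there</word><word>people</word><word>I</word><word>want</word></sentence>
<sentence><word>to</word><word>introduce</word></sentence>
</segment></content></document>
XML;

echo implode(PHP_EOL, chunkSentences($xml)), PHP_EOL;
```

Because chunking happens per sentence, a short final chunk never borrows words from the next sentence.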
Alternatively, you can do without array_chunk easily enough by just displaying a newline every third word:
$counter = 0;
foreach ( $words as $word ) {
    $counter++;
    echo $word;
    if ( $counter % 3 == 0 ) {
        echo PHP_EOL;
    } else {
        echo ' ';
    }
}
Then all you need to do is nest that loop inside your existing one:
foreach ($arr->content->segment->sentence as $sent) {
    $counter = 0;
    foreach ( $sent->word as $word ) {
        $counter++;
        echo $word;
        if ( $counter % 3 == 0 ) {
            echo PHP_EOL;
        } else {
            echo ' ';
        }
    }
    echo PHP_EOL;
}
Up to you which you think is cleaner, but it's good to understand both so you can adapt them to future needs.
Is $xml a string or a file path? For this answer I'm assuming it is a string.
You can use DOMDocument to do it:
function loadTranscript($xml) {
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $words = $doc->getElementsByTagName('word');
    $i = 0;
    foreach ($words as $word) {
        if ($i >= 3) {
            echo "\n"; // works on the console; for browsers use echo "<br>"; instead
            $i = 0;
        }
        echo $word->nodeValue.' ';
        $i++;
    }
}
I used an extra $i counter to avoid a foreach inside another foreach, but you can adapt the code to your needs.
As suggested by @CD001 in the comments, the following is a new version that handles more than one <sentence> tag.
function loadTranscript($xml) {
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $sentences = $doc->getElementsByTagName('sentence');
    foreach($sentences as $sentence) {
        $words = $sentence->getElementsByTagName('word');
        $i = 0;
        foreach ($words as $word) {
            if ($i >= 3) {
                echo "\n";
                $i = 0;
            }
            echo $word->nodeValue.' ';
            $i++;
        }
        echo "\n";
    }
}
To read the XML from a file, replace $doc->loadXML($xml); with $doc->load('file/path/string.xml');
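If you would rather get the lines back as an array and read straight from a file, something along these lines should work. This is only a sketch: the function name loadTranscriptFile and the temporary demo file are assumptions for illustration, not part of the original code.

```php
<?php
/**
 * Read a transcript file and return its sentences as lines of up to 3 words.
 * loadTranscriptFile() is a hypothetical name used only for this sketch.
 */
function loadTranscriptFile(string $path): array
{
    $doc = new DOMDocument();
    if (!$doc->load($path)) {
        throw new RuntimeException("Could not load XML from $path");
    }

    $lines = [];
    foreach ($doc->getElementsByTagName('sentence') as $sentence) {
        $current = [];
        foreach ($sentence->getElementsByTagName('word') as $word) {
            $current[] = trim($word->nodeValue);
            if (count($current) === 3) {
                $lines[] = implode(' ', $current);
                $current = [];
            }
        }
        // Leftover words at the end of a sentence form a shorter line.
        if ($current !== []) {
            $lines[] = implode(' ', $current);
        }
    }
    return $lines;
}

// Demo: write a small hypothetical sample to a temporary file, then read it back.
$path = tempnam(sys_get_temp_dir(), 'transcript');
file_put_contents($path, '<document><sentence><word>Hi</word><word>there</word>'
    . '<word>people</word><word>I</word></sentence></document>');
print_r(loadTranscriptFile($path));
unlink($path);
```

Joining with implode also avoids the trailing space that echoing each word followed by ' ' leaves at the end of every line.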