Regular expression to match ">", "<", "&" chars that appear inside XML nodes

Tags:

regex

php

xml

I'm trying to write a regular expression using the PCRE library in PHP.

I need a regex that matches only the &, > and < chars that appear within the text content of an XML node, not the ones that are part of the tag declarations themselves.

Input XML:

<pnode>
  <cnode>This string contains > and < and & chars.</cnode>
</pnode>

The idea is to do a search and replace on these chars and convert them to their XML entity equivalents.

If I were to convert the entire XML to entities, it would look like this:

Entire XML converted to entities

&lt;pnode&gt;
  &lt;cnode&gt;This string contains &gt; and &lt; and &amp; chars.&lt;/cnode&gt;
&lt;/pnode&gt;

I need it to look like this:

Correct XML

<pnode>
  <cnode>This string contains &gt; and &lt; and &amp; chars.</cnode>
</pnode>

I have tried to write a regular expression to match these chars using a look-ahead, but I don't know enough to get it to work. My attempt (currently only trying to match > symbols):

/>(?=[^<]*<)/g

Just to make it clear: the XML I'm trying to fix comes from a 3rd party, and they seem unable to fix it at their end, hence my attempt to fix it myself.

asked Feb 17 '10 16:02

Camsoft

3 Answers

In the end I've opted to use the Tidy library in PHP. The code I used is shown below:

  // Specify configuration
  $config = array(
    'input-xml'        => true,
    'show-warnings'    => false,
    'numeric-entities' => true,
    'output-xml'       => true,
  );

  // Parse the feed as Latin-1, then repair it
  $tidy = new tidy();
  $tidy->parseFile('feed.xml', $config, 'latin1');
  $tidy->cleanRepair();

This works perfectly, correcting all the encoding errors and converting invalid characters to XML entities.

answered Oct 19 '22 15:10

Camsoft

Classic example of garbage in, garbage out. The real solution is to fix the broken XML exporter, but obviously that's outside the scope of your problem. It sounds like you might have to parse the XML manually, run htmlentities() on the contents, then put the XML tags back.
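
A minimal sketch of that split, escape, and reassemble idea, not a definitive fix. The helper name escapeXmlText() and the markup pattern are assumptions for illustration: anything that looks like a declaration, comment, CDATA section, or a tag whose name directly follows the "<" is kept as markup, and everything in between is treated as text. htmlspecialchars() is used here rather than htmlentities() so that only &, < and > are touched.

```php
<?php
// Hypothetical helper: escape &, < and > in text content only.
function escapeXmlText(string $xml): string
{
    // Assumed definition of "markup"; one outer capture group so that
    // preg_split() can hand the matched markup back to us.
    $markup = '~('
        . '<\?.*?\?>'                              // XML declaration / processing instruction
        . '|<!--.*?-->'                            // comment
        . '|<!\[CDATA\[.*?\]\]>'                   // CDATA section
        . '|</?[A-Za-z_][\w.:-]*(?:\s[^<>]*)?/?>'  // start, end, or empty-element tag
        . ')~s';

    // With PREG_SPLIT_DELIM_CAPTURE the result alternates:
    // even indexes hold text, odd indexes hold the captured markup.
    $parts = preg_split($markup, $xml, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Only ASCII characters are replaced, so a single-byte charset is
            // safe for Latin-1 and UTF-8 input alike. The final "false"
            // (double_encode) leaves existing entities such as &amp; alone.
            $parts[$i] = htmlspecialchars($part, ENT_NOQUOTES, 'ISO-8859-1', false);
        }
    }

    return implode('', $parts);
}

$broken = "<pnode>\n  <cnode>This string contains > and < and & chars.</cnode>\n</pnode>";
echo escapeXmlText($broken), "\n";
```

For the sample input this prints the "Correct XML" from the question. It is still guesswork: text such as "<b and c>" looks like a tag with attributes and would be left unescaped, and a DOCTYPE is not recognised as markup.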

answered Oct 19 '22 15:10

TravisO

I'm reasonably certain it's simply not possible. You need something that keeps track of nesting, and there's no way to get a regular expression to track nesting. Your choices are to fix the text first (when you probably can use an RE) or use something that's at least vaguely like an XML parser, specifically to the extent of keeping track of how the tags are nested.

There's a reason XML demands that these characters be escaped though -- without that, you can only guess about whether something is really a tag or not. For example, given something like:

    <tag>Text containing < and > characters</tag>

you and I can probably guess that the result should be: ...containing &lt; and &gt;... but I'm pretty sure the XML specification allows the extra whitespace, so officially "< and >" should be treated as a tag. You could, I suppose, assume that anything that looks like an un-matched tag really isn't intended to be a tag, but that's going to take some work too.
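
A rough sketch of that last heuristic, under loudly stated assumptions: the helper name escapeUnmatchedTags() is hypothetical, a token only counts as tag-like if a name directly follows the "<" (so "< and >" is never a candidate here), and a tag-like token is kept as markup only when a start tag and an end tag with the same name pair up. A small stack does the nesting bookkeeping that a single regular expression cannot.

```php
<?php
// Hypothetical heuristic: tag-like tokens without a matching partner are
// treated as text and escaped along with the rest of the text content.
function escapeUnmatchedTags(string $xml): string
{
    $tagPattern = '~(</?[A-Za-z_][\w.:-]*(?:\s[^<>]*)?/?>)~';
    // Even indexes hold text, odd indexes hold tag-like tokens.
    $parts = preg_split($tagPattern, $xml, -1, PREG_SPLIT_DELIM_CAPTURE);

    $isMarkup = [];  // index in $parts => true once the token is accepted as a tag
    $open = [];      // stack of [name, index] for start tags awaiting an end tag

    for ($i = 1; $i < count($parts); $i += 2) {
        preg_match('~^<(/?)([A-Za-z_][\w.:-]*)~', $parts[$i], $m);
        $name = $m[2];

        if (substr($parts[$i], -2) === '/>') {
            $isMarkup[$i] = true;            // empty-element tag needs no partner
        } elseif ($m[1] === '') {
            $open[] = [$name, $i];           // start tag: wait for its end tag
        } else {
            // End tag: look for the nearest open start tag with the same name.
            for ($j = count($open) - 1; $j >= 0; $j--) {
                if ($open[$j][0] === $name) {
                    $isMarkup[$open[$j][1]] = true;
                    $isMarkup[$i] = true;
                    array_splice($open, $j); // drop it and anything opened after it
                    break;
                }
            }
        }
    }

    foreach ($parts as $i => $part) {
        if (empty($isMarkup[$i])) {
            $parts[$i] = htmlspecialchars($part, ENT_NOQUOTES, 'ISO-8859-1', false);
        }
    }

    return implode('', $parts);
}

echo escapeUnmatchedTags('<tag>Use <b and c> here</tag>'), "\n";
// prints: <tag>Use &lt;b and c&gt; here</tag>
```

The trade-off is the one described above: a genuine element that the exporter forgot to close gets escaped as text, and stray text that happens to form a matched pair still passes as markup.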

answered Oct 19 '22 16:10

Jerry Coffin
