
Regex - Convert HTML to valid XML tag [duplicate]

Tags: regex, php

I need help writing a regex-based function that converts an HTML string to a valid XML tag name. It takes a string and does the following:

  • If a letter or underscore occurs in the string, it is kept.
  • Any other character is removed from the output string.
  • A run of other characters between words or letters is replaced with a single underscore.
Examples:
Input: Date Created
Output: Date_Created

Input: Date<br/>Created
Output: Date_Created

Input: Date\nCreated
Output: Date_Created

Input: Date    1 2 3 Created
Output: Date_Created

Basically, the function should convert the HTML string to a valid XML tag name.

Jake asked Jun 03 '12 04:06


2 Answers

A bit of regex and a couple of standard functions:

function mystrip($s)
{
        // add spaces around angle brackets to separate tag-like parts
        // e.g. "<br />" becomes " <br /> "
        // then let strip_tags take care of removing html tags
        $s = strip_tags(str_replace(array('<', '>'), array(' <', '> '), $s));

        // any sequence of characters that are not alphabet or underscore
        // gets replaced by a single underscore
        return preg_replace('/[^a-z_]+/i', '_', $s);
}
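
As a quick check, the function above can be run against the four inputs from the question. The sketch below is not part of the original answer. `mystrip_trimmed` is a hypothetical name, and its extra trimming step is only an assumption about how one might satisfy the question's second rule, where stray characters at either end are dropped rather than turned into underscores.

```php
<?php
// Copy of mystrip() from the answer, repeated so this sketch runs on its own.
function mystrip($s)
{
    $s = strip_tags(str_replace(array('<', '>'), array(' <', '> '), $s));
    return preg_replace('/[^a-z_]+/i', '_', $s);
}

// Hypothetical variant (an assumption, not from the answer): remove runs of
// non-letter, non-underscore characters at either end before collapsing
// the remaining runs into single underscores.
function mystrip_trimmed($s)
{
    $s = strip_tags(str_replace(array('<', '>'), array(' <', '> '), $s));
    $s = preg_replace('/^[^a-z_]+|[^a-z_]+$/i', '', $s);
    return preg_replace('/[^a-z_]+/i', '_', $s);
}

// The four inputs from the question; "\n" here is a real line break.
$inputs = array("Date Created", "Date<br/>Created", "Date\nCreated", "Date    1 2 3 Created");
foreach ($inputs as $input) {
    echo mystrip($input), "\n";              // Date_Created each time
}

echo mystrip(" Date Created "), "\n";         // _Date_Created_
echo mystrip_trimmed(" Date Created "), "\n"; // Date_Created
```

Whether the leading and trailing underscores matter depends on the input. If the strings never start or end with markup, spaces, or digits, the original `mystrip` is enough.
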
Ja͢ck answered Sep 18 '22 10:09


Try this

$result = preg_replace('/([\d\s]|<[^<>]+>)/', '_', $subject);

Explanation

(                  # Match the regular expression below and capture its match into backreference number 1
                   # Match either the regular expression below (attempting the next alternative only if this one fails)
   [\d\s]          # Match a single character present in the list below
                   #    A single digit 0..9
                   #    A whitespace character (spaces, tabs, and line breaks)
 |                 # Or match regular expression number 2 below (the entire group fails if this one fails to match)
   <               # Match the character “<” literally
   [^<>]           # Match a single character NOT present in the list “<>”
      +            #    Between one and unlimited times, as many times as possible, giving back as needed (greedy)
   >               # Match the character “>” literally
)
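
Note that the pattern in this answer replaces each digit, whitespace character, or tag on its own, so an input such as the question's fourth example produces several underscores in a row. The sketch below compares it with an assumed variant that is not from the answer: a non-capturing group with a `+` quantifier, so that a whole run collapses into one underscore. The variable names `$perMatch` and `$perRun` are illustrative only.

```php
<?php
// Pattern from the answer: every digit, whitespace character, or tag
// is replaced separately.
$perMatch = '/([\d\s]|<[^<>]+>)/';

// Assumed variant (not from the answer): repeat the group so that a
// whole run of such matches is replaced by a single underscore.
$perRun = '/(?:[\d\s]|<[^<>]+>)+/';

$subject = "Date    1 2 3 Created";

echo preg_replace($perMatch, '_', $subject), "\n";          // "Date", a run of underscores, "Created"
echo preg_replace($perRun, '_', $subject), "\n";            // Date_Created
echo preg_replace($perRun, '_', "Date<br/>Created"), "\n";  // Date_Created
```

Neither pattern touches punctuation or other symbols outside tags, so both cover less than the "anything that is not a letter or underscore" rule in the question.
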
Cylian answered Sep 19 '22 10:09