
RegExp not working on read HTML file

First of all, I know how most RegExp questions go, and this is not one of those "please write my code" questions.

My confusion lies in the fact that my RegExp works on regexr, and in Chrome's dev tools when run against document.body.textContent, but not on an HTML file after I have read it in io.js.

io.js is version 1.5.1, running on Windows 8.

Why would it work in both places listed, but not in io.js? Is there something io.js does when reading files that I am not taking into account?

My RegExp should match "@{each ___->___} text and line breaks @{/each}", as it does in the link below, but instead it returns null.

Here is what I'm trying to use: http://regexr.com/3aldk

RegExp:

/@\{each ([a-zA-Z0-9->.]*)\}([\s\S]*)@\{\/each}/g

JS (Example):

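var fs = require('fs');
var myRegExp = /@\{each ([a-zA-Z0-9->.]*)\}([\s\S]*)@\{\/each}/g; // the RegExp shown above

fs.readFile('view.html', {encoding:'utf8'}, function(error, html) {
    console.log(html.match(myRegExp)); // null
});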

HTML:

<!doctype html>
<html>
    <head>
        <title>@{title}</title>
    </head>
    <body>
        <h1>@{foo.bar}</h1>
        <p>
            Lorem ipsum dolor sit amet, @{foo.baz.hoo}
        </p>
        @{each people->person}
            <div>
                <b>@{person.name}:</b> @{person.age}
            </div>
        @{/each}
    </body>
</html>

Am I missing something obvious, like a character that is present on the back end but not once the file is served?

ndugger asked Mar 22 '15 16:03



1 Answer

The issue here lies in the gap between the specification and its implementations.

The ECMAScript 5.1 specification states that:

A - character can be treated literally or it can denote a range. It is treated literally if it is the first or last character of ClassRanges, the beginning or end limit of a range specification, or immediately follows a range specification.

Regular-Expressions.info notes that:

Hyphens at other positions in character classes where they can't form a range may be interpreted as literals or as errors. Regex flavors are quite inconsistent about this.

Conclusions:

The safe ways to include a hyphen (minus sign) in a character class are:

  • escaping it (e.g. [a-zA-Z0-9\->.])
  • placing it as the first character in the class (e.g. [-.>a-zA-Z0-9])
    • exception: in a negated class it goes second, right after ^ (e.g. [^-.>a-zA-Z0-9])
  • placing it last in the class (e.g. [a-zA-Z0-9.>-])

General coding guidelines suggest placing your ranges first and ending the character class with the hyphen. This avoids ambiguity and helps readability.
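As a quick check of the placements listed above, the following sketch tests each variant against the identifier from the question's template. The variable names and the sample strings are illustrative assumptions, not part of the original code.

```javascript
// Illustrative names; each class keeps the hyphen unambiguous.
const escaped = /^[a-zA-Z0-9\->.]*$/; // hyphen escaped
const first = /^[-.>a-zA-Z0-9]*$/;    // hyphen placed first
const last = /^[a-zA-Z0-9.>-]*$/;     // hyphen placed last

const sample = "people->person";
const results = [escaped, first, last].map((re) => re.test(sample));
console.log(results); // each entry is true

// Negated class: the hyphen sits right after ^ and is still a literal.
const negated = /[^-.>a-zA-Z0-9]/;
const sampleHasOther = negated.test(sample);     // false: every character is in the set
const spaceIsOther = negated.test("two words");  // true: the space is outside the set
console.log(sampleHasOther, spaceIsOther);
```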


Summing it up, your RegEx should become:

/@\{each ([a-zA-Z0-9>.-]*)\}([\s\S]*)@\{\/each}/g
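A minimal sketch of the corrected pattern applied to a fragment of the question's template follows. The inline string stands in for the file contents, the \r\n line endings are an assumption about a file saved on Windows, and the g flag is dropped here so that exec returns the capture groups (with g, String.prototype.match returns only the full matches).

```javascript
// Stand-in for the contents of view.html; line endings are an assumption.
const template = [
  "<body>",
  "    @{each people->person}",
  "        <div>",
  "            <b>@{person.name}:</b> @{person.age}",
  "        </div>",
  "    @{/each}",
  "</body>",
].join("\r\n");

// Same pattern as above, without the g flag.
const eachBlock = /@\{each ([a-zA-Z0-9>.-]*)\}([\s\S]*)@\{\/each}/;

const found = eachBlock.exec(template);
const binding = found[1]; // text between "@{each " and "}"
const body = found[2];    // everything up to "@{/each}"

console.log(binding); // people->person
console.log(body.trim());
```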

As an additional tip, you could also rewrite [\s\S] (any whitespace or non-whitespace character) as [^] (not nothing), which would leave you with the following RegEx:

/@\{each ([a-zA-Z0-9>.-]*)\}([^]*)@\{\/each}/g

JavaScript ... treats [^] as a negated empty character class that matches any single character. - source
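To illustrate the tip, here is a small sketch comparing [^], [\s\S] and the plain dot on a string that contains a line break. The names and the sample text are assumptions made for the example, and the [^] behavior shown is specific to JavaScript.

```javascript
// Illustrative sample with an embedded line break.
const text = "line one\nline two";

const withEmptyNegation = /^[^]*$/.test(text); // [^] matches any character, newline included
const withSpaceClasses = /^[\s\S]*$/.test(text); // same effect, more portable spelling
const withDot = /^.*$/.test(text);               // without the s flag, the dot stops at "\n"

console.log(withEmptyNegation); // true
console.log(withSpaceClasses);  // true
console.log(withDot);           // false
```

If the pattern might ever be reused outside JavaScript, [\s\S] is probably the safer spelling of the two.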

CSᵠ answered Oct 16 '22 08:10