
Parsing CSS in C#: extracting all URLs

I need to get all URLs (url() expressions) from CSS files. For example:

b { background: url(img0) }
b { background: url("img1") }
b { background: url('img2') }
b { background: url( img3 ) }
b { background: url( "img4" ) }
b { background: url( 'img5' ) }
b { background: url (img6) }
b { background: url ("img7") }
b { background: url ('img8') }
{ background: url('noimg0) }
{ background: url(noimg1') }
/*b { background: url(noimg2) }*/
b { color: url(noimg3) }
b { content: 'url(noimg4)' }
@media screen and (max-width: 1280px) { b { background: url(img9) } }
b { background: url(img10) }

I need to get all img* URLs, but not noimg* URLs (invalid syntax or invalid property or inside comments).

I've tried using good old regular expressions. After some trial and error I got this:

private static IEnumerable<string> ParseUrlsRegex (string source)
{
    var reUrls = new Regex(@"(?nx)
        url \s* \( \s*
            (
                (?! ['""] )
                (?<Url> [^\)]+ )
                (?<! ['""] )
                |
                (?<Quote> ['""] )
                (?<Url> .+? )
                \k<Quote>
            )
        \s* \)");
    return reUrls.Matches(source)
        .Cast<Match>()
        .Select(match => match.Groups["Url"].Value);
}

That's one crazy regex, but it still doesn't work: it matches three invalid URLs (namely noimg2, noimg3 and noimg4). Furthermore, everyone will say that using a regex to parse a complex grammar is wrong.

Let's try another approach. According to this question, the only viable option is ExCSS (others are either too simple or outdated). With ExCSS I got this:

    private static IEnumerable<string> ParseUrlsExCss (string source)
    {
        var parser = new StylesheetParser();
        parser.Parse(source);
        return parser.Stylesheet.RuleSets
            .SelectMany(i => i.Declarations)
            .SelectMany(i => i.Expression.Terms)
            .Where(i => i.Type == TermType.Url)
            .Select(i => i.Value);
    }

Unlike the regex solution, this one doesn't list invalid URLs. But it doesn't list some valid ones either, namely img9 and img10. It looks like this is a known issue with some CSS syntax, and it can't be fixed without rewriting the whole library from scratch. The ANTLR rewrite seems to be abandoned.

Question: How can I extract all URLs from CSS files? (I need to parse any CSS file, not only the one provided as an example above. Please don't check for "noimg" or assume one-line declarations.)

N.B. This is not a "tool recommendation" question, as any solution will be fine, be it a piece of code, a fix to one of the above solutions, a library or anything else; and I've clearly defined the function I need.

Athari asked Aug 15 '13 21:08



2 Answers

Finally got Alba.CsCss, my port of CSS parser from Mozilla Firefox, working.

First and foremost, the question contains two errors:

  1. url (img) syntax is incorrect, because a space is not allowed between url and ( in the CSS grammar. Therefore, "img6", "img7" and "img8" should not be returned as URLs.

  2. An unclosed quote in the url function (url('img)) is a serious syntax error; web browsers, including Firefox, do not seem to recover from it and simply skip the rest of the CSS file. Therefore, the parser can only be expected to return "img9" and "img10" if the two problematic lines are removed.

With CsCss, there are two solutions.

The first solution is to rely just on the tokenizer CssScanner.

List<string> uris = new CssLoader().GetUris(source).ToList();

This will return all "img" URLs (except those ruled out by error #1 above), but it will also include "noimg3", because property names are not checked.
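To illustrate what working at the token level means, here is a minimal hand-written sketch. It is not CssScanner and not part of Alba.CsCss: the class `UrlTokenSketch` and all of its methods are hypothetical names. It assumes no backslash escapes, treats a quote left open at the end of a line as a broken token, and uses deliberately simplified recovery rules that are not Firefox's.

```csharp
using System;
using System.Collections.Generic;

static class UrlTokenSketch
{
    // Hypothetical helper: collects the argument of every well-formed url(...) token.
    public static List<string> ScanUrls(string css)
    {
        var urls = new List<string>();
        int i = 0;
        while (i < css.Length)
        {
            char c = css[i];
            if (c == '/' && i + 1 < css.Length && css[i + 1] == '*')
            {
                // Comment: skip to the closing "*/" (or to the end of the input).
                int close = css.IndexOf("*/", i + 2, StringComparison.Ordinal);
                i = close < 0 ? css.Length : close + 2;
            }
            else if (c == '"' || c == '\'')
            {
                // String literal: skip it, so that url(...) inside quotes is ignored.
                int close = ClosingQuote(css, i);
                i = close < 0 ? EndOfLine(css, i) : close + 1;
            }
            else if (IsUrlStart(css, i))
                i = ReadUrl(css, i + 4, urls);
            else
                i++;
        }
        return urls;
    }

    // True if "url(" starts at i with no space before '(' and no identifier character before 'u'.
    static bool IsUrlStart(string css, int i)
    {
        bool identBefore = i > 0 &&
            (char.IsLetterOrDigit(css[i - 1]) || css[i - 1] == '-' || css[i - 1] == '_');
        return !identBefore &&
            string.Compare(css, i, "url(", 0, 4, StringComparison.OrdinalIgnoreCase) == 0;
    }

    // Reads the part after "url(", adds the URL if the token is well formed,
    // and returns the position at which scanning should continue.
    static int ReadUrl(string css, int pos, List<string> urls)
    {
        pos = SkipSpaces(css, pos);
        if (pos >= css.Length) return pos;
        string value;
        if (css[pos] == '"' || css[pos] == '\'')
        {
            int close = ClosingQuote(css, pos);
            if (close < 0) return EndOfLine(css, pos); // unclosed quote: drop the rest of the line
            value = css.Substring(pos + 1, close - pos - 1);
            pos = close + 1;
        }
        else
        {
            int start = pos;
            while (pos < css.Length && !char.IsWhiteSpace(css[pos]) && "()'\"".IndexOf(css[pos]) < 0)
                pos++;
            value = css.Substring(start, pos - start);
        }
        pos = SkipSpaces(css, pos);
        if (pos < css.Length && css[pos] == ')')
        {
            urls.Add(value);
            return pos + 1;
        }
        // Malformed token: skip past the next ')' without reporting anything.
        int paren = css.IndexOf(')', pos);
        return paren < 0 ? css.Length : paren + 1;
    }

    // Index of the quote closing the string opened at 'open', or -1 if the line ends first.
    static int ClosingQuote(string css, int open)
    {
        for (int j = open + 1; j < css.Length && css[j] != '\n' && css[j] != '\r'; j++)
            if (css[j] == css[open]) return j;
        return -1;
    }

    static int EndOfLine(string css, int from)
    {
        int nl = css.IndexOf('\n', from);
        return nl < 0 ? css.Length : nl + 1;
    }

    static int SkipSpaces(string css, int pos)
    {
        while (pos < css.Length && char.IsWhiteSpace(css[pos])) pos++;
        return pos;
    }
}
```

On the sample from the question, `UrlTokenSketch.ScanUrls(source)` yields img0 to img5, noimg3, img9 and img10. Like the tokenizer-only approach, it reports noimg3 because it never looks at property names.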

The second solution is to properly parse the CSS file. This will most closely mimic the behavior of browsers (including stopping parsing after an unclosed quote).

var css = new CssLoader().ParseSheet(source, SheetUri, BaseUri);
List<string> uris = css.AllStyleRules
    .SelectMany(styleRule => styleRule.Declaration.AllData)
    .SelectMany(prop => prop.Value.Unit == CssUnit.List
        ? prop.Value.List : new[] { prop.Value })
    .Where(value => value.Unit == CssUnit.Url)
    .Select(value => value.OriginalUri)
    .ToList();

If the two problematic lines are removed, this will return all correct "img" URLs.

(The LINQ query is complex because the background-image property in CSS3 can contain a list of URLs.)

Athari answered Oct 01 '22 19:10



RegEx is a very powerful tool. But when a bit more flexibility is needed, I prefer to just write a little code.

So for a non-RegEx solution, I came up with the following. Note that a bit more work would be needed to make this code more generic to handle any CSS file. For that, I would also use my text parsing helper class.

IEnumerable<string> GetUrls(string css)
{
    char[] trimChars = new char[] { '\'', '"', ' ', '\t', };

    // Split on both Windows and Unix line endings, so the result does not depend on the platform
    foreach (var line in css.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries))
    {
        // Extract portion within curly braces (this version assumes all on one line)
        int start = line.IndexOf('{');
        int end = line.IndexOf('}', start + 1);
        if (start < 0 || end < 0)
            continue;
        start++; end--; // Remove braces

        // Get value portion
        start = line.IndexOf(':', start);
        if (start < 0)
            continue;

        // Extract value and trim whitespace and quotes
        string content = line.Substring(start + 1, end - start).Trim(trimChars);

        // Extract URL from url() value
        if (!content.StartsWith("url", StringComparison.InvariantCultureIgnoreCase))
            continue;
        start = content.IndexOf('(');
        end = content.IndexOf(')', start + 1);
        if (start < 0 || end < 0)
            continue;
        start++;
        content = content.Substring(start, end - start).Trim(trimChars);

        if (!content.StartsWith("noimg", StringComparison.InvariantCultureIgnoreCase))
            yield return content;
    }
}

UPDATE:

What you appear to be asking seems beyond the scope of a simple how-to question for Stack Overflow. I do not believe you will get satisfactory results using regular expressions. You will need some code to parse your CSS and handle all the special cases that come with it.

Since I've written a lot of parsing code and had a bit of time, I decided to play with this a bit. I wrote a simple CSS parser and wrote an article about it. You can read the article and download the code (for free) at A Simple CSS Parser.

My code parses a block of CSS and stores the information in data structures, keeping each property/value pair of each rule separately. However, a bit more work is still needed to extract the URLs from the property values.

The code I originally posted will give you a start on how you might approach this. But if you want a truly robust solution, some more sophisticated code will be needed. You might want to take a look at my code to parse the CSS. It uses techniques, such as parsing a quoted value, that could easily handle values like url('img(1)').
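As a rough idea of what parsing a quoted value could look like, here is a minimal sketch. It is not taken from the article's code: `QuotedValueSketch`, `ReadQuoted` and `UrlOf` are hypothetical names. It assumes the property value consists of a single url(...) term, and it handles backslash escapes only by keeping the escaped character literally (CSS hex escapes are not decoded).

```csharp
using System;
using System.Text;

static class QuotedValueSketch
{
    // Hypothetical helper: reads the quoted string that starts at text[pos] (a ' or ").
    // Returns the unquoted content, or null if the quote is not closed on the same line.
    // On success, pos is moved just past the closing quote.
    public static string ReadQuoted(string text, ref int pos)
    {
        char quote = text[pos];
        var sb = new StringBuilder();
        for (int i = pos + 1; i < text.Length; i++)
        {
            char c = text[i];
            if (c == '\\' && i + 1 < text.Length)
                sb.Append(text[++i]);        // simplified escape: keep the next character as-is
            else if (c == quote)
            {
                pos = i + 1;
                return sb.ToString();
            }
            else if (c == '\n' || c == '\r')
                return null;                 // a line break ends the string unclosed
            else
                sb.Append(c);
        }
        return null;                         // ran out of input before the closing quote
    }

    // Hypothetical helper: the URL of a value made of exactly one url(...) term, else null.
    public static string UrlOf(string value)
    {
        value = value.Trim();
        if (!value.StartsWith("url(", StringComparison.OrdinalIgnoreCase) ||
            !value.EndsWith(")", StringComparison.Ordinal))
            return null;

        string inner = value.Substring(4, value.Length - 5).Trim();
        if (inner.Length == 0) return null;

        if (inner[0] == '"' || inner[0] == '\'')
        {
            // Quoted: parentheses inside the quotes belong to the URL.
            int pos = 0;
            string url = ReadQuoted(inner, ref pos);
            return url != null && pos == inner.Length ? url : null;
        }
        // Unquoted: quotes, parentheses and spaces are not allowed.
        return inner.IndexOfAny(new[] { '\'', '"', '(', ')', ' ' }) < 0 ? inner : null;
    }
}
```

With this sketch, `UrlOf("url('img(1)')")` gives `img(1)`, while `UrlOf("url('noimg0)")` and `UrlOf("url (img6)")` give null. Values holding several terms (a background-image list, for example) would need to be split into terms first.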

I think this is a pretty good start. I could write the remaining code for you as well. But what's the fun in that? :)

Jonathan Wood answered Oct 01 '22 21:10
