
Parse an InputStream for multiple patterns

I am parsing an InputStream for certain patterns in order to extract values from it. For example, the input might contain something like

<span class="filename"><a href="http://example.com/foo">foo</a>

I don't want to use a full-fledged HTML parser, as I am not interested in the document structure but only in some well-defined bits of information. (Only their order is important.)
Currently I am using a very simple approach: I have an object for each pattern that contains a char[] for the opening and the closing 'tag' (in the example, the opening would be <span class="filename"><a href=" and closing " to get the URL) and a position marker. For each character read from the InputStream, I iterate over all patterns and call a match(char) function that returns true once the opening pattern matches. From then on I collect the following chars in a StringBuilder until the now-active pattern's match() returns true again. I then call a function with the ID of the pattern and the string read, to process it further.
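
For readers who want to see the approach described above in concrete form, here is a minimal sketch. It is a reconstruction from the description, not the asker's actual code: the class names LiteralTagScanner and TagPattern, the scan method, and the "id=value" result format are all hypothetical, and the matcher restarts naively on a mismatch instead of using a proper fallback table.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LiteralTagScanner {

    /** One literal opening/closing pair plus a position marker (hypothetical names). */
    static final class TagPattern {
        final String id;
        private final char[] open;
        private final char[] close;
        private char[] target; // what we are currently looking for
        private int pos;       // how many chars of target have matched so far

        /** Assumes open and close are both non-empty. */
        TagPattern(String id, String open, String close) {
            this.id = id;
            this.open = open.toCharArray();
            this.close = close.toCharArray();
            this.target = this.open;
        }

        /** Feeds one char; returns true when the current target has just been completed. */
        boolean match(char c) {
            if (c == target[pos]) {
                pos++;
            } else {
                pos = (c == target[0]) ? 1 : 0; // naive restart on mismatch
            }
            if (pos < target.length) {
                return false;
            }
            pos = 0;
            target = (target == open) ? close : open; // switch between opening and closing
            return true;
        }

        void reset() {
            pos = 0;
            target = open;
        }
    }

    /** Reads the input char by char and returns "id=value" for every completed pattern. */
    static List<String> scan(Reader in, List<TagPattern> patterns) {
        List<String> hits = new ArrayList<>();
        StringBuilder value = new StringBuilder();
        TagPattern active = null;
        try {
            int ch;
            while ((ch = in.read()) != -1) {
                char c = (char) ch;
                if (active == null) {
                    for (TagPattern p : patterns) {
                        if (p.match(c)) {
                            active = p; // opening matched: start collecting
                            value.setLength(0);
                            break;
                        }
                    }
                    if (active != null) {
                        for (TagPattern q : patterns) {
                            if (q != active) q.reset(); // drop partial matches of the others
                        }
                    }
                } else {
                    value.append(c);
                    if (active.match(c)) {
                        // closing matched: cut the closing chars off the collected value
                        value.setLength(value.length() - active.close.length);
                        hits.add(active.id + "=" + value);
                        active = null;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }
}
```

An InputStream can be fed to scan by wrapping it in an InputStreamReader with the appropriate charset. Because every TagPattern compares literal characters only, a variation such as an extra id attribute inside the opening tag defeats it.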
While this works fine in most cases, I wanted to allow regular expressions in the patterns, so that I could also match something like

<span class="filename" id="234217"><a href="http://example.com/foo">foo</a>

At this point I was sure I would be reinventing the wheel, as this has most certainly been done before, and I don't really want to write my own regex parser to begin with. However, I could not find anything that does what I am looking for.
Unfortunately, the Scanner class only matches one pattern, not a list of patterns. What alternatives could I use? It should be lightweight and work on Android.

ben asked Apr 14 '11 18:04



1 Answer

You mean you want to match any <span> element with a given class attribute, irrespective of other attributes it may have? That's easy enough:

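import java.io.File;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
// Inside a method that declares or handles FileNotFoundException:
Scanner sc = new Scanner(new File("test.txt"), "UTF-8");
// Any <span> with class="filename" among its attributes, then an <a>; group 1 is the href value.
Pattern p = Pattern.compile(
    "<span[^>]*class=\"filename\"[^>]*>\\s*<a[^>]*href=\"([^\"]+)\""
);
// A horizon of 0 searches without bound, so the Scanner may buffer a lot of input.
while (sc.findWithinHorizon(p, 0) != null)
{
  MatchResult m = sc.match();
  System.out.println(m.group(1));
}
sc.close();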

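The file "test.txt" contains the text of your question, and the output is as follows (the second line comes from the opening and closing strings quoted in your description, which happen to match the pattern as well):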

http://example.com/foo
and closing
http://example.com/foo
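
Since the question asks about a list of patterns, one possible extension is to join the patterns into a single regex with alternation and give each alternative its own capturing group, so that the group number tells you which pattern fired. The following is only a sketch under that assumption: the class name MultiPatternScan, the scan method, the "group:value" result format, and the second pattern (a span with class="size") are invented for illustration and do not appear in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

class MultiPatternScan {

    // One capturing group per alternative; the group index acts as the pattern ID.
    private static final Pattern COMBINED = Pattern.compile(
        // group 1: href of the link inside a span with class="filename"
        "<span[^>]*class=\"filename\"[^>]*>\\s*<a[^>]*href=\"([^\"]+)\""
        // group 2: text of a span with class="size" (hypothetical second pattern)
        + "|<span[^>]*class=\"size\"[^>]*>([^<]+)<"
    );

    /** Returns "groupIndex:value" for every match, in the order found in the input. */
    static List<String> scan(Readable source) {
        List<String> hits = new ArrayList<>();
        try (Scanner sc = new Scanner(source)) {
            // Horizon 0 means no bound; the Scanner may buffer large parts of the input.
            while (sc.findWithinHorizon(COMBINED, 0) != null) {
                MatchResult m = sc.match();
                for (int g = 1; g <= m.groupCount(); g++) {
                    if (m.group(g) != null) { // only the alternative that matched is non-null
                        hits.add(g + ":" + m.group(g));
                    }
                }
            }
        }
        return hits;
    }
}
```

To read from an InputStream, wrap it in an InputStreamReader (which is a Readable) before passing it to scan. Matches come back in document order, which is the only ordering the question says it cares about. On memory-constrained devices it may be worth passing a positive horizon instead of 0 so the Scanner does not buffer the whole input when no match is found.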
Alan Moore answered Oct 13 '22 20:10