I am parsing an InputStream for certain patterns in order to extract values from it. For example, the input might contain something like
<span class="filename"><a href="http://example.com/foo">foo</a>
I don't want to use a full-fledged HTML parser, as I am not interested in the document structure but only in some well-defined bits of information. (Only their order is important.)
Currently I am using a very simple approach. I have an object for each pattern that contains a char[] for the opening and closing 'tag' (in the example, opening would be <span class="filename"><a href=" and closing " to get the URL) and a position marker. For each character read from the InputStream, I iterate over all patterns and call the match(char) function, which returns true once the opening pattern matches. From then on I collect the following chars in a StringBuilder until match() on the now-active pattern returns true again. I then call a function with the ID of the pattern and the string read, to process it further.
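The approach described above could look roughly like the following. This is only a sketch under assumptions: the names StreamExtractor, Delimiter and extract are hypothetical, the input is taken as a Reader rather than a raw InputStream, results are returned as "id=value" strings instead of being passed to a callback, and the restart on a mismatch is deliberately naive.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class StreamExtractor {

    /** Hypothetical helper: one opening/closing delimiter pair plus its position marker. */
    static final class Delimiter {
        final int id;
        final char[] open;   // assumed non-empty
        final char[] close;  // assumed non-empty
        private int pos;     // number of chars of the current target matched so far

        Delimiter(int id, String open, String close) {
            this.id = id;
            this.open = open.toCharArray();
            this.close = close.toCharArray();
        }

        /** Feeds one char; returns true once the whole target has just been seen. */
        boolean match(char c, char[] target) {
            if (target[pos] == c) {
                pos++;
            } else {
                // Naive restart: this misses targets that overlap themselves,
                // e.g. "aab" is not found in "aaab".
                pos = (target[0] == c) ? 1 : 0;
            }
            if (pos == target.length) {
                pos = 0;
                return true;
            }
            return false;
        }

        void reset() {
            pos = 0;
        }
    }

    /** Returns "id=value" strings in the order the values appear in the input. */
    static List<String> extract(Reader in, List<Delimiter> delimiters) {
        List<String> found = new ArrayList<>();
        StringBuilder value = new StringBuilder();
        Delimiter active = null;
        try {
            int r;
            while ((r = in.read()) != -1) {
                char c = (char) r;
                if (active == null) {
                    // Look for an opening delimiter; the first one to complete wins.
                    for (Delimiter d : delimiters) {
                        if (d.match(c, d.open)) {
                            active = d;
                            break;
                        }
                    }
                    if (active != null) {
                        // Discard partial matches of all delimiters before collecting.
                        for (Delimiter d : delimiters) {
                            d.reset();
                        }
                        value.setLength(0);
                    }
                } else {
                    value.append(c);
                    if (active.match(c, active.close)) {
                        // Drop the closing delimiter from the collected text.
                        value.setLength(value.length() - active.close.length);
                        found.add(active.id + "=" + value);
                        active = null;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }
}
```

An InputStream can be fed to this by wrapping it in an InputStreamReader with the appropriate charset. Because the delimiters are fixed character sequences, a variation such as an extra id attribute inside the <span> tag is not matched.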
While this works fine in most cases, I wanted to include regular expressions in the pattern, so I could also match something like
<span class="filename" id="234217"><a href="http://example.com/foo">foo</a>
At this point I was sure I would be reinventing the wheel, as this has most certainly been done before, and I don't really want to write my own regex parser to begin with. However, I could not find anything that does what I am looking for.
Unfortunately, the Scanner class only matches one pattern, not a list of patterns. What alternatives could I use? It should be lightweight and work on Android.
You mean you want to match any <span> element with a given class attribute, irrespective of other attributes it may have? That's easy enough:
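    import java.io.File;
    import java.util.Scanner;
    import java.util.regex.MatchResult;
    import java.util.regex.Pattern;

    Scanner sc = new Scanner(new File("test.txt"), "UTF-8");
    // Any <span> whose attributes include class="filename", followed by an <a>;
    // group 1 captures the href value.
    Pattern p = Pattern.compile(
        "<span[^>]*class=\"filename\"[^>]*>\\s*<a[^>]*href=\"([^\"]+)\""
    );
    // A horizon of 0 means the search is not bounded.
    while (sc.findWithinHorizon(p, 0) != null)
    {
        MatchResult m = sc.match();
        System.out.println(m.group(1));
    }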
The file "test.txt" contains the text of your question, and the output is:
http://example.com/foo
and closing
http://example.com/foo
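Since the question asks about a list of patterns, one possible way to extend this is to join several expressions with | into a single Pattern and check which capturing group took part in each match. The following is a minimal sketch, not a definitive implementation: the class name MultiPatternScan, the method scan and the second <title> alternative are made up for illustration, and it assumes each alternative contains exactly one capturing group, so that the group number can serve as the pattern ID.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

class MultiPatternScan {

    // One capturing group per alternative: group 1 is the href,
    // group 2 is the (hypothetical) title text.
    static final Pattern COMBINED = Pattern.compile(
        "<span[^>]*class=\"filename\"[^>]*>\\s*<a[^>]*href=\"([^\"]+)\""
        + "|<title>([^<]*)</title>");

    /** Returns "groupNumber=value" strings in the order the matches occur. */
    static List<String> scan(Scanner sc) {
        List<String> found = new ArrayList<>();
        while (sc.findWithinHorizon(COMBINED, 0) != null) {
            MatchResult m = sc.match();
            // Only the group of the alternative that matched is non-null.
            for (int g = 1; g <= m.groupCount(); g++) {
                if (m.group(g) != null) {
                    found.add(g + "=" + m.group(g));
                }
            }
        }
        return found;
    }
}
```

Numbered groups are used here rather than named ones to keep the sketch simple. Note that with a horizon of 0 the Scanner may buffer all of the remaining input while it searches, which is worth keeping in mind for large streams on a memory-constrained device.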