Parse an InputStream for multiple patterns

Tags: java, regex, pattern-matching, inputstream

I am parsing an InputStream for certain patterns to extract values from it. For example, I would have something like:

<span class="filename"><a href="http://example.com/foo">foo</a>

I don't want to use a full-fledged HTML parser, as I am not interested in the document structure but only in some well-defined bits of information. (Only their order is important.)
Currently I am using a very simple approach: I have an object for each pattern that contains a char[] for the opening and closing 'tag' (in the example opening would be <span class="filename"><a href="and closing " to get the URL) and a position marker. For each character read from the InputStream, I iterate over all patterns and call the match(char) function, which returns true once the opening pattern matches. From then on I collect the following chars in a StringBuilder until match() returns true again for the now-active pattern. I then call a function with the ID of the pattern and the string read, to process it further.
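A minimal sketch of the delimiter-matching approach described above may make it easier to follow. This is not the asker's actual code: the names DelimiterScan, Delim and scan are hypothetical, delimiters are assumed to be non-empty, and a Reader is used in place of the raw InputStream (an InputStreamReader would bridge the two).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DelimiterScan {

    /** One literal pattern: an id, an opening and a closing delimiter, and a position marker. */
    static final class Delim {
        final int id;
        final char[] open;
        final char[] close;
        int pos; // number of chars of the current target matched so far

        Delim(int id, String open, String close) {
            this.id = id;
            this.open = open.toCharArray();   // assumed non-empty
            this.close = close.toCharArray(); // assumed non-empty
        }

        /** Feeds one char; returns true when the whole target has just been matched. */
        boolean match(char c, char[] target) {
            if (c == target[pos]) {
                pos++;
            } else {
                // Naive restart: simple, but can miss matches for self-overlapping delimiters.
                pos = (c == target[0]) ? 1 : 0;
            }
            if (pos == target.length) {
                pos = 0;
                return true;
            }
            return false;
        }
    }

    /** Returns "id:value" for every value found between an opening and its closing delimiter. */
    static List<String> scan(Reader in, List<Delim> delims) throws IOException {
        List<String> found = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        Delim active = null;
        int r;
        while ((r = in.read()) != -1) {
            char c = (char) r;
            if (active == null) {
                // Offer the char to every pattern until one completes its opening delimiter.
                for (Delim d : delims) {
                    if (d.match(c, d.open)) {
                        active = d;
                        break;
                    }
                }
                if (active != null) {
                    for (Delim d : delims) d.pos = 0; // discard partial matches
                    buf.setLength(0);
                }
            } else {
                // Collect chars until the active pattern's closing delimiter completes.
                buf.append(c);
                if (active.match(c, active.close)) {
                    buf.setLength(buf.length() - active.close.length);
                    found.add(active.id + ":" + buf);
                    active = null;
                }
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        List<Delim> delims = Arrays.asList(
            new Delim(1, "<span class=\"filename\"><a href=\"", "\""),
            new Delim(2, "<title>", "</title>"));
        String html = "<title>Hi</title>"
            + "<span class=\"filename\"><a href=\"http://example.com/foo\">foo</a>";
        System.out.println(scan(new StringReader(html), delims));
        // prints [2:Hi, 1:http://example.com/foo]
    }
}
```

Because the delimiters are compared literally, an extra attribute such as id="234217" inside the opening tag would prevent a match in this sketch.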
While this works fine in most cases, I wanted to include regular expressions in the pattern, so that I could also match something like:

<span class="filename" id="234217"><a href="http://example.com/foo">foo</a>

At this point I was sure I would reinvent the wheel as this most certainly would have been done before, and I don't really want to write my own regex parser to begin with. However, I could not find anything that would do what I was looking for.
Unfortunately, the Scanner class only matches one pattern, not a list of patterns. What alternatives could I use? It should be lightweight and work on Android.

asked Apr 14 '11 18:04

ben

1 Answer

You mean you want to match any <span> element with a given class attribute, irrespective of other attributes it may have? That's easy enough:

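import java.io.File;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

// Scanner can also wrap an InputStream directly: new Scanner(in, "UTF-8")
Scanner sc = new Scanner(new File("test.txt"), "UTF-8");
// Any <span> with class="filename", followed by an <a>; group 1 captures the href value
Pattern p = Pattern.compile(
    "<span[^>]*class=\"filename\"[^>]*>\\s*<a[^>]*href=\"([^\"]+)\""
);
// A horizon of 0 means the search is not bounded
while (sc.findWithinHorizon(p, 0) != null)
{
  MatchResult m = sc.match();
  System.out.println(m.group(1));
}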

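The file "test.txt" contains the text of your question, and the output is shown below. (The second line comes from the inline fragment in your prose, where href=" is directly followed by the words "and closing".)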

http://example.com/foo
and closing
http://example.com/foo
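For the "list of patterns" part of the question, one possible extension (not part of the original answer) is to join several patterns into a single alternation and check which capturing group took part in each match. The sketch below assumes every pattern has exactly one capturing group; the class name MultiPatternScan, the helper names, and the second pattern for <title> are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

public class MultiPatternScan {

    // Hypothetical pattern list; each entry must contain exactly one capturing group.
    static final String[] PATTERNS = {
        "<span[^>]*class=\"filename\"[^>]*>\\s*<a[^>]*href=\"([^\"]+)\"",
        "<title>([^<]*)</title>"
    };

    /** Joins the patterns as (?:p0)|(?:p1)|... so that group n+1 belongs to pattern n. */
    static Pattern combine(String[] patterns) {
        StringBuilder sb = new StringBuilder();
        for (String p : patterns) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append("(?:").append(p).append(')');
        }
        return Pattern.compile(sb.toString());
    }

    /** Returns "patternIndex:value" for each match, in the order found in the stream. */
    static List<String> scan(InputStream in, String[] patterns) {
        Pattern combined = combine(patterns);
        List<String> hits = new ArrayList<>();
        try (Scanner sc = new Scanner(in, "UTF-8")) {
            while (sc.findWithinHorizon(combined, 0) != null) {
                MatchResult m = sc.match();
                // Only the group of the alternative that matched is non-null.
                for (int g = 1; g <= m.groupCount(); g++) {
                    if (m.group(g) != null) {
                        hits.add((g - 1) + ":" + m.group(g));
                        break;
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String html = "<title>Hi</title>"
            + "<span class=\"filename\" id=\"234217\"><a href=\"http://example.com/foo\">foo</a>";
        InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
        System.out.println(scan(in, PATTERNS));
        // prints [1:Hi, 0:http://example.com/foo]
    }
}
```

With a horizon of 0 the Scanner may buffer the whole input while searching, which could matter for large streams on Android.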

answered Oct 13 '22 20:10

Alan Moore
