Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to use a regex to search backwards effectively?

Tags:

java

regex

I'm searching forward in an array of strings with a regex, like this:

for (int j = line; j < lines.length; j++) {  
    if (lines[j] == null || lines[j].isEmpty()) {
        continue;
    }
    matcher = pattern.matcher(lines[j]);
    if (matcher.find(offset)) {
        offset = matcher.end();
        line = j;
        System.out.println("found \""+matcher.group()+"\" at line "+line+" ["+matcher.start()+","+offset+"]");
        return true;
    }
    offset = 0;
}
return false;

Note that in my implementation above I save the line and offset for continuous searches.

Anyway, now I want to search backwards from that [line,offset].

My question: is there a way to search backwards with a regex efficiently? if not, what could be an alternative?

Clarification: By backwards I mean finding the previous match.
For example, say that I'm searching for "dana" in

"dana nama? dana kama! lama dana kama?" 

and got to the 2nd match. If I do matcher.find() again, I'll search forward and get the 3rd match. But I want to search backwards and get to the 1st match.
the code above should then output something like:

found "dana" at line 0 [0,3] // fwd
found "dana" at line 0 [11,14] // fwd
found "dana" at line 0 [0,3] // bwd
like image 796
Asaf Avatar asked Mar 01 '10 11:03

Asaf


People also ask

How can regex be used for document searching?

A regular expression is a form of advanced searching that looks for specific patterns, as opposed to certain terms and phrases. With RegEx you can use pattern matching to search for particular strings of characters rather than constructing multiple, literal search queries.

Why you should not use regex?

Regex isn't suited to parse HTML because HTML isn't a regular language. Regex probably won't be the tool to reach for when parsing source code. There are better tools to create tokenized outputs. I would avoid parsing a URL's path and query parameters with regex.

Is regex matching fast?

(but is slow in Java, Perl, PHP, Python, Ruby, ...)


3 Answers

Java's regular expression engine cannot search backwards. In fact, the only regex engine that I know that can do that is the one in .NET.

Instead of searching backwards, iterate over all the matches in a loop (searching forward). If the match is prior to the position you want, remember it. If the match is after the position you want, exit from the loop. In pseudo code (my Java is a little rusty):

storedmatch = ""
while matcher.find {
  if matcher.end < offset {
    storedmatch = matcher.group()
  } else {
    return storedmatch
  }
}
like image 172
Jan Goyvaerts Avatar answered Sep 17 '22 18:09

Jan Goyvaerts


The following class search backward and forward (of course).

I used this class in an application where the users can search strings in a long text (like search feature in a Web browser). So it's tested and works well for practical use cases.

It uses an approach similar to what Jan Goyvaerts describes. It selects a block of text before the start position and searches it forwards, returning the last match if there is one. If there is no match if selects a new block of text before the of block and searches that in the same way.

Use it like this:

Search s = new Search("Big long text here to be searched [...]");
s.setPattern("some regexp");

// search backwards or forward as many times as you like,
// the class keeps track where the last match was
MatchResult where = s.searchBackward();
where = s.searchBackward(); // next match
where = s.searchBackward(); // next match

//or search forward
where = s.searchForward();
where = s.searchForward();

And the class:

import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/*
 * Search regular expressions or simple text forward and backward in a CharSequence
 *
 * 
 * To simulate the backward search (that Java class doesn't have) the input data
 * is divided into chunks and each chunk is searched from last to first until a 
 * match is found (inter-chunk matches are returned from last to first too).
 *
 * The search can fail if the pattern/match you look for is longer than the chunk
 * size, but you can set the chunk size to a sensible size depending on the specific
 * application.
 *
 * Also, because the match could span between two adjacent chunks, the chunks are
 * partially overlapping. Again, this overlapping size should be set to a sensible
 * size.
 *
 * A typical application where the user search for some words in a document will
 * work perfectly fine with default values. The matches are expected to be between
 * 10-15 chars, so any chunk size and overlapping size bigger than this expected 
 * length will be fine.
 *
 * */
public class Search {

    private int BACKWARD_BLOCK_SIZE = 200;
    private int BACKWARD_OVERLAPPING = 20;

    private Matcher myFwdMatcher;
    private Matcher myBkwMatcher;
    private String  mySearchPattern;
    private int myCurrOffset;
    private boolean myRegexp;
    private CharSequence mySearchData;

    public Search(CharSequence searchData) {
        mySearchData = searchData;
        mySearchPattern = "";
        myCurrOffset = 0;
        myRegexp = true;
        clear();
    }

    public void clear() {
        myFwdMatcher = null;
        myBkwMatcher = null;
    }

    public String getPattern() {
        return mySearchPattern;
    }

    public void setPattern(String toSearch) {
        if ( !mySearchPattern.equals(toSearch) ) {
            mySearchPattern = toSearch;
            clear();
        }
    }

    public CharSequence getText() {
        return mySearchData;
    }

    public void setText(CharSequence searchData) {
        mySearchData = searchData;
        clear();
    }

    public void setSearchOffset(int startOffset) {
        if (myCurrOffset != startOffset) {
            myCurrOffset = startOffset;
            clear();
        }
    }

    public boolean isRegexp() {
        return myRegexp;
    }

    public void setRegexp(boolean regexp) {
        if (myRegexp != regexp) {
            myRegexp = regexp;
            clear();
        }
    }

    public MatchResult searchForward() {

        if (mySearchData != null) {

            boolean found;

            if (myFwdMatcher == null)
            {
                // if it's a new search, start from beginning
                String searchPattern = myRegexp ? mySearchPattern : Pattern.quote(mySearchPattern);
                myFwdMatcher = Pattern.compile(searchPattern, Pattern.CASE_INSENSITIVE).matcher(mySearchData);
                try {
                    found = myFwdMatcher.find(myCurrOffset);
                } catch (IndexOutOfBoundsException e) {
                    found = false;
                }
            }
            else
            {
                // continue searching
                found = myFwdMatcher.hitEnd() ? false : myFwdMatcher.find();
            }

            if (found) {
                MatchResult result = myFwdMatcher.toMatchResult();
                return onMatchResult(result);
            }
        }
        return onMatchResult(null);
    }

    public MatchResult searchBackward() {

        if (mySearchData != null) {

            myFwdMatcher = null;

            if (myBkwMatcher == null)
            {
                // if it's a new search, create a new matcher
                String searchPattern = myRegexp ? mySearchPattern : Pattern.quote(mySearchPattern);
                myBkwMatcher = Pattern.compile(searchPattern, Pattern.CASE_INSENSITIVE).matcher(mySearchData);
            }

            MatchResult result = null;
            boolean startOfInput = false;
            int start = myCurrOffset;
            int end = start;

            while (result == null && !startOfInput)
            {
                start -= BACKWARD_BLOCK_SIZE;
                if (start < 0) {
                    start = 0;
                    startOfInput = true;
                }
                try {
                    myBkwMatcher.region(start, end);
                } catch (IndexOutOfBoundsException e) {
                    break;
                }
                while ( myBkwMatcher.find() ) {
                    result = myBkwMatcher.toMatchResult();
                }
                end = start + BACKWARD_OVERLAPPING; // depending on the size of the pattern this could not be enough
                                                    //but how can you know the size of a regexp match beforehand?
            }

            return onMatchResult(result);
        }
        return onMatchResult(null);
    }

    private MatchResult onMatchResult(MatchResult result) {
        if (result != null) {
            myCurrOffset = result.start();    
        }
        return result;
    }
}

And if you like to test the class here is an usage example:

enter image description here

import java.awt.*;
import java.awt.event.*;
import javax.swing.*;
import javax.swing.event.*;
import java.util.regex.MatchResult;
import javax.swing.text.DefaultHighlighter;
import javax.swing.text.BadLocationException;

public class SearchTest extends JPanel implements ActionListener {

    protected JScrollPane scrollPane;
    protected JTextArea textArea;
    protected boolean docChanged = true;
    protected Search searcher;

    public SearchTest() {
        super(new BorderLayout());

        searcher = new Search("");

        JButton backButton = new JButton("Search backward");
        JButton fwdButton  = new JButton("Search forward");

        JPanel buttonPanel = new JPanel(new BorderLayout());
        buttonPanel.add(fwdButton, BorderLayout.EAST);
        buttonPanel.add(backButton, BorderLayout.WEST); 

        textArea = new JTextArea("Big long text here to be searched...", 20, 40);
        textArea.setEditable(true);
        scrollPane = new JScrollPane(textArea);

        final JTextField textField = new JTextField(40);

        //Add Components to this panel.
        add(buttonPanel, BorderLayout.NORTH);
        add(scrollPane, BorderLayout.CENTER);
        add(textField, BorderLayout.SOUTH);

        //Add actions
        backButton.setActionCommand("back");
        fwdButton.setActionCommand("fwd");
        backButton.addActionListener(this);
        fwdButton.addActionListener(this);

        textField.addActionListener( new ActionListener() {
            public void actionPerformed(ActionEvent e) {
                final String pattern = textField.getText();
                searcher.setPattern(pattern);
            }
        } );

        textArea.getDocument().addDocumentListener( new DocumentListener() { 
            public void insertUpdate(DocumentEvent e) { docChanged = true; }
            public void removeUpdate(DocumentEvent e) { docChanged = true; }
            public void changedUpdate(DocumentEvent e) { docChanged = true; }
        });
    }

    public void actionPerformed(ActionEvent e)  {

        if ( docChanged ) {
            final String newDocument = textArea.getText();
            searcher.setText(newDocument);
            docChanged = false;
        }

        MatchResult where = null;

        if ("back".equals(e.getActionCommand())) {
            where = searcher.searchBackward();
        } else if ("fwd".equals(e.getActionCommand())) {
            where = searcher.searchForward();
        }

        textArea.getHighlighter().removeAllHighlights();

        if (where != null) {
            final int start = where.start();
            final int end   = where.end();
            // highligh result and scroll
            try {
            textArea.getHighlighter().addHighlight(start, end, new DefaultHighlighter.DefaultHighlightPainter(Color.yellow));
            } catch (BadLocationException excp) {}
            textArea.scrollRectToVisible(new Rectangle(0, 0, scrollPane.getViewport().getWidth(), scrollPane.getViewport().getHeight()));
            SwingUtilities.invokeLater(new Runnable() {
                    @Override
                    public void run() { textArea.setCaretPosition(start); }
            });
        } else if (where == null) {
            // no match, so let's wrap around
            if ("back".equals(e.getActionCommand())) {
                searcher.setSearchOffset( searcher.getText().length() -1 );
            } else if ("fwd".equals(e.getActionCommand())) {
                searcher.setSearchOffset(0);
            }
        }
    }

    private static void createAndShowGUI() {
        //Create and set up the window.
        JFrame frame = new JFrame("SearchTest");
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);

        //Add contents to the window.
        frame.add(new SearchTest());

        //Display the window.
        frame.pack();
        frame.setVisible(true);
    }

    public static void main(String[] args) {
        //Schedule a job for the event dispatch thread:
        //creating and showing this application's GUI.
        javax.swing.SwingUtilities.invokeLater(new Runnable() {
            public void run() {
                createAndShowGUI();
            }
        });
    }
}
like image 21
luca Avatar answered Sep 17 '22 18:09

luca


I use the following simple class to search backwards in java

public class ReverseMatcher {
   private final Matcher _matcher;
   private final Stack<MatchResult> _results = new Stack<>();

   public ReverseMatcher(Matcher matcher){
       _matcher = matcher;
   }

   public boolean find(){
       return find(_matcher.regionEnd());
   }

   public boolean find(int start){
       if (_results.size() > 0){
           _results.pop();
           return _results.size() > 0;
       }
       boolean res = false;
       while (_matcher.find()){            
           if (_matcher.end() > start)
               break;
           res = true;
           _results.push(_matcher.toMatchResult());
       }
       return res;
   }

   public String group(int group){
       return _results.peek().group(group);               
   }

   public String group(){
       return _results.peek().group();               
   }

   public int start(){
       return _results.peek().start();
   }    

   public int end(){
       return _results.peek().end();
   }
}

using:

String srcString = "1 2 3 4 5 6 7 8 9";
String pattern = "\\b[0-9]*\\b";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(srcString);
ReverseMatcher rm = new ReverseMatcher(m);
while (rm.find())
   System.out.print(rm.group() + " ");

output: 9 8 7 6 5 4 3 2 1

or

while (rm.find(9))
   System.out.print(rm.group() + " ");

output: 5 4 3 2 1

like image 41
kosotd Avatar answered Sep 20 '22 18:09

kosotd