How do I use regex to search ignoring certain characters with NSPredicate?

In Hebrew, there are certain vowels that NSPredicate fails to ignore even when using the 'd' (diacritic insensitive) modifier in the predicate. I was told that the solution is to use regular expressions to do the search.

How do I take a search string and use regex to search Hebrew text that contains vowels, ignoring those vowels?

Edit:

In other words, if I wanted to search the following text, disregarding dashes and asterisks, how would I do so using regex?

Example Text:

I w-en*t t-o the st*o*r*-e yes-ster*day.

Edit 2:

Essentially, I want to:

  1. Take an input string from a user
  2. Take a string to search
  3. Use a regex based on the user's search string to search for "contains" matches in the larger block of text. The regex should ignore vowels as shown above.

Edit 3:

Here's how I'm implementing my search:

//
//  The user updated the search text
//

- (BOOL)searchDisplayController:(UISearchDisplayController *)controller 
shouldReloadTableForSearchString:(NSString *)searchString{

    NSMutableArray *unfilteredResults = [[[[self.fetchedResultsController sections] objectAtIndex:0] objects] mutableCopy];

    if (self.filteredArray == nil) {
        self.filteredArray = [[[NSMutableArray alloc ] init] autorelease];
    }

    [filteredArray removeAllObjects];

    NSPredicate *predicate;

    if (controller.searchBar.selectedScopeButtonIndex == 0) {
        predicate = [NSPredicate predicateWithFormat:@"articleTitle CONTAINS[cd] %@", searchString];
    }else if (controller.searchBar.selectedScopeButtonIndex == 1) {
        predicate = [NSPredicate predicateWithFormat:@"articleContent CONTAINS[cd] %@", searchString];            
    }else if (controller.searchBar.selectedScopeButtonIndex == 2){
        predicate = [NSPredicate predicateWithFormat:@"ANY tags.tagText CONTAINS[cd] %@", searchString];
    }else{
        predicate = [NSPredicate predicateWithFormat:@"(ANY tags.tagText CONTAINS[cd] %@) OR (dvarTorahTitle CONTAINS[cd] %@) OR (dvarTorahContent CONTAINS[cd] %@)", searchString,searchString,searchString];
    }

    for (Article *article in unfilteredResults) {

        if ([predicate evaluateWithObject:article]) {
            [self.filteredArray addObject:article];
        }

    }

    [unfilteredResults release];


    return YES;
}

Edit 4:

I am not required to use regex for this; I was just advised to do so. If you have another way that works, go for it!

Edit 5:

I've modified my search to look like this:

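NSInteger length = [searchString length];

// Hebrew points, U+05B0 through U+05C4; each \u escape needs four hex digits
NSString *vowelsAsRegex = @"[\\u05B0-\\u05C4]*";

NSMutableString *modifiedSearchString = [searchString mutableCopy];

// Insert from the end so that the earlier indexes stay valid
for (NSInteger i = length; i > 0; i--) {
    [modifiedSearchString insertString:vowelsAsRegex atIndex:i];
}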

if (controller.searchBar.selectedScopeButtonIndex == 0) {
    predicate = [NSPredicate predicateWithFormat:@"articleTitle CONTAINS[cd] %@", modifiedSearchString];
}else if (controller.searchBar.selectedScopeButtonIndex == 1) {
    predicate = [NSPredicate predicateWithFormat:@"articleContent CONTAINS[cd] %@", modifiedSearchString];
}else if (controller.searchBar.selectedScopeButtonIndex == 2){
    predicate = [NSPredicate predicateWithFormat:@"ANY tags.tagText CONTAINS[cd] %@", modifiedSearchString];
}else{
    predicate = [NSPredicate predicateWithFormat:@"(ANY tags.tagText CONTAINS[cd] %@) OR (dvarTorahTitle CONTAINS[cd] %@) OR (dvarTorahContent CONTAINS[cd] %@)", modifiedSearchString,modifiedSearchString,modifiedSearchString];
}

for (Article *article in unfilteredResults) {
    if ([predicate evaluateWithObject:article]) {
        [self.filteredArray addObject:article];
    }
}

I'm still missing something here, what do I need to do to make this work?

Edit 6:

Okay, almost there. I need to make two more changes to be finished with this.

I need to be able to add other ranges of characters to the regex. Characters from these ranges might appear instead of, or in addition to, the characters in the first range. I've tried changing the first range to this:

[\u05b0-\u05c, \u0591-\u05AF]?

Something tells me that this is incorrect.

Also, I need the rest of the regex to be case insensitive. What modifier do I need to use with the .* regex to make it case insensitive?

Moshe asked Nov 07 '11 03:11


1 Answer

The Hebrew vowels are well defined in Unicode: Table of Hebrew characters and Marks

When you receive the input string from the user, you can insert the regular expression [\u05B0-\u05C4]* between each pair of characters, and before and after the string. (The [] matches any one of the enclosed characters, and the * matches zero or more occurrences of the preceding expression.) Then you can search the text block using the result as a regular expression. This expression finds the user's letters in order, whether or not vowels sit between them in the text. The user can also type vowels to require them, and the expression would then match only text that contains those vowels.
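As a rough sketch of that idea, assuming NSRegularExpression is available (iOS 4 or later): the helper names MarkTolerantPattern and TextContainsIgnoringMarks below are made up for illustration and are not part of any framework. The character class also folds in the U+0591 to U+05AF range from Edit 6, and case insensitivity is requested through the NSRegularExpressionCaseInsensitive option rather than a modifier inside the pattern.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): builds a pattern that
// allows any number of Hebrew marks before, between, and after the
// characters of the search string.
static NSString *MarkTolerantPattern(NSString *search) {
    // Assumed ranges: U+0591-U+05AF (from Edit 6) and U+05B0-U+05C4 (from the answer).
    // Ranges inside one [] are listed back to back, with no comma or space.
    NSString *marks = @"[\\u0591-\\u05AF\\u05B0-\\u05C4]*";
    NSMutableString *pattern = [NSMutableString stringWithString:marks];
    [search enumerateSubstringsInRange:NSMakeRange(0, search.length)
                               options:NSStringEnumerationByComposedCharacterSequences
                            usingBlock:^(NSString *substring, NSRange substringRange,
                                         NSRange enclosingRange, BOOL *stop) {
        // Escape user input so characters such as "." or "*" are matched literally.
        [pattern appendString:[NSRegularExpression escapedPatternForString:substring]];
        [pattern appendString:marks];
    }];
    return pattern;
}

// Hypothetical helper: YES if `text` contains `search`, ignoring marks and case.
static BOOL TextContainsIgnoringMarks(NSString *text, NSString *search) {
    if (text == nil || search.length == 0) {
        return NO;
    }
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:MarkTolerantPattern(search)
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (regex == nil) {
        return NO; // The pattern failed to compile; `error` describes why.
    }
    NSRange found = [regex rangeOfFirstMatchInString:text
                                             options:0
                                               range:NSMakeRange(0, text.length)];
    return found.location != NSNotFound;
}
```

In the filtering loop from the question, a call such as TextContainsIgnoringMarks(article.articleTitle, searchString) could stand in for the predicate evaluation, assuming Article exposes articleTitle as a string property. If an NSPredicate is still wanted, note that CONTAINS compares its argument literally; a pattern would have to go through MATCHES, which must match the entire string, so the pattern would need .* on both sides.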

I think that instead of trying to "ignore" the vowels, it would be easier to remove the vowels from both the large block of text and the user's string. Then you could search just the letters, as usual. This method would work if you don't need to display the vocalized text that the user found.
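A minimal sketch of the stripping approach, under the same assumptions as the answer's character range: StripHebrewMarks and StrippedTextContains are hypothetical names, not framework API. The removal uses the NSRegularExpressionSearch option of NSString, and the comparison afterwards is an ordinary case-insensitive substring search.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): returns a copy of `s`
// with the characters in U+05B0-U+05C4 removed.
static NSString *StripHebrewMarks(NSString *s) {
    return [s stringByReplacingOccurrencesOfString:@"[\\u05B0-\\u05C4]"
                                        withString:@""
                                           options:NSRegularExpressionSearch
                                             range:NSMakeRange(0, s.length)];
}

// Hypothetical helper: strips marks from both strings, then does a plain
// case-insensitive "contains" check on what is left.
static BOOL StrippedTextContains(NSString *text, NSString *search) {
    if (text == nil || search == nil) {
        return NO;
    }
    NSString *plainText = StripHebrewMarks(text);
    NSString *plainSearch = StripHebrewMarks(search);
    if (plainSearch.length == 0) {
        return NO;
    }
    NSRange found = [plainText rangeOfString:plainSearch
                                     options:NSCaseInsensitiveSearch];
    return found.location != NSNotFound;
}
```

Because the match is found in the stripped copy, its range does not map back onto the vocalized original, which is the limitation the paragraph above mentions. If many searches run against the same articles, the stripped text could be computed once and stored alongside the original rather than rebuilt for every keystroke.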

JXG answered Nov 12 '22 18:11