I have a set of objects of type Idea:
public class Idea
{
    public string Title { get; set; }
    public string Body { get; set; }
}
I want to search these objects by substring. For example, when I have an object with the title "idea", I want it to be found when I enter any substring of "idea": i, id, ide, idea, d, de, dea, e, ea, a.
I'm using RavenDB for storing data. The search query looks like this:
var ideas = session
    .Query<IdeaByBodyOrTitle.IdeaSearchResult, IdeaByBodyOrTitle>()
    .Where(x => x.Query.Contains(query))
    .As<Idea>()
    .ToList();
and the index is defined as follows:
public class IdeaByBodyOrTitle : AbstractIndexCreationTask<Idea, IdeaByBodyOrTitle.IdeaSearchResult>
{
    public class IdeaSearchResult
    {
        public string Query;
        public Idea Idea;
    }

    public IdeaByBodyOrTitle()
    {
        Map = ideas => from idea in ideas
                       select new
                       {
                           Query = new object[] { idea.Title.SplitSubstrings().Concat(idea.Body.SplitSubstrings()).Distinct().ToArray() },
                           idea
                       };
        Indexes.Add(x => x.Query, FieldIndexing.Analyzed);
    }
}
SplitSubstrings() is an extension method which returns all distinct substrings of a given string:
static class StringExtensions
{
    public static string[] SplitSubstrings(this string s)
    {
        s = s ?? string.Empty;
        List<string> substrings = new List<string>();
        for (int i = 0; i < s.Length; i++)
        {
            for (int j = 1; j <= s.Length - i; j++)
            {
                substrings.Add(s.Substring(i, j));
            }
        }
        return substrings.Select(x => x.Trim()).Where(x => !string.IsNullOrEmpty(x)).Distinct().ToArray();
    }
}
This is not working, in particular because RavenDB does not recognize the SplitSubstrings() method, since it lives in my custom assembly. How can I make this work, i.e. how do I force RavenDB to recognize this method? Besides that, is my approach appropriate for this kind of searching (searching by substring)?
EDIT
Basically, I want to build an auto-complete feature on top of this search, so it needs to be fast.
Btw: I'm using RavenDB - Build #960
You can perform a substring search across multiple fields using one of the following approaches:
( 1 )
public class IdeaByBodyOrTitle : AbstractIndexCreationTask<Idea>
{
    public IdeaByBodyOrTitle()
    {
        Map = ideas => from idea in ideas
                       select new
                       {
                           idea.Title,
                           idea.Body
                       };
    }
}
According to the RavenDB documentation:
"By default, RavenDB uses a custom analyzer called LowerCaseKeywordAnalyzer for all content. (...) The default values for each field are FieldStorage.No in Stores and FieldIndexing.Default in Indexes."
So by default, each field value is stored as a single lowercased term. If you check the index terms inside the Raven client, they look like this:
Title Body
------------------ -----------------
"the idea title 1" "the idea body 1"
"the idea title 2" "the idea body 2"
Based on that, a wildcard query can be constructed:
var wildquery = string.Format("*{0}*", QueryParser.Escape(query));
which is then used with the .In and .Where constructions (combined with the OR operator):
var ideas = session.Query<Idea, IdeaByBodyOrTitle>()
    .Where(x => x.Title.In(wildquery) || x.Body.In(wildquery));
( 2 )
Alternatively, you can use a raw Lucene query:
var ideas = session.Advanced.LuceneQuery<Idea, IdeaByBodyOrTitle>()
    .Where("(Title:" + wildquery + " OR Body:" + wildquery + ")");
( 3 )
You can also use the .Search expression, but you have to construct your index differently if you want to search across multiple fields:
public class IdeaByBodyOrTitle : AbstractIndexCreationTask<Idea, IdeaByBodyOrTitle.IdeaSearchResult>
{
    public class IdeaSearchResult
    {
        public string Query;
        public Idea Idea;
    }

    public IdeaByBodyOrTitle()
    {
        Map = ideas => from idea in ideas
                       select new
                       {
                           Query = new object[] { idea.Title, idea.Body },
                           idea
                       };
    }
}

var result = session.Query<IdeaByBodyOrTitle.IdeaSearchResult, IdeaByBodyOrTitle>()
    .Search(x => x.Query, wildquery,
            escapeQueryOptions: EscapeQueryOptions.AllowAllWildcards,
            options: SearchOptions.And)
    .As<Idea>();
Summary:
Keep in mind that a *term* query is rather expensive, especially because of the leading wildcard. In this post you can find more info about it. It says that a leading wildcard forces Lucene to do a full scan of the index and can therefore drastically slow down query performance. Lucene internally stores its indexes (actually the terms of string fields) sorted alphabetically and "reads" from left to right. That's the reason why a search with a trailing wildcard is fast and one with a leading wildcard is slow.
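To make the difference concrete, here is a minimal sketch of the idea in plain C#. It is only a toy model of a sorted term list, not Lucene's actual data structure; the class and method names (SortedTermLookup, PrefixSearch, SubstringScan) are hypothetical and chosen for illustration. It assumes the terms are distinct and sorted with ordinal comparison.

```csharp
using System;
using System.Collections.Generic;

// Toy model of a sorted term dictionary (an illustration, not Lucene's real implementation).
public static class SortedTermLookup
{
    // "term*": binary-search to the first possible match, then read forward
    // only while the terms still start with the prefix.
    // Assumes sortedTerms is distinct and sorted with StringComparer.Ordinal.
    public static List<string> PrefixSearch(string[] sortedTerms, string prefix)
    {
        var matches = new List<string>();
        int index = Array.BinarySearch(sortedTerms, prefix, StringComparer.Ordinal);
        if (index < 0)
        {
            index = ~index; // not found: the complement is the insertion point
        }
        for (int i = index;
             i < sortedTerms.Length && sortedTerms[i].StartsWith(prefix, StringComparison.Ordinal);
             i++)
        {
            matches.Add(sortedTerms[i]);
        }
        return matches;
    }

    // "*term*": the sort order offers no starting point, so every term is examined.
    public static List<string> SubstringScan(string[] sortedTerms, string fragment)
    {
        var matches = new List<string>();
        foreach (var term in sortedTerms)
        {
            if (term.IndexOf(fragment, StringComparison.Ordinal) >= 0)
            {
                matches.Add(term);
            }
        }
        return matches;
    }
}
```

PrefixSearch touches only the matching range plus a logarithmic number of comparisons, while SubstringScan grows linearly with the number of terms in the index, which is why the leading wildcard hurts.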
So alternatively x.Title.StartsWith("something") can be used, but this obviously does not search across all substrings. If you need a fast search, you can change the Indexes option for the fields you want to search on to Analyzed, but again this will not search across all substrings.
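The reason an Analyzed field still does not match arbitrary substrings can be sketched with a rough model of the two indexing modes. This is a simplification under stated assumptions: the helper names (ToyAnalyzers, KeywordTerms, WordTerms, HasTerm) are hypothetical, and the whitespace split only approximates what a real analyzer does (a real one may also drop stop words or punctuation).

```csharp
using System;
using System.Linq;

// Toy approximation of how field values become index terms (not RavenDB's or Lucene's code).
public static class ToyAnalyzers
{
    // Keyword-style (default) indexing: the whole lowercased value is a single term.
    public static string[] KeywordTerms(string value)
    {
        return new[] { value.ToLowerInvariant() };
    }

    // Analyzed-style indexing, crudely approximated: lowercase, then split into words.
    public static string[] WordTerms(string value)
    {
        return value.ToLowerInvariant()
                    .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // An exact term match, which is the cheap lookup an index is built for.
    public static bool HasTerm(string[] terms, string query)
    {
        return terms.Contains(query.ToLowerInvariant());
    }
}
```

With the value "The Idea Title 1", the word-level terms contain "idea", so a whole-word query matches without wildcards, but "dea" is not a term in either mode and can only be found with a wildcard scan.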
If the substring query contains a space, please check this question for a possible solution. For building suggestions, see http://architects.dzone.com/articles/how-do-suggestions-ravendb.