Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Ideal functional language for implementing a full text search with .NET [closed]

During my studies of computer science I came accross some functional languages like Prolog but now I have only been doing imperative stuff like C#, Ruby JavaScript and Java for the past 10 years. Currently I am creating a full text search engine for an online shop and I have come quite far the "imperative way" already. But having stumbled accross some functional languages like Haskell of Clojure it became clear that the functional paradigm fits so much better and that the imperative way is just not the right tool for this job.

So we have a full text index of some 10 million records. Each record basically contains a word occurrence, along with the id and text position from the record of which it originates.

When the user enters a search string it is parsed into an expression tree. For example the search string "transformer 100 W" results in something like

AND('transformer%', OR(NEAR('100', 'W'), NEAR('100', 'watts'), '100W', '0.1kW'))

There is some extra "intelligence" going on here, but that is of no concern for this question.

The expression tree is then evaluated recursively and results in a couple of sql queries that could return up to 100,000 rows in the form of .NET-DataTables. These are then read into sets or dictionaries and, depending on the predicates, intersections and unions are applied in order to find all results that match the entire search expression. For the NEAR evaluation the position indexes of the found occurrences are compared as well. But this is all done imperatively, with a lot of for-loops.

Additionally there is a ranking function that adds up scores of the found word occurrences. Words that are only found as prefixes or with fuzzy matching (done by the database server) get lower scores than precise matches.

For each resulting item I also need to get a list of all word occurrences that were matched, in order to highlight these words in the result pages.

So roughly the evaluation algorithm is a function like

expression tree, full text index -> 
resulting items that match the expressin tree, 
each with a ranking sum 
and a list of all found word occurrences for this item

I am just giving a rough overview here, but I hope you get enough of a picture.

Now the "real world" constraints:

  • The whole application (up to now) is written in C#, so an easy integration with .NET is paramount.
  • Loads of data is read into .NET-DataTables and will then need to be evaluated and transformed. The results should be contained in .NET types (Dictionaries, Sets, Arrays, whatever...).
  • Performance is of great importance. At present my algorithm often takes two seconds for a search (not counting the sql), which is kind of ok, but should be improved. Our server has 16 processors, so parallel processing would be welcome. Since we get about one search request per second and the current implementation is single threaded, processor time is still available.
  • The language (and the compiler) should be mature.

Since I need to stay with .NET, I was looking into Clojure-CLR, F# and Scala for .NET.

I like the concepts of Clojure a lot, but right now I can not assess, whether it would be up to the job. Reading about F# gave me mixed feelings, since it seems to want to be able to do just about everything, whereas I would tend to a more "pure" mathematical approach for the given task. But maybe that is possible with F# as well and I am not yet aware of it. I haven't delved into Scala much yet, but it seems to be well established.

Any insights would be welcome!

like image 687
Majnu Avatar asked Nov 05 '12 14:11

Majnu


2 Answers

  • The whole application (up to now) is written in C#, so an easy integration with .NET is paramount.
  • Loads of data is read into .NET-DataTables and will then need to be evaluated and transformed. The results should be contained in .NET types (Dictionaries, Sets, Arrays, whatever...).

F# should be a superior choice. Being a first-class language in Visual Studio, F#'s interoperability with C# is quite good.

  • Performance is of great importance. At present my algorithm often takes two seconds for a search (not counting the sql), which is kind of ok, but should be improved. Our server has 16 processors, so parallel processing would be welcome. Since we get about one search request per second and the current implementation is single threaded, processor time is still available.

Assuming that you start by a functional-first and immutable implementation, it should be easy to parallelize your app. Moreover, asynchronous workflow is a bless for IO-bound applications like yours.

  • The language (and the compiler) should be mature.

I don't compare F# to Clojure and Scala on JVM, but F# is much more mature than Clojure CLR and Scala on .NET. In choosing F#, you're sure to have long-term commitment from Microsoft and help from ever-growing F# community.

When the user enters a search string it is parsed into an expression tree.

You can represent expression trees using discriminated unions. With the introduction of query expressions in F# 3.0, you are able to translate your logics into SQL queries easily. You can even push it further by defining a similar query language for your domain.

Reading about F# gave me mixed feelings, since it seems to want to be able to do just about everything, whereas I would tend to a more "pure" mathematical approach for the given task. But maybe that is possible with F# as well and I am not yet aware of it.

F# 3.0 introduces type providers to allow users access non-structured data in a type-safe way; you may want to look at this "F# 3.0 - Information Rich Programming" video for more insights. If you would like to use F# as a programming language for data mining, I have asked a related question and got pretty good responses here.

That said, your first feelings about F# may not be correct. From my experience, you can always stay as close to the functional and immutable side as you want. Given that you already have an interesting application, I suggest to get your hands dirty to know whether F# is the language for your purpose.

UPDATE:

Here is an F# prototype which demonstrates the idea:

/// You start by modeling your domain with a set of types.
/// FullText is a sequence of Records, which is processed on demand.
type Word = string
and Freq = int
and Record = {Occurrences: (Word * Freq) list; Id: string}
and FullText = Record seq

/// Model your expression tree by following the grammar closely.
type Expression =
    | Occur of Word
    | Near of Word * Word
    | And of Expression * Expression 
    | Or of Expression * Expression

/// Find wether a word w occurs in the occurrence list.
let occur w {Occurrences = xs} = xs |> Seq.map fst |> Seq.exists ((=) w)

/// Check whether two words are near each other.
/// Note that type annotation is only needed for the stub implementation.
let near (w1: Word) (w2: Word) (r: Record): bool = failwith "Not implemented yet"

/// Evaluate an expression tree.
/// The code is succinct and clear thanks to pattern matching. 
let rec eval expr r = 
    match expr with
    | Occur w -> occur w r
    | Near(w1, w2) -> near w1 w2 r
    | And(e1, e2) -> eval e1 r && eval e2 r
    | Or(e1, e2) -> eval e1 r || eval e2 r

/// Utility function which returns second element in a 3-tuple
let inline snd3 (_, x, _) = x

/// Get the rank of the record by adding up frequencies on the whole database.
let rank (r: Record) (ft: FullText): Freq = failwith "Not implemented yet"

/// Retrieve all records which match the expression tree.
let retrieve expr fullText =
    fullText |> Seq.filter (eval expr)
             |> Seq.map (fun r -> r, rank r fullText, r.Occurrences)
             |> Seq.sortBy snd3

/// An example query
let query = 
    And (Occur "transformer%", 
         Or (Or (Near ("100", "W"), Near ("100", "watts")), 
             Or (Occur "100W", Occur "0.1kW")))
like image 111
pad Avatar answered Oct 22 '22 13:10

pad


I'm curious why you're not considering LINQ as an option. It seems to satisfy all your criteria. Note I have no experience with Scala, so I can't comment on that.

  • The whole application (up to now) is written in C#, so an easy integration with .NET is paramount.
  • Loads of data is read into .NET-DataTables and will then need to be evaluated and transformed. The results should be contained in .NET types (Dictionaries, Sets, Arrays, whatever...).

Here, LINQ > F# > Clojure-CLR. If everything is already in C#, LINQ is going to be the easiest to integrate. Visual Studio support for things like intellisense and function definition navigation seems much better in a C#-only program. Calling Clojure from C# can be horrendous—in theory it should work OK, but in practice, be prepared to spend weeks figuring out why things aren't working the way you'd expect. It's really designed to be the top level thing; you call C# from Clojure, going the opposite direction really just isn't high on the Clojure-CLR developers' priority list; there's basic support, but you get what you get.

  • Performance is of great importance. At present my algorithm often takes two seconds for a search (not counting the sql), which is kind of ok, but should be improved. Our server has 16 processors, so parallel processing would be welcome. Since we get about one search request per second and the current implementation is single threaded, processor time is still available.

LINQ ~= F# > Clojure. I've read elsewhere that LINQ's performance can be shown to be slightly better than F# for most idiomatically-written algorithms, but they're close enough that it shouldn't matter. PLINQ makes parallelism easy. Clojure-CLR has mega-slow startup time, and the runtime overhead slows things down too.

  • The language (and the compiler) should be mature.

LINQ >= F# > Clojure. Not to say F# is immature at all, but Visual Studio support lags behind a bit, and there's much more production code in the world (and a lot more stack overflow answers) based on LINQ than F#.

Reading about F# gave me mixed feelings, since it seems to want to be able to do just about everything, whereas I would tend to a more "pure" mathematical approach for the given task. But maybe that is possible with F# as well and I am not yet aware of it.

None of the languages are pure pure like Haskell, but in terms of how difficult it makes it to write non-pure code, I'd rank it as: LINQ > Clojure > F# > Scala. LINQ can only be made impure by calling impure methods. Clojure has refs and atoms, F# anything can be designated mutable, and Scala (per my understanding) really is just Java with functional features bolted on.

The functional feature that F# and Scala both have going for them though is language support for pattern matching. Where in C# you'd need either some kind of inheritance hierarchy or chains of b?x:y operators to do things functionally (or if/else if you're fine with non-functional approach), pattern matching makes conditional operation on different variations of raw data types much more succinct. This maybe could be useful in your calculation of exact vs prefix vs fuzzy match rankings, however a b?x:y chain var alg = x.match == exact ? alg1 : x.match == prefix ? alg2 : alg3 in C# would be perfectly legible in this simple case—it's when the matching gets much more complex that language integrated pattern matching becomes more valuable.

Interestingly, I think the one aspect of your toolkit where F# would prove more useful than LINQ isn't the querying, which the name of LINQ itself should indicate it can handle, but the parsing of your search string into an expression tree. This is one area where functional languages and pattern matching really excels, and add in tools like FsLex and FsYacc can give you a big head start.

All that said, I think the decision comes down to where you're wanting to go. If you just want to clean up your search algorithms and be done with it, I'd advise the LINQ approach. But if you're wanting to piece-by-piece get into a more functional-oriented style for the whole program (and your company is willing to pay for the time you're committing to it), then do maybe look at the F# option. Either way I'd do the LINQ option first, as that will likely be more straightforward to you, and help guide your F# to be more functionally idiomatic once you start down that path.

Simplistically, here's what you'd want, just fill in your functions for your Near and Equal fetchers, and your GetRank and GetStrings functions, and use the below

static IEnumerable<Record> FetchRecords(this Tree tree) {
    return tree.Op == "OR"    ? tree.Args.SelectMany(FetchRecords).Distinct() :
           tree.Op == "AND"   ? tree.Args.Select(FetchRecords).Aggregate((intersect, current) => intersect.Intersect(current)) :
           tree.Op == "NEAR"  ? FetchValsNear(tree.Args[0].Op, tree.Args[1].Op) :
                                FetchValsEqual(tree.Op);
}

static IEnumerable<Record> FetchValsEqual(string s) {
    throw new NotImplementedException();
}

static IEnumerable<Record> FetchValsNear(string s1, string s2) {
    throw new NotImplementedException();
}

static IEnumerable<Tuple<Record, double, string[]>> OrderByRank(this IEnumerable<Record> vals) {
    return from val in vals
           let rank = GetRank(val)
           orderby rank
           let strings = GetStringsIn(val)
           select Tuple.Create(val, rank, strings);
}

static string[] GetStringsIn(Record val) {
    throw new NotImplementedException();
}

static double GetRank(Record val) {
    throw new NotImplementedException();
}

class Tree {
    public string Op;
    public Tree[] Args;
}

struct Record {/*your record here--use struct so Distinct and Intersect above work naturally (or use class and override Equals)*/}

like this:

foreach (var tuple in myTree.FetchRecords().AsParallel().OrderByRank().Take(30)) {
    // add to datagrid or whatever
}

This gives you both simple parallelizability, and laziness so that the GetStringsIn function is only executed on the records you take (in this case the top 30). (Note the AND selector can be simplified using any of the IntersectAll examples here).

like image 7
10 revs Avatar answered Oct 22 '22 12:10

10 revs