Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get documents that contain sub-string in FaunaDB

Tags:

go

faunadb

I'm trying to retrieve all the tasks documents that have the string first in their name.

I currently have the following code, but it only works if I pass the exact name:

res, err := db.client.Query(
    f.Map(
        f.Paginate(f.MatchTerm(f.Index("tasks_by_name"), "My first task")),
        f.Lambda("ref", f.Get(f.Var("ref"))),
    ),
)

I think I can use ContainsStr() somewhere, but I don't know how to use it in my query.

Also, is there a way to do it without using Filter()? I ask because it seems like it filters after the pagination, and it messes up with the pages

like image 465
sainoba Avatar asked May 30 '20 21:05

sainoba


1 Answers

FaunaDB provides a lot of constructs, this makes it powerful but you have a lot to choose from. With great power comes a small learning curve :).

How to read the code samples

To be clear, I use the JavaScript flavor of FQL here and typically expose the FQL functions from the JavaScript driver as follows:

const faunadb = require('faunadb')
const q = faunadb.query
const {
  Not,
  Abort,
  ...
} = q

You do have to be careful to export Map like that since it will conflict with JavaScripts map. In that case, you could just use q.Map.

Option 1: using ContainsStr() & Filter

Basic usage according to the docs

ContainsStr('Fauna', 'a')

Of course, this works on a specific value so in order to make it work you need Filter and Filter only works on paginated sets. That means that we first need to get a paginated set. One way to get a paginated set of documents is:

q.Map(
  Paginate(Documents(Collection('tasks'))),
  Lambda(['ref'], Get(Var('ref')))
)

But we can do that more efficiently since one get === one read and we don't need the docs, we'll be filtering out a lot of them. It's interesting to know that one index page is also one read so we can define an index as follows:

{
  name: "tasks_name_and_ref",
  unique: false,
  serialized: true,
  source: "tasks",
  terms: [],
  values: [
    {
      field: ["data", "name"]
    },
    {
      field: ["ref"]
    }
  ]
}

And since we added name and ref to the values, the index will return pages of name and ref which we can then use to filter. We can, for example, do something similar with indexes, map over them and this will return us an array of booleans.

Map(
  Paginate(Match(Index('tasks_name_and_ref'))),
  Lambda(['name', 'ref'], ContainsStr(Var('name'), 'first'))
)

Since Filter also works on arrays, we can actually simple replace Map with filter. We'll also add a to lowercase to ignore casing and we have what we need:

Filter(
  Paginate(Match(Index('tasks_name_and_ref'))),
  Lambda(['name', 'ref'], ContainsStr(LowerCase(Var('name')), 'first'))
)

In my case, the result is:


{
  "data": [
    [
      "Firstly, we'll have to go and refactor this!",
      Ref(Collection("tasks"), "267120709035098631")
    ],
    [
      "go to a big rock-concert abroad, but let's not dive in headfirst",
      Ref(Collection("tasks"), "267120846106001926")
    ],
    [
      "The first thing to do is dance!",
      Ref(Collection("tasks"), "267120677201379847")
    ]
  ]
}

Filter and reduced page sizes

As you mentioned, this is not exactly what you want since it also means that if you request pages of 500 in size, they might be filtered out and you might end up with a page of size 3, then one of 7. You might think, why can't I just get my filtered elements in pages? Well, it's a good idea for performance reasons since it basically checks each value. Imagine you have a massive collection and filter out 99.99 percent. You might have to loop over many elements to get to 500 which all cost reads. We want pricing to be predictable :).

Option 2: indexes!

Each time you want to do something more efficient, the answer lies in indexes. FaunaDB provides you with the raw power to implement different search strategies but you'll have to be a bit creative and I'm here to help you with that :).

Bindings

In Index bindings, you can transform the attributes of your document and in our first attempt we will split the string into words (I'll implement multiple since I'm not entirely sure which kind of matching you want)

We do not have a string split function but since FQL is easily extended, we can write it ourselves bind to a variable in our host language (in this case javascript), or use one from this community-driven library: https://github.com/shiftx/faunadb-fql-lib

function StringSplit(string: ExprArg, delimiter = " "){
    return If(
        Not(IsString(string)),
        Abort("SplitString only accept strings"),
        q.Map(
            FindStrRegex(string, Concat(["[^\\", delimiter, "]+"])),
            Lambda("res", LowerCase(Select(["data"], Var("res"))))
        )
    )
)

And use it in our binding.

CreateIndex({
  name: 'tasks_by_words',
  source: [
    {
      collection: Collection('tasks'),
      fields: {
        words: Query(Lambda('task', StringSplit(Select(['data', 'name']))))
      }
    }
  ],
  terms: [
    {
      binding: 'words'
    }
  ]
})

Hint, if you are not sure whether you have got it right, you can always throw the binding in values instead of terms and then you'll see in the fauna dashboard whether your index actually contains values:

enter image description here What did we do? We just wrote a binding that will transform the value into an array of values at the time a document is written. When you index the array of a document in FaunaDB, these values are indexes separately yet point all to the same document which will be very useful for our search implementation.

We can now find tasks that contain the string 'first' as one of their words by using the following query:

q.Map(
  Paginate(Match(Index('tasks_by_words'), 'first')),
  Lambda('ref', Get(Var('ref')))
)

Which will give me the document with name: "The first thing to do is dance!"

The other two documents didn't contain the exact words, so how do we do that?

Option 3: indexes and Ngram (exact contains matching)

To get exact contains matching efficient, you need to use a (still undocumented function since we'll make it easier in the future) function called 'NGram'. Dividing a string in ngrams is a search technique that is often used underneath the hood in other search engines. In FaunaDB we can easily apply it as due to the power of the indexes and bindings. The Fwitter example has an example in it's source code that does autocompletion. This example won't work for your use-case but I do reference it for other users since it's meant for autocompleting short strings, not to search a short string in a longer string like a task.

We'll adapt it though for your use-case. When it comes to searching it's all a tradeoff of performance and storage and in FaunaDB users can choose their tradeoff. Note that in the previous approach, we stored each word separately, with Ngrams we'll split words even further to provide some form of fuzzy matching. The downside is that the index size might become very big if you make the wrong choice (this is equally true for search engines, hence why they let you define different algorithms).

What NGram essentially does is get substrings of a string of a certain length. For example:

NGram('lalala', 3, 3)

Will return:

enter image description here

If we know that we won't be searching for strings longer than a certain length, let's say length 10 (it's a tradeoff, increasing the size will increase the storage requirements but allow you to do query for longer strings), you can write the following Ngram generator.

function GenerateNgrams(Phrase) {
  return Distinct(
    Union(
      Let(
        {
          // Reduce this array if you want less ngrams per word.
          indexes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
          indexesFiltered: Filter(
            Var('indexes'),
            // filter out the ones below 0
            Lambda('l', GT(Var('l'), 0))
          ),
          ngramsArray: q.Map(Var('indexesFiltered'), Lambda('l', NGram(LowerCase(Var('Phrase')), Var('l'), Var('l'))))
        },
        Var('ngramsArray')
      )
    )
  )
}

You can then write your index as followed:

CreateIndex({
  name: 'tasks_by_ngrams_exact',
  // we actually want to sort to get the shortest word that matches first
  source: [
    {
      // If your collections have the same property tht you want to access you can pass a list to the collection
      collection: [Collection('tasks')],
      fields: {
        wordparts: Query(Lambda('task', GenerateNgrams(Select(['data', 'name'], Var('task')))))
      }
    }
  ],
  terms: [
    {
      binding: 'wordparts'
    }
  ]
})

And you have an index backed search where your pages are the size you requested.

q.Map(
  Paginate(Match(Index('tasks_by_ngrams_exact'), 'first')),
  Lambda('ref', Get(Var('ref')))
)

Option 4: indexes and Ngrams of size 3 or trigrams (Fuzzy matching)

If you want fuzzy searching, often trigrams are used, in this case our index will be easy so we're not going to use an external function.

CreateIndex({
  name: 'tasks_by_ngrams',
  source: {
    collection: Collection('tasks'),
    fields: {
      ngrams: Query(Lambda('task', Distinct(NGram(LowerCase(Select(['data', 'name'], Var('task'))), 3, 3))))
    }
  },
  terms: [
    {
      binding: 'ngrams'
    }
  ]
})

If we would place the binding in values again to see what comes out we'll see something like this: enter image description here In this approach, we use both trigrams on the indexing side as on the querying side. On the querying side, that means that the 'first' word which we search for will also be divided in Trigrams as follows:

enter image description here

For example, we can now do a fuzzy search as follows:

q.Map(
  Paginate(Union(q.Map(NGram('first', 3, 3), Lambda('ngram', Match(Index('tasks_by_ngrams'), Var('ngram')))))),
  Lambda('ref', Get(Var('ref')))
)

In this case, we do actually 3 searches, we are searching for all of the trigrams and union the results. Which will return us all sentences that contain first. enter image description here

But if we would have miss-spelled it and would have written frst we would still match all three since there is a trigram (rst) that matches.

like image 182
Brecht De Rooms Avatar answered Sep 30 '22 17:09

Brecht De Rooms