Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

String intersection using EF Core, Linq / SQL Server

I'm trying to find if there is any match of

  • array of input strings
  • with comma separated strings

stored inside SQL Server:

class Meeting
{
    public int Id { get; set; }
    public string? MeetingName { get; set; }
    public string DbColumnCommaSeparated { get; set; }
}

meetingQuery.Where(x => ArrayOfInputString
  .Any(y => x.DbColumnCommaSeparated.Split(",").Contains(y)))

Is it feasible to do it in an EF Core query using DbFunctions's and SQL STRING_SPLIT?

like image 642
Abhijeet Avatar asked Jan 18 '26 18:01

Abhijeet


1 Answers

What I can suggest with this particular database design is of course based on EF Core Mapping a queryable function to a table-valued function, similar to @GertArnold suggestion. However since the built-in SqlServer SPLIT_STRING function is already TVF, it can be mapped directly thus eliminating the need of custom db function and migration.

First, we'll need a simple keyless entity type (class) to hold the result, which according to the docs is a table with record having a single string column called "value":

[Keyless]
public class StringValue
{
    [Column("value")]
    public string Value { get; set; }
}

Next is the function itself. It could be defined as instance method f your db context, but I find it most appropriate to be a static method of some custom class, for instance called SqlFunctions:


public static class SqlFunctions
{
    [DbFunction(Name = "STRING_SPLIT", IsBuiltIn = true)]
    public static IQueryable<StringValue> Split(string source, string separator)
        => throw new NotSupportedException();
}

Note that this is just a "prototype" which never is supposed to be called (hence the throw in the "implementation") and just describes the traslation of actual db function call inside the query. Also all these attributes are arbitrary and the name, built-in etc. attributes can be configured fluently. I've put them here just for clarity and simplicity, since they are enough in this case and don't need the flexibility provided by fluent API.

Finally you have to register the db function for your model, by adding the following to the OnModelCreating override:

modelBuilder.HasDbFunction(() => SqlFunctions.Split(default, default));

The HasDbFunction overload used is the simplest and typesafe way of providing the information about your method using strongly typed expression rather than reflection.

And that's it. Now you can use

var query = db.Set<Meeting>()
    .Where(m => SqlFunctions.Split(m.DbColumnCommaSeparated, ",")
        .Any(e => ArrayOfInputString.Contains(e.Value)));

which will be translated to something like this:

SELECT [m].[Id], [m].[DbColumnCommaSeparated], [m].[MeetingName]
FROM [Meeting] AS [m]
WHERE EXISTS (
    SELECT 1
    FROM STRING_SPLIT([m].[DbColumnCommaSeparated], N',') AS [s]
    WHERE [s].[value] IN (N'a', N'b', N'c'))

with IN clause being different depending of the ArrayOfInputString content. I'm kind of surprised it does not get parameterized as in a "normal" Contains translation, but at least it gets translated to something which can be executed server side.

One thing to note is that you need to flip ArrayOfInputString and the split result set in the LINQ query, since there is another limitation of EF Core preventing translation of anything else but Contains method of in memory collections. Since here you are looking for intersection, it really doesn't matter which one is first, so putting the queryable first avoids that limitation.


Now that you have a solution for this db design, is it good or not. Well, this seems to be arbitrary and opinion based, but in general using normalized tables and joins is preferred, since it allows db query optimizers to use indexes and statistics when generating execution plans. Joins are very efficient since they almost always use efficient indexed scans, so in my (and most of the people) opinion you should not count them when doing the design. In this particular case though I'm not sure if normalized detail table with indexed single text value would produce better execution plan than the above (which due to the lack of information would do a full table scan evaluating the filter for each row), but it's worth trying, and I guess it won't be worse at least.

Also, all this applies to relational databases. Non relational databases can in fact contain "embedded" arrays or lists of values, which then can be used to store and process such data instead of comma separated string. In either cases, I would prefer normalized design, storing a list of values either as "embedded" or in related detail table instead of single comma separated string. But again, that's just e general opinion/preference. The single string (containing tags list for instance) is valid approach which may outperform the other for some operations, so you whatever is appropriate for you. Also note that SPLIT_STRING is not a standard db function, so in case you need to work with other database than SqlServer, you'll have a problem of finding similar function if it exists at all.

like image 91
Ivan Stoev Avatar answered Jan 21 '26 06:01

Ivan Stoev