I am implementing a caching layer between my database and my C# code. The idea is to cache the results of certain DB queries based on the parameters to the query. The database is using the default collation - either SQL_Latin1_General_CP1_CI_AS
or Latin1_General_CI_AS
, which I believe based on some brief googling are equivalent for equality, just different for sorting.
I need a .NET StringComparer that can give me the same behavior, at least for equality testing and hashcode generation, as the database's collation is using. The goal is to be able to use the StringComparer in a .NET dictionary in C# code to determine whether a particular string key is already in the cache or not.
A really simplified example:
var comparer = StringComparer.??? // What goes here?
private static Dictionary<string, MyObject> cache =
new Dictionary<string, MyObject>(comparer);
public static MyObject GetObject(string key) {
if (cache.ContainsKey(key)) {
return cache[key].Clone();
} else {
// invoke SQL "select * from mytable where mykey = @mykey"
// with parameter @mykey set to key
MyObject result = // object constructed from the sql result
cache[key] = result;
return result.Clone();
}
}
public static void SaveObject(string key, MyObject obj) {
// invoke SQL "update mytable set ... where mykey = @mykey" etc
cache[key] = obj.Clone();
}
The reason it's important that the StringComparer matches the database's collation is that both false positives and false negatives would have bad effects for the code.
If the StringComparer says that two keys A and B are equal when the database believes they are distinct, then there could be two rows in the database with those two keys, but the cache will prevent the second one ever getting returned if asked for A and B in succession - because the get for B will incorrectly hit the cache and return the object that was retrieved for A.
The problem is more subtle if the StringComparer says that A and B are different when the database believes they are equal, but no less problematic. GetObject calls for both keys would be fine, and return objects corresponding to the same database row. But then calling SaveObject with key A would leave the cache incorrect; there would still be a cache entry for key B that has the old data. A subsequent GetObject(B) would give outdated information.
So for my code to work correctly I need the StringComparer to match the database behavior for equality testing and hashcode generation. My googling so far has yielded lots of information about the fact that SQL collations and .NET comparisons are not exactly equivalent, but no details on what the differences are, whether they are limited to only differences in sorting, or whether it is possible to find a StringComparer that is equivalent to a specific SQL collation if a general-purpose solution is not needed.
(Side note - the caching layer is general purpose, so I cannot make particular assumptions about what the nature of the key is and what collation would be appropriate. All the tables in my database share the same default server collation. I just need to match the collation as it exists)
The following is much simpler:
System.Globalization.CultureInfo.GetCultureInfo(1033)
.CompareInfo.GetStringComparer(CompareOptions.IgnoreCase | CompareOptions.IgnoreKanaType | CompareOptions.IgnoreWidth)
It comes from https://docs.microsoft.com/en-us/dotnet/api/system.globalization.globalizationextensions?view=netframework-4.8
It computes the hashcode correctly given the options given. You'll still have to trim trailing spaces manually, as they as discarded by ANSI sql but not in .net
Here is a wrapper that trims spaces.
using System.Collections.Generic;
using System.Globalization;
namespace Wish.Core
{
public class SqlStringComparer : IEqualityComparer<string>
{
public static IEqualityComparer<string> Instance { get; }
private static IEqualityComparer<string> _internalComparer =
CultureInfo.GetCultureInfo(1033)
.CompareInfo
.GetStringComparer(CompareOptions.IgnoreCase | CompareOptions.IgnoreKanaType | CompareOptions.IgnoreWidth);
private SqlStringComparer()
{
}
public bool Equals(string x, string y)
{
//ANSI sql doesn't consider trailing spaces but .Net does
return _internalComparer.Equals(x?.TrimEnd(), y?.TrimEnd());
}
public int GetHashCode(string obj)
{
return _internalComparer.GetHashCode(obj?.TrimEnd());
}
static SqlStringComparer()
{
Instance = new SqlStringComparer();
}
}
}
I've recently faced with the same problem: I need an IEqualityComparer<string>
that behaves in SQL-like style. I've tried CollationInfo
and its EqualityComparer
. If your DB is always _AS (accent sensitive) then your solution will work, but in case if you change the collation that is AI or WI or whatever "insensitive" else the hashing will break.
Why? If you decompile Microsoft.SqlServer.Management.SqlParser.dll and look inside you'll find out that CollationInfo
internally uses CultureAwareComparer.GetHashCode
(it's internal class of mscorlib.dll) and finally it does the following:
public override int GetHashCode(string obj)
{
if (obj == null)
throw new ArgumentNullException("obj");
CompareOptions options = CompareOptions.None;
if (this._ignoreCase)
options |= CompareOptions.IgnoreCase;
return this._compareInfo.GetHashCodeOfString(obj, options);
}
As you can see it can produce the same hashcode for "aa" and "AA", but not for "äå" and "aa" (which are the same, if you ignore diacritics (AI) in majority of cultures, so they should have the same hashcode). I don't know why the .NET API is limited by this, but you should understand where the problem can come from.
To get the same hashcode for strings with diacritics you can do the following: create implementation of IEqualityComparer<T>
implementing the GetHashCode
that will call appropriate CompareInfo
's object's GetHashCodeOfString
via reflection because this method is internal and can't be used directly. But calling it directly with correct CompareOptions
will produce the desired result:
See this example:
static void Main(string[] args)
{
const string outputPath = "output.txt";
const string latin1GeneralCiAiKsWs = "Latin1_General_100_CI_AI_KS_WS";
using (FileStream fileStream = File.Open(outputPath, FileMode.Create, FileAccess.Write))
{
using (var streamWriter = new StreamWriter(fileStream, Encoding.UTF8))
{
string[] strings = { "aa", "AA", "äå", "ÄÅ" };
CompareInfo compareInfo = CultureInfo.GetCultureInfo(1033).CompareInfo;
MethodInfo GetHashCodeOfString = compareInfo.GetType()
.GetMethod("GetHashCodeOfString",
BindingFlags.Instance | BindingFlags.NonPublic,
null,
new[] { typeof(string), typeof(CompareOptions), typeof(bool), typeof(long) },
null);
Func<string, int> correctHackGetHashCode = s => (int)GetHashCodeOfString.Invoke(compareInfo,
new object[] { s, CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace, false, 0L });
Func<string, int> incorrectCollationInfoGetHashCode =
s => CollationInfo.GetCollationInfo(latin1GeneralCiAiKsWs).EqualityComparer.GetHashCode(s);
PrintHashCodes(latin1GeneralCiAiKsWs, incorrectCollationInfoGetHashCode, streamWriter, strings);
PrintHashCodes("----", correctHackGetHashCode, streamWriter, strings);
}
}
Process.Start(outputPath);
}
private static void PrintHashCodes(string collation, Func<string, int> getHashCode, TextWriter writer, params string[] strings)
{
writer.WriteLine(Environment.NewLine + "Used collation: {0}", collation + Environment.NewLine);
foreach (string s in strings)
{
WriteStringHashcode(writer, s, getHashCode(s));
}
}
The output is:
Used collation: Latin1_General_100_CI_AI_KS_WS
aa, hashcode: 2053722942
AA, hashcode: 2053722942
äå, hashcode: -266555795
ÄÅ, hashcode: -266555795
Used collation: ----
aa, hashcode: 2053722942
AA, hashcode: 2053722942
äå, hashcode: 2053722942
ÄÅ, hashcode: 2053722942
I know it looks like the hack, but after inspecting decompiled .NET code I'm not sure if there any other option in case the generic functionality is needed.
So be sure that you'll not fall into trap using this not fully correct API.
UPDATE:
I've also created the gist with potential implementation of "SQL-like comparer" using CollationInfo
.
Also there should be paid enough attention where to search for "string pitfalls" in your code base, so if the string comparison, hashcode, equality should be changed to "SQL collation-like" those places are 100% will be broken, so you'll have to find out and inspect all the places that can be broken.
UPDATE #2:
There is better and cleaner way to make GetHashCode() treat CompareOptions. There is the class SortKey that works correctly with CompareOptions and it can be retrieved using
CompareInfo.GetSortKey(yourString, yourCompareOptions).GetHashCode()
Here is the link to .NET source code and implementation.
UPDATE #3:
If you're on .NET Framework 4.7.1+ you should use new GlobalizationExtensions
class as proposed by this recent answer.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With