
Case sensitive/insensitive comparisons for Data.Text?

I often need to do comparisons of Data.Text values with differing requirements for case sensitivity - this comes up frequently when I'm using chatter for NLP tasks.

For example, when searching tokens for Information Extraction tasks, I frequently need to search based on equality relationships that are less restrictive than standard string equality. Case sensitivity is the most common of those changes, but it's often a function of the specific token. A term like "activate" is usually lower case, but as the first word in a sentence it starts with a capital, and in title text it may be all caps or capitalized mid-sentence, so comparisons that ignore case make sense. Conversely, an acronym (e.g., "US") has different semantics depending on its capitalization.

That's all to say that I can't easily create a typeclass wrapper for each equality class, since the choice is driven by the value rather than the type (so the case-insensitive package doesn't look like it would work).

So far, I'm using toLower to make a canonical representation and comparing those representations. That lets me write custom versions of the Text comparison functions that take a sensitivity flag, e.g.:

matches :: CaseSensitive -> Text -> Text -> Bool
matches Sensitive   x y = x == y
matches Insensitive x y = (T.toLower x) == (T.toLower y)

However, I'm concerned that this takes extra passes over the input text. I could imagine it fusing in some cases, but probably not all (e.g., T.isSuffixOf, T.isInfixOf).

Is there a better way to do this?

rcreswick asked Oct 31 '22 21:10


1 Answer

If the style of the comparison is driven by the semantics of the thing being compared, does it make sense to pass those semantics around with the actual text? You can then also normalise where appropriate, once, to avoid the repeated passes later:

data Token = Token CaseSensitive Text -- Text is all lower-case if Insensitive
    deriving Eq

and perhaps define a smart constructor:

token :: CaseSensitive -> Text -> Token
token Sensitive   t = Token Sensitive t
token Insensitive t = Token Insensitive (T.toLower t)

This implies that the acronym "US" won't ever compare equal to the word "us", but that seems logical anyway.
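Put together, the idea might look roughly like the sketch below. It assumes the CaseSensitive type from the question is a plain two-constructor enumeration (its definition isn't shown there), and it derives Show only for convenience in GHCi.

```haskell
import Data.Text (Text)
import qualified Data.Text as T

-- Assumed definition; the question does not show it.
data CaseSensitive = Sensitive | Insensitive
    deriving (Eq, Show)

-- Invariant: the Text is all lower-case when the flag is Insensitive.
data Token = Token CaseSensitive Text
    deriving (Eq, Show)

-- Smart constructor: normalise once, when the token is built,
-- so later comparisons are plain (==) with no extra pass.
token :: CaseSensitive -> Text -> Token
token Sensitive   t = Token Sensitive t
token Insensitive t = Token Insensitive (T.toLower t)
```

With this, token Insensitive (T.pack "Activate") == token Insensitive (T.pack "ACTIVATE") holds, while token Sensitive (T.pack "US") never equals any Insensitive token because the flags differ. To keep the invariant, you would export token but not the Token constructor. If the input goes beyond ASCII, T.toCaseFold may be a better fit than T.toLower, since it is the function Data.Text documents for caseless comparison.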

You might also tag the values with something more detailed like acronym/word/... rather than just Sensitive/Insensitive.
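As a rough illustration of that last suggestion, the tag can decide how the text is normalised. The Kind type and its constructors below are hypothetical names chosen for this sketch; the real set of tags would depend on your extraction task.

```haskell
import Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical token kinds, for illustration only.
data Kind = Acronym | Word
    deriving (Eq, Show)

data Token = Token Kind Text
    deriving (Eq, Show)

-- Each kind picks its own canonical form:
-- acronyms keep their case, ordinary words are lower-cased.
normalise :: Kind -> Text -> Text
normalise Acronym = id
normalise Word    = T.toLower

-- Smart constructor: normalise once, at construction time.
token :: Kind -> Text -> Token
token k t = Token k (normalise k t)
```

The derived Eq compares the tag as well as the text, so an Acronym token and a Word token are never equal even when their normalised text happens to coincide.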

GS - Apologise to Monica answered Nov 13 '22 02:11