Ontology-based string classification

Tags: string, regex, classification, ontology, protege

I recently started working with ontologies and I am using Protege to build an ontology which I'd also like to use for automatically classifying strings. The following illustrates a very basic class hierarchy:

String
|_ AlphabeticString
   |_ CountryName
   |_ CityName
|_ AlphaNumericString
   |_ PrefixedNumericString
|_ NumericString

Eventually, strings like Spain should be classified as CountryName, and UE4564 as PrefixedNumericString.

However, I am not sure how to model this knowledge. Would I have to first define whether a character is alphabetic, numeric, etc. and then construct a word from the existing characters, or is there a way to use regexes? So far I have only managed to classify strings based on an exact phrase like String and hasString value "UE4565".

Or would it be better to save a regex for each class in the ontology and then classify the string in Java using those regexes?

asked Mar 06 '12 12:03

Pedro

2 Answers

An approach that might be appropriate here, especially if the ontology is large/complicated or might change in the future, and assuming that some errors are acceptable, is machine learning.

An outline of a process utilizing this approach might be:

1. Define a feature set you can extract from each string, relating to your ontology (some examples below).
2. Collect a training set of strings and their true matching categories.
3. Extract features from each string, and train a machine-learning algorithm on this data.
4. Use the trained model to classify new strings.
5. Retrain or update your model as needed (e.g. when new categories are added).
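
The steps above can be sketched in plain Java. This is only an illustration under stated assumptions: a 1-nearest-neighbour rule is swapped in for a real learning algorithm, and the class name `TinyStringClassifier`, the three toy features, and the training strings are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the pipeline: extract features, train, classify.
// A 1-nearest-neighbour rule stands in for a real learning algorithm.
class TinyStringClassifier {
    private final List<int[]> trainFeatures = new ArrayList<>();
    private final List<String> trainLabels = new ArrayList<>();

    // Step 1: a toy feature set (assumed for illustration):
    // contains a letter, contains a digit, starts with a letter.
    static int[] features(String s) {
        boolean letter = false;
        boolean digit = false;
        for (char c : s.toCharArray()) {
            if (Character.isLetter(c)) letter = true;
            if (Character.isDigit(c)) digit = true;
        }
        boolean startsWithLetter = !s.isEmpty() && Character.isLetter(s.charAt(0));
        return new int[] { letter ? 1 : 0, digit ? 1 : 0, startsWithLetter ? 1 : 0 };
    }

    // Steps 2-3: store the feature vector of each labelled training string.
    void train(String s, String label) {
        trainFeatures.add(features(s));
        trainLabels.add(label);
    }

    // Step 4: return the label of the closest training example
    // (squared Euclidean distance; null if nothing was trained).
    String classify(String s) {
        int[] f = features(s);
        String best = null;
        int bestDist = Integer.MAX_VALUE;
        for (int i = 0; i < trainFeatures.size(); i++) {
            int[] t = trainFeatures.get(i);
            int d = 0;
            for (int j = 0; j < f.length; j++) {
                d += (f[j] - t[j]) * (f[j] - t[j]);
            }
            if (d < bestDist) {
                bestDist = d;
                best = trainLabels.get(i);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyStringClassifier c = new TinyStringClassifier();
        c.train("Madrid", "AlphabeticString");
        c.train("UE4564", "PrefixedNumericString");
        c.train("12345", "NumericString");
        System.out.println(c.classify("Spain")); // AlphabeticString
        System.out.println(c.classify("AB12"));  // PrefixedNumericString
        System.out.println(c.classify("777"));   // NumericString
    }
}
```

Step 5 would amount to calling `train` again with newly labelled strings; a real learner would be retrained on the enlarged training set instead.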

To illustrate more concretely, here are some suggestions based on your ontology example.

Some features that might be applicable, most of them boolean: whether the string matches a regexp (e.g. the ones Qtax suggests); whether it exists in a prebuilt list of known city names; whether it exists in a list of known country names; whether it contains uppercase letters; the string length (not boolean); etc.

So if, for instance, you have a total of 8 features (matches against the 4 regular expressions mentioned above, plus the 4 additional features suggested here), then "Spain" would be represented as (1,1,0,0,0,1,1,5): it matches the first two regular expressions but not the last two, is not a city name but is a country name, has an uppercase letter, and has length 5.

A feature vector of this kind can represent any given string.
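
As a rough sketch, such a feature vector could be computed in Java as follows. The class name `StringFeatures` and the two tiny name lists are assumptions for illustration only; a real system would load complete city and country lists. With these lists, where "Spain" appears only among the countries, the vector for "Spain" is (1,1,0,0,0,1,1,5).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch: builds the 8-element feature vector described above.
class StringFeatures {
    // The four ASCII expressions from the regex answer, in the same order.
    static final Pattern[] PATTERNS = {
        Pattern.compile("^[A-Za-z]+\\z"),        // AlphabeticString
        Pattern.compile("^[A-Za-z0-9]+\\z"),     // AlphaNumericString
        Pattern.compile("^[A-Za-z]+[0-9]+\\z"),  // PrefixedNumericString
        Pattern.compile("^[0-9]+\\z")            // NumericString
    };

    // Tiny stand-in lists (assumptions, not real data sets).
    static final Set<String> CITIES = new HashSet<>(Arrays.asList("Madrid", "Lisbon"));
    static final Set<String> COUNTRIES = new HashSet<>(Arrays.asList("Spain", "Portugal"));

    static int[] extract(String s) {
        int[] f = new int[8];
        // Features 0-3: one flag per regular expression.
        for (int i = 0; i < PATTERNS.length; i++) {
            f[i] = PATTERNS[i].matcher(s).matches() ? 1 : 0;
        }
        f[4] = CITIES.contains(s) ? 1 : 0;                          // known city name
        f[5] = COUNTRIES.contains(s) ? 1 : 0;                       // known country name
        f[6] = s.chars().anyMatch(Character::isUpperCase) ? 1 : 0;  // has an uppercase letter
        f[7] = s.length();                                          // length (not boolean)
        return f;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(extract("Spain")));  // [1, 1, 0, 0, 0, 1, 1, 5]
        System.out.println(Arrays.toString(extract("UE4564"))); // [0, 1, 1, 0, 0, 0, 1, 6]
    }
}
```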

To train and test a machine-learning algorithm, you can use WEKA. I would start with rule-based or tree-based algorithms, e.g. PART, Ridor, JRip or J48.

Then the trained models can be used via Weka either from within Java or as an external command line.

Obviously, the features I suggest have an almost 1:1 match with your ontology, but assuming your taxonomy is larger and more complex, this approach would probably be one of the best in terms of cost-effectiveness.

answered Nov 16 '22 04:11

etov

I don't know anything about Protege, but you can use regex to match most of those cases. The only problem would be differentiating between country and city names; I don't see how you could do that without a complete list of either one.

Here are some expressions that you could use:

AlphabeticString: ^[A-Za-z]+\z (ASCII) or ^\p{Alpha}+\z (Unicode)
AlphaNumericString: ^[A-Za-z0-9]+\z (ASCII) or ^\p{Alnum}+\z (Unicode)
PrefixedNumericString: ^[A-Za-z]+[0-9]+\z (ASCII) or ^\p{Alpha}+\p{N}+\z (Unicode)
NumericString: ^[0-9]+\z (ASCII) or ^\p{N}+\z (Unicode)
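
Since the question mentions Java, here is a minimal sketch of how these expressions might be applied there. The class name `RegexStringClassifier`, the rule ordering, and the fallback to the root class `String` are assumptions for illustration. One Java-specific caveat: in `java.util.regex`, `\p{Alpha}` and `\p{Alnum}` match only ASCII unless the `Pattern.UNICODE_CHARACTER_CLASS` flag is set.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: returns the first class whose pattern matches.
class RegexStringClassifier {
    // Insertion order matters: the AlphaNumericString pattern also matches
    // purely alphabetic, purely numeric, and prefixed-numeric input,
    // so it is checked last.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("NumericString", Pattern.compile("^[0-9]+\\z"));
        RULES.put("AlphabeticString", Pattern.compile("^[A-Za-z]+\\z"));
        RULES.put("PrefixedNumericString", Pattern.compile("^[A-Za-z]+[0-9]+\\z"));
        RULES.put("AlphaNumericString", Pattern.compile("^[A-Za-z0-9]+\\z"));
    }

    static String classify(String s) {
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            if (rule.getValue().matcher(s).matches()) {
                return rule.getKey();
            }
        }
        return "String"; // assumed fallback: the root of the hierarchy
    }

    // Unicode variant of the alphabetic check; the flag makes \p{Alpha}
    // cover non-ASCII letters as well.
    static boolean isUnicodeAlphabetic(String s) {
        return Pattern.compile("^\\p{Alpha}+\\z", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(s)
                      .matches();
    }

    public static void main(String[] args) {
        System.out.println(classify("Spain"));     // AlphabeticString
        System.out.println(classify("UE4564"));    // PrefixedNumericString
        System.out.println(classify("4564"));      // NumericString
        System.out.println(classify("4564UE"));    // AlphaNumericString
        System.out.println(classify("New York"));  // String
        System.out.println(isUnicodeAlphabetic("Espa\u00f1a")); // true
    }
}
```

Telling CountryName from CityName would still need a lookup against name lists, as noted above; the patterns alone can only place both under AlphabeticString.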

answered Nov 16 '22 02:11

Qtax
