I recently started working with ontologies and I am using Protege to build an ontology which I'd also like to use for automatically classifying strings. The following illustrates a very basic class hierarchy:
String
|_ AlphabeticString
   |_ CountryName
   |_ CityName
|_ AlphaNumericString
   |_ PrefixedNumericString
|_ NumericString
Eventually, strings like Spain should be classified as CountryName, and UE4564 would be a PrefixedNumericString.
However, I am not sure how to model this knowledge. Would I first have to define whether a character is alphabetic, numeric, etc. and then construct a word from the existing characters, or is there a way to use regexes? So far I have only managed to classify strings based on an exact phrase, like String and hasString value "UE4565".
Or would it be better to save a regex for each class in the ontology and then classify the string in Java using those regexes?
An approach that might be appropriate here, especially if the ontology is large/complicated or might change in the future, and assuming that some errors are acceptable, is machine learning.
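An outline of a process utilizing this approach might be: define a set of features to extract from each string; collect strings labeled with their correct classes; train and test a classifier on the extracted features; then use the trained model to classify new strings.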
To illustrate more concretely, here are some suggestions based on your ontology example.
Some boolean features that might be applicable: does the string match a regexp (e.g. the ones Qtax suggests); does the string exist in a prebuilt list of known city names; does it exist in a list of known country names; does it contain uppercase letters. String length (not boolean) could be added as well, etc.
So if, for instance, you have a total of 8 features (matches against the 4 regular expressions from Qtax's answer, plus the additional 4 suggested here), then "Spain" would be represented as (1,1,0,0,0,1,1,5): it matches the first 2 regular expressions but not the last two, is not in the city-names list but is in the country-names list, has an uppercase letter, and has length 5.
This set of features will represent any given string.
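As a rough sketch of what that feature extraction could look like in Java: this assumes the feature order is the four ASCII regular expressions (alphabetic, alphanumeric, prefixed numeric, numeric), then the city-list lookup, the country-list lookup, the uppercase check, and the length. The class name StringFeatures, the method extract, and the tiny CITIES and COUNTRIES sets are placeholders I made up for illustration; real lists would have to come from a proper source.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

class StringFeatures {
    // The four ASCII expressions, in the order:
    // alphabetic, alphanumeric, prefixed numeric, numeric.
    static final List<Pattern> PATTERNS = Arrays.asList(
            Pattern.compile("^[A-Za-z]+\\z"),
            Pattern.compile("^[A-Za-z0-9]+\\z"),
            Pattern.compile("^[A-Za-z]+[0-9]+\\z"),
            Pattern.compile("^[0-9]+\\z"));

    // Hypothetical stand-ins for real city and country lists.
    static final Set<String> CITIES =
            new HashSet<>(Arrays.asList("Madrid", "Paris", "Berlin"));
    static final Set<String> COUNTRIES =
            new HashSet<>(Arrays.asList("Spain", "France", "Germany"));

    // Turns a string into the 8-element feature vector described above.
    static int[] extract(String s) {
        int[] f = new int[8];
        for (int i = 0; i < PATTERNS.size(); i++) {
            f[i] = PATTERNS.get(i).matcher(s).find() ? 1 : 0;
        }
        f[4] = CITIES.contains(s) ? 1 : 0;     // known city name?
        f[5] = COUNTRIES.contains(s) ? 1 : 0;  // known country name?
        f[6] = s.chars().anyMatch(Character::isUpperCase) ? 1 : 0;
        f[7] = s.length();                     // not boolean
        return f;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(extract("Spain")));
        System.out.println(Arrays.toString(extract("UE4564")));
    }
}
```

Vectors like these, together with the known class of each training string, are what you would hand to the learning algorithm.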
To train and test a machine learning algorithm, you can use WEKA. I would start with rule- or tree-based algorithms, e.g. PART, Ridor, JRip or J48.
Then the trained models can be used via Weka either from within Java or as an external command line.
Obviously, the features I suggest have an almost 1:1 match with your ontology, but assuming your taxonomy is larger and more complex, this approach would probably be one of the best in terms of cost-effectiveness.
I don't know anything about Protege, but you can use regex to match most of those cases. The only problem would be differentiating between country and city names; I don't see how you could do that without a complete list of either one.
Here are some expressions that you could use:
AlphabeticString: ^[A-Za-z]+\z (ASCII) or ^\p{Alpha}+\z (Unicode)
AlphaNumericString: ^[A-Za-z0-9]+\z (ASCII) or ^\p{Alnum}+\z (Unicode)
PrefixedNumericString: ^[A-Za-z]+[0-9]+\z (ASCII) or ^\p{Alpha}+\p{N}+\z (Unicode)
NumericString: ^[0-9]+\z (ASCII) or ^\p{N}+\z (Unicode)
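If you end up doing the classification in Java, here is a minimal sketch of how these expressions might be applied. It tries the more specific classes first, because a purely alphabetic or purely numeric string also matches the alphanumeric expression. The class name RegexClassifier, the method classify, and the small COUNTRIES and CITIES sets are made up for illustration (you would need complete lists for those two classes). One Java-specific caveat: \p{Alpha} and \p{Alnum} are ASCII-only in java.util.regex unless the pattern is compiled with Pattern.UNICODE_CHARACTER_CLASS, which the UNICODE_ALPHA constant below shows.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

class RegexClassifier {
    // Hypothetical lookup lists; regexes alone cannot tell these apart.
    static final Set<String> COUNTRIES =
            new HashSet<>(Arrays.asList("Spain", "France", "Germany"));
    static final Set<String> CITIES =
            new HashSet<>(Arrays.asList("Madrid", "Paris", "Berlin"));

    // Insertion order matters: most specific classes come first.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("PrefixedNumericString", Pattern.compile("^[A-Za-z]+[0-9]+\\z"));
        RULES.put("NumericString", Pattern.compile("^[0-9]+\\z"));
        RULES.put("AlphabeticString", Pattern.compile("^[A-Za-z]+\\z"));
        RULES.put("AlphaNumericString", Pattern.compile("^[A-Za-z0-9]+\\z"));
    }

    // Unicode-aware variant of the alphabetic expression.
    static final Pattern UNICODE_ALPHA =
            Pattern.compile("^\\p{Alpha}+\\z", Pattern.UNICODE_CHARACTER_CLASS);

    // Returns the first matching class name, or "String" as the fallback.
    static String classify(String s) {
        if (COUNTRIES.contains(s)) return "CountryName";
        if (CITIES.contains(s)) return "CityName";
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            if (rule.getValue().matcher(s).find()) return rule.getKey();
        }
        return "String";
    }

    public static void main(String[] args) {
        for (String s : new String[] {"Spain", "UE4564", "12345", "hello", "a1b2"}) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

Keeping the expressions in an ordered map means the rules could just as well be read from annotations stored in the ontology, as long as the order from specific to general is preserved.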