
Ontology-based string classification

I recently started working with ontologies and I am using Protege to build an ontology which I'd also like to use for automatically classifying strings. The following illustrates a very basic class hierarchy:

String
|_ AlphabeticString
   |_ CountryName
   |_ CityName
|_ AlphaNumericString
   |_ PrefixedNumericString
|_ NumericString

Eventually, a string like Spain should be classified as CountryName, and UE4564 as PrefixedNumericString.

However, I am not sure how to model this knowledge. Would I first have to define whether a character is alphabetic, numeric, etc. and then construct a word from the existing characters, or is there a way to use regexes? So far I have only managed to classify strings based on an exact phrase, like String and hasString value "UE4565".

Or would it be better to save a regex for each class in the ontology and then classify the string in Java using those regexes?

Pedro asked Mar 06 '12 12:03



2 Answers

An approach that might be appropriate here, especially if the ontology is large/complicated or might change in the future, and assuming that some errors are acceptable, is machine learning.

An outline of a process utilizing this approach might be:

  1. Define a feature set you can extract from each string, relating to your ontology (some examples below).
  2. Collect a "train set" of strings and their true matching categories.
  3. Extract features from each string, and train some machine-learning algorithm on this data.
  4. Use the trained model to classify new strings.
  5. Retrain or update your model as needed (e.g. when new categories are added).

To illustrate more concretely, here are some suggestions based on your ontology example.

Some features that might be applicable, most of them boolean: does the string match a regexp (e.g. the ones Qtax suggests); does the string exist in a prebuilt list of known city names; does it exist in a list of known country names; does it contain uppercase letters; the string length (not boolean); etc.

So if, for instance, you have a total of 8 features (a match against each of the 4 regular expressions Qtax suggests, plus the 4 additional features suggested here), then "Spain" would be represented as (1,1,0,0,0,1,1,5): it matches the first 2 regular expressions but not the last two, is not a city name but is a country name, has an uppercase letter, and has length 5.

This set of features will represent any given string.
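A rough Java sketch of such a feature extractor is shown here. It assumes the feature order used above (four regex matches, city list, country list, uppercase, length). The class name `StringFeatures` and the two tiny name lists are hypothetical placeholders; real lists would have to come from an external source.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

class StringFeatures {
    // ASCII variants of the four suggested expressions, in the order
    // AlphabeticString, AlphaNumericString, PrefixedNumericString, NumericString.
    private static final Pattern[] PATTERNS = {
        Pattern.compile("^[A-Za-z]+\\z"),
        Pattern.compile("^[A-Za-z0-9]+\\z"),
        Pattern.compile("^[A-Za-z]+[0-9]+\\z"),
        Pattern.compile("^[0-9]+\\z")
    };

    // Hypothetical, deliberately tiny lookup lists (assumption for illustration only).
    private static final Set<String> CITIES =
        new HashSet<>(Arrays.asList("Madrid", "Paris"));
    private static final Set<String> COUNTRIES =
        new HashSet<>(Arrays.asList("Spain", "France"));

    // Returns the 8 features: 4 regex flags, city flag, country flag,
    // uppercase flag, and the string length.
    static int[] extract(String s) {
        int[] f = new int[8];
        for (int i = 0; i < PATTERNS.length; i++) {
            f[i] = PATTERNS[i].matcher(s).find() ? 1 : 0;
        }
        f[4] = CITIES.contains(s) ? 1 : 0;
        f[5] = COUNTRIES.contains(s) ? 1 : 0;
        f[6] = s.chars().anyMatch(Character::isUpperCase) ? 1 : 0;
        f[7] = s.length();
        return f;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(extract("Spain")));
        System.out.println(Arrays.toString(extract("UE4564")));
    }
}
```

With these placeholder lists, `extract("Spain")` gives `[1, 1, 0, 0, 0, 1, 1, 5]`, since Spain is in the country list and not in the city list. Each such vector, paired with its known category, would be one row of the training set.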

To train and test a machine learning algorithm, you can use Weka. I would start with rule-based or tree-based algorithms, e.g. PART, Ridor, JRip or J48.

Then the trained models can be used via Weka either from within Java or as an external command line.

Obviously, the features I suggest have an almost 1:1 match with your ontology, but assuming your taxonomy is larger and more complex, this approach would probably be one of the most cost-effective.

etov answered Nov 16 '22 04:11


I don't know anything about Protege, but you can use regexes to match most of those cases. The only problem would be differentiating between country and city names; I don't see how you could do that without a complete list of either one.

Here are some expressions that you could use:

  • AlphabeticString:

    ^[A-Za-z]+\z (ASCII) or ^\p{Alpha}+\z (Unicode)

  • AlphaNumericString:

    ^[A-Za-z0-9]+\z (ASCII) or ^\p{Alnum}+\z (Unicode)

  • PrefixedNumericString:

    ^[A-Za-z]+[0-9]+\z (ASCII) or ^\p{Alpha}+\p{N}+\z (Unicode)

  • NumericString:

    ^[0-9]+\z (ASCII) or ^\p{N}+\z (Unicode)
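If the classification is done in Java, as the question proposes, the expressions can be tried in a fixed order. The following is only a sketch: the class name `RegexStringClassifier` and the rule ordering (most specific first, because `[A-Za-z0-9]+` also matches purely numeric and purely alphabetic strings) are my assumptions. It uses the ASCII variants, since in `java.util.regex` classes such as `\p{Alpha}` and `\p{Alnum}` match only ASCII unless `Pattern.UNICODE_CHARACTER_CLASS` is set.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class RegexStringClassifier {
    // LinkedHashMap keeps insertion order, so more specific classes are
    // tested before the broader AlphaNumericString.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("NumericString", Pattern.compile("^[0-9]+\\z"));
        RULES.put("AlphabeticString", Pattern.compile("^[A-Za-z]+\\z"));
        RULES.put("PrefixedNumericString", Pattern.compile("^[A-Za-z]+[0-9]+\\z"));
        RULES.put("AlphaNumericString", Pattern.compile("^[A-Za-z0-9]+\\z"));
    }

    // Returns the name of the first matching class, or the root class
    // "String" when no expression matches.
    static String classify(String s) {
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            if (rule.getValue().matcher(s).find()) {
                return rule.getKey();
            }
        }
        return "String";
    }

    public static void main(String[] args) {
        for (String s : new String[] {"Spain", "UE4564", "4564UE", "12345"}) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

Here `classify("UE4564")` returns `PrefixedNumericString` and `classify("Spain")` returns `AlphabeticString`; telling CountryName from CityName would still need a lookup list on top of this.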

Qtax answered Nov 16 '22 02:11