How to locate a data item's position in a hierarchy?

Tags: algorithm

I need to develop an algorithm that can locate a data item's position in a hierarchy. I have a hierarchy that classifies the elements of a dataset. The hierarchy is taxonomic: the top element is the most generic class, which matches any element of the dataset, and deeper elements are more specific classes that match some subset of the dataset.

For example, consider a hierarchy of yachts. We have the class Yacht at the top. On the next level we have Sailing yacht and Motor yacht. Sailing yacht has two children: Cruising yacht and Racing yacht. Cruisers can be further divided by manufacturer, for example Bavaria Yachts and Dufour Yachts. Each of these classes can then be further divided by hull type, length, sail area and so on.

This is an example from the dataset:

Drive   Class   Manufacturer   Hull type Len  Sails Area ... Model
Sailing Cruiser Bavaria Yachts Mono-hull 25ft 560sqft    ... Bavaria 32
Sailing Cruiser Dufour Yachts  Mono-hull 27ft 580sqft    ... Dufour 32 Classic

I can easily map each sample to the hierarchy by searching the hierarchy in depth-first order.

It is a simple search problem at first glance, but there are some difficulties.
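A minimal sketch of such a depth-first mapping, assuming every field is present and must match exactly. The `Node` class, its `conditions` dictionary, the `Racer` class value and the tree fragment are all hypothetical, chosen only to mirror the yacht example:

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    conditions: dict = field(default_factory=dict)  # field name -> required value
    children: list = field(default_factory=list)


def matches(node, record):
    """A node matches when every one of its conditions holds for the record."""
    return all(record.get(key) == value for key, value in node.conditions.items())


def deepest_path(node, record):
    """Depth-first search returning the longest chain of matching class names."""
    if not matches(node, record):
        return []
    best = []
    for child in node.children:
        candidate = deepest_path(child, record)
        if len(candidate) > len(best):
            best = candidate
    return [node.name] + best


# Hypothetical fragment of the yacht hierarchy (values are illustrative).
TREE = Node("Yacht", {}, [
    Node("Sailing yacht", {"Drive": "Sailing"}, [
        Node("Cruising yacht", {"Class": "Cruiser"}, [
            Node("Bavaria Yachts", {"Manufacturer": "Bavaria Yachts"}),
            Node("Dufour Yachts", {"Manufacturer": "Dufour Yachts"}),
        ]),
        Node("Racing yacht", {"Class": "Racer"}),
    ]),
    Node("Motor yacht", {"Drive": "Motor"}),
])

record = {"Drive": "Sailing", "Class": "Cruiser", "Manufacturer": "Dufour Yachts"}
print(deepest_path(TREE, record))
```

With exact matching like this, a record that lacks a field simply stops the descent at the last class it could confirm, so the returned path may be shorter than the full depth of the tree.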

First difficulty: data items don't necessarily contain all the elements. It's common for a data item to lack 10 to 50 percent of its elements. Many of these elements are not very significant; for example, a yacht's Drive can only be Motor or Sailing, so it carries little information (only 1 bit). These elements can easily be inferred from the more significant ones; for example, if we know the yacht's Model, we can infer all the other elements (or fields) of the data item.
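One way to put a number on how significant a field is: estimate the Shannon entropy of its column, skipping missing values. This is only a sketch; `entropy_bits` is a hypothetical helper and the sample column is made up for illustration:

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy (in bits) of the observed, non-missing values."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # nothing observed, so nothing to measure
    # sum of p * log2(1/p) over the distinct values
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Made-up Drive column: two equally frequent values, one missing entry.
drive_column = ["Sailing", "Motor", None, "Sailing", "Motor"]
print(entropy_bits(drive_column))
```

A two-valued field split evenly gives 1 bit, which matches the estimate for Drive above; a field such as Model with many distinct values would score much higher and is therefore a better one to match on first.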

Second difficulty: some elements can vary between data items even if they correspond to the same place in the hierarchy (the same yacht model). For example, Sails Area can vary greatly because boat owners modify their yacht's rig in different ways or simply round the area value.
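A simple way to tolerate this kind of variation is to compare numeric fields within a relative tolerance instead of testing for equality. The sketch below assumes a 10% tolerance, which is an arbitrary illustrative choice; `numeric_match` is a hypothetical helper, and a missing value is reported as "no evidence" rather than as a mismatch:

```python
import math


def numeric_match(reference, observed, rel_tol=0.10):
    """Return True/False for a tolerant comparison, or None when the value is missing."""
    if observed is None:
        return None  # missing field: neither confirms nor contradicts
    return math.isclose(reference, observed, rel_tol=rel_tol)


print(numeric_match(560, 575))   # within 10% of each other
print(numeric_match(560, 700))   # too far apart
print(numeric_match(560, None))  # no evidence
```

The right tolerance depends on the field: a rounded sail area may need a few percent, while a modified rig may need considerably more.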

As I've already mentioned, I need to locate the data items from the dataset in the hierarchy. Each data item can be located with a different precision. Precision is the depth in the hierarchy at which the search stops. In other words, I need to get the path in the hierarchy that corresponds to each data item, and this path can be incomplete. For example, the algorithm can find that a data item corresponds to a Juliet 23 yacht while the production year remains unknown.

It would be cool if I could get multiple paths with a probability measure for each. For example, the algorithm could return 4 paths for Juliet 23, one per production year, each with 25% probability.

At the moment I solve this problem using depth-first search with some heuristics. It gives good results, but I think it is possible to get better ones. Maybe you can formulate this problem in a more generic way so that I can search for academic papers about it.
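A sketch of what such an output could look like, assuming the leaves of the hierarchy are listed with their field values and a prior weight. The function `candidate_paths`, the production years and the second model name are all invented for illustration. The probability of each path is its prior weight, normalized over the leaves that the record does not contradict:

```python
def candidate_paths(leaves, record):
    """leaves: list of (path, attributes, prior_weight) tuples.

    A leaf stays a candidate when every attribute is either missing
    from the record or equal to the record's value.
    """
    consistent = [
        (path, weight)
        for path, attrs, weight in leaves
        if all(record.get(key) in (None, value) for key, value in attrs.items())
    ]
    total = sum(weight for _, weight in consistent)
    if total == 0:
        return []
    return [(path, weight / total) for path, weight in consistent]


# Invented leaves: four production years for one model, plus an unrelated model.
LEAVES = [
    (("Juliet 23", year), {"Model": "Juliet 23", "Year": year}, 1.0)
    for year in (1971, 1972, 1973, 1974)
] + [(("Other model", 1980), {"Model": "Other model", "Year": 1980}, 1.0)]

for path, probability in candidate_paths(LEAVES, {"Model": "Juliet 23"}):
    print(path, probability)
```

With equal prior weights and no Year in the record, the four Juliet 23 paths each receive probability 0.25. Unequal weights (for example, production counts per year) would shift the probabilities accordingly.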

Evgeny Lazin asked Nov 13 '22 13:11



1 Answer

I think SQL can really help you resolve your difficulties.

For your first difficulty: use NVL(field, value-if-null) (an Oracle function; COALESCE is the standard SQL equivalent)

Example: print the type and production year (if it exists) of racing yachts

-- PRDYEAR is a timestamp, so convert it to text before substituting 'UNKNOWN'
SELECT Y.TYPE, NVL(TO_CHAR(Y.PRDYEAR, 'YYYY'), 'UNKNOWN')
FROM T_YACHT Y WHERE Y.CLASS = 'RACING'

Example: get all yachts whose production date is later than 01-01-2000 (yachts with no production date get a very old default and are excluded)

SELECT * FROM T_YACHT Y WHERE
NVL(Y.PRDYEAR,TO_TIMESTAMP('01-01-0001','DD-MM-YYYY'))
    > TO_TIMESTAMP('01-01-2000','DD-MM-YYYY')

For your second difficulty: use GROUP BY, cascading SQL, DISTINCT, or NVL

Example: count the racing yachts of each type

SELECT Y.TYPE, COUNT(Y.ID) AS YACHT_COUNT
FROM T_YACHT Y
WHERE Y.CLASS = 'RACING'
GROUP BY Y.TYPE
Khaled.K answered Nov 15 '22 06:11
