
How to identify the minimal set of parameters describing a data set

Tags: algorithm

I have a bunch of regression test data. Each test is just a list of messages (associative arrays), mapping message field names to values. There's a lot of repetition within this data.

For example

   test1 = [
      { sender => 'client',  msg => '123',  arg => '900', foo => 'bar', ... },
      { sender => 'server',  msg => '456',  arg => '800', foo => 'bar', ... },
      { sender => 'client',  msg => '789',  arg => '900', foo => 'bar', ... },
   ]

I would like to represent the field data (as a minimal-depth decision tree?) so that each message can be programmatically regenerated using a minimal number of parameters. For example, in the above:

  • foo is always 'bar', so I don't need to mention it
  • sender and arg are correlated, so I only need to mention one or the other
  • and msg is different each time

So I would like to be able to regenerate these messages with a program along the lines of

write_msg( 'client', '123' )
write_msg( 'server', '456' )
write_msg( 'client', '789' )

where the write_msg function would be composed of nested if statements or subfunction calls using the parameters.
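As a rough illustration of that idea, here is a minimal Python sketch of such a function for the four fields shown in the example. The lookup table ARG_BY_SENDER, the dict return value, and the choice of Python are assumptions made for illustration. The fields elided by "..." are not covered.

```python
# Hypothetical sketch: field names and values are taken from the example
# above; the lookup table and the dict return type are assumptions.
ARG_BY_SENDER = {"client": "900", "server": "800"}


def write_msg(sender, msg):
    """Rebuild one message from its two free parameters."""
    return {
        "sender": sender,
        "msg": msg,
        "arg": ARG_BY_SENDER[sender],  # determined by sender
        "foo": "bar",                  # constant in every message
    }


test1 = [
    write_msg("client", "123"),
    write_msg("server", "456"),
    write_msg("client", "789"),
]
```

Here the constant field is hard-coded and the correlated field is looked up, so only sender and msg remain as arguments.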

Based on my original data, how can I determine the 'most important' set of parameters, i.e. the ones that will let me recreate my data set using the smallest number of arguments?
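One way to make "smallest number of arguments" concrete is to search for the smallest subsets of fields whose values pin down the whole message. The sketch below does this by brute force in Python. The function names are hypothetical, it assumes a non-empty list of messages that all share the same fields, and it tries every subset, so it is only practical for a small number of fields.

```python
from itertools import combinations


def determines_all(rows, params):
    """True if the values of `params` fix every other field in each row."""
    seen = {}
    for row in rows:
        key = tuple(row[p] for p in params)
        # The first row seen for a key is remembered; a different row
        # with the same key means `params` is not enough.
        if seen.setdefault(key, row) != row:
            return False
    return True


def smallest_parameter_sets(rows):
    """Return all smallest field subsets that determine the whole row."""
    fields = sorted(rows[0])
    for size in range(len(fields) + 1):
        hits = [c for c in combinations(fields, size)
                if determines_all(rows, c)]
        if hits:
            return hits
    return []


# The example data from above, limited to the four fields shown.
test1 = [
    {"sender": "client", "msg": "123", "arg": "900", "foo": "bar"},
    {"sender": "server", "msg": "456", "arg": "800", "foo": "bar"},
    {"sender": "client", "msg": "789", "arg": "900", "foo": "bar"},
]
print(smallest_parameter_sets(test1))
```

On a sample this small, msg is unique in every message, so it determines everything on its own. Once msg values repeat across messages, the search falls back to pairs such as msg plus sender.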

Eric asked Oct 03 '08 18:10


1 Answer

The following papers describe algorithms for discovering functional dependencies:

Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. TANE: An efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42(2):100–111, 1999, doi:10.1093/comjnl/42.2.100.

I. Savnik and P. A. Flach. Bottom-up induction of functional dependencies from relations. In Proc. AAAI-93 Workshop: Knowledge Discovery in Databases, pages 174–185, Washington, DC, USA, 1993.

C. Wyss, C. Giannella, and E. Robertson. FastFDs: A Heuristic-Driven, Depth-First Algorithm for Mining Functional Dependencies from Relation Instances. In Proc. Data Warehousing and Knowledge Discovery, pages 101–110, Munich, Germany, 2001, doi:10.1007/3-540-44801-2.

H. Yao and H. J. Hamilton. Mining functional dependencies from data. Data Mining and Knowledge Discovery, 2008, doi:10.1007/s10618-007-0083-9.
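To show what a functional dependency looks like on the question's data, here is a brute-force Python sketch. It is not TANE, FastFDs, or any other algorithm from the papers above: it only checks dependencies with a single field on the left-hand side, and the function names are hypothetical. It assumes a non-empty list of messages that all share the same fields.

```python
from itertools import permutations


def constant_fields(rows):
    """Fields that take a single value across all rows."""
    return {f for f in rows[0] if len({r[f] for r in rows}) == 1}


def determines(rows, a, b):
    """True if every value of field a maps to exactly one value of field b."""
    seen = {}
    for r in rows:
        if seen.setdefault(r[a], r[b]) != r[b]:
            return False
    return True


def single_field_dependencies(rows):
    """List (a, b) pairs where a determines b, skipping constant fields."""
    const = constant_fields(rows)
    fields = [f for f in rows[0] if f not in const]
    return [(a, b) for a, b in permutations(fields, 2)
            if determines(rows, a, b)]


# The example data from the question, limited to the four fields shown.
test1 = [
    {"sender": "client", "msg": "123", "arg": "900", "foo": "bar"},
    {"sender": "server", "msg": "456", "arg": "800", "foo": "bar"},
    {"sender": "client", "msg": "789", "arg": "900", "foo": "bar"},
]
print(constant_fields(test1))
print(single_field_dependencies(test1))
```

On this data, foo is reported as constant, and sender and arg determine each other. Because msg is unique in such a small sample, it also trivially determines every other field. The published algorithms handle multi-field left-hand sides and prune the search far more efficiently than this pairwise loop.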

There has also been some work on discovering multivalued dependencies:

I. Savnik and P. A. Flach. Discovery of Multivalued Dependencies from Relations. Intelligent Data Analysis Journal, 4(3):195–211, IOS Press, 2000.

Vebjorn Ljosa answered Oct 15 '22 12:10