I have a bunch of regression test data. Each test is just a list of messages (associative arrays), mapping message field names to values. There's a lot of repetition within this data.
For example:
test1 = [
{ sender => 'client', msg => '123', arg => '900', foo => 'bar', ... },
{ sender => 'server', msg => '456', arg => '800', foo => 'bar', ... },
{ sender => 'client', msg => '789', arg => '900', foo => 'bar', ... },
]
I would like to represent the field data (as a minimal-depth decision tree?) so that each message can be programmatically regenerated from a minimal number of parameters. For the example above, I would like to be able to regenerate these messages with a program along the lines of
write_msg( 'client', '123' )
write_msg( 'server', '456' )
write_msg( 'client', '789' )
where the write_msg function would be composed of nested if statements or subfunction calls using the parameters.
Based on my original data, how can I determine the 'most important' set of parameters, i.e. the ones that will let me recreate my data set using the smallest number of arguments?
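To make the question concrete, here is a minimal sketch (in Python, with hypothetical field names mirroring the example above) of the kind of analysis I mean: a brute-force search for the smallest subset of fields whose values determine every other field across all messages. This is exponential in the number of fields, so it is only illustrative for small schemas:

```python
from itertools import combinations

# Hypothetical sample data mirroring the test1 example above
test1 = [
    {'sender': 'client', 'msg': '123', 'arg': '900', 'foo': 'bar'},
    {'sender': 'server', 'msg': '456', 'arg': '800', 'foo': 'bar'},
    {'sender': 'client', 'msg': '789', 'arg': '900', 'foo': 'bar'},
]

def minimal_key_fields(messages):
    """Return the smallest set of fields whose values determine all
    remaining fields in every message (a brute-force search, checking
    subsets in increasing size order)."""
    fields = sorted(messages[0])
    for size in range(len(fields) + 1):
        for subset in combinations(fields, size):
            mapping = {}          # subset values -> remaining field values
            consistent = True
            for m in messages:
                key = tuple(m[f] for f in subset)
                rest = tuple(m[f] for f in fields if f not in subset)
                # If the same key maps to two different 'rest' tuples,
                # this subset does not determine the other fields.
                if mapping.setdefault(key, rest) != rest:
                    consistent = False
                    break
            if consistent:
                return list(subset)
    return fields

print(minimal_key_fields(test1))  # → ['msg']
```

Note that on this tiny sample, `msg` alone is enough (each message has a distinct `msg` value), which shows why the answer depends heavily on how representative the sample data is.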
The following papers describe algorithms for discovering functional dependencies:
Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. TANE: An efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42(2):100–111, 1999, doi:10.1093/comjnl/42.2.100.
I. Savnik and P. A. Flach. Bottom-up induction of functional dependencies from relations. In Proc. AAAI-93 Workshop: Knowledge Discovery in Databases, pages 174–185, Washington, DC, USA, 1993.
C. Wyss, C. Giannella, and E. Robertson. FastFDs: A Heuristic-Driven, Depth-First Algorithm for Mining Functional Dependencies from Relation Instances. In Proc. Data Warehousing and Knowledge Discovery, pages 101–110, Munich, Germany, 2001, doi:10.1007/3-540-44801-2.
H. Yao and H. J. Hamilton. Mining functional dependencies from data. Data Mining and Knowledge Discovery, 2008, doi:10.1007/s10618-007-0083-9.
There has also been some work on discovering multivalued dependencies:
I. Savnik and P. A. Flach. Discovery of Multivalued Dependencies from Relations. Intelligent Data Analysis Journal, 4(3):195–211, IOS Press, 2000.
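The core primitive underlying all of these algorithms is the check of whether a single candidate dependency X → A holds in a relation instance: rows that agree on the fields in X must also agree on A. A minimal sketch of that check (hypothetical field names, plain Python):

```python
def fd_holds(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds in rows:
    any two rows agreeing on all lhs fields must agree on the rhs field."""
    seen = {}  # lhs value tuple -> rhs value first observed for it
    for row in rows:
        key = tuple(row[f] for f in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True

rows = [
    {'sender': 'client', 'arg': '900', 'foo': 'bar'},
    {'sender': 'server', 'arg': '800', 'foo': 'bar'},
    {'sender': 'client', 'arg': '900', 'foo': 'bar'},
]
print(fd_holds(rows, ['sender'], 'arg'))  # → True: sender determines arg
print(fd_holds(rows, ['foo'], 'arg'))     # → False: foo does not
```

The papers above differ mainly in how they prune the exponential lattice of candidate left-hand sides (TANE works level-wise top-down; FastFDs is depth-first and heuristic-driven), not in this basic agreement test.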