I have a bunch of regression test data. Each test is just a list of messages (associative arrays), mapping message field names to values. There's a lot of repetition within this data.
For example:
test1 = [
{ sender => 'client', msg => '123', arg => '900', foo => 'bar', ... },
{ sender => 'server', msg => '456', arg => '800', foo => 'bar', ... },
{ sender => 'client', msg => '789', arg => '900', foo => 'bar', ... },
]
I would like to represent the field data (as a minimal-depth decision tree?) so that each message can be programmatically regenerated from a minimal number of parameters. For the example above, I would like to be able to regenerate these messages with a program along the lines of
write_msg( 'client', '123' )
write_msg( 'server', '456' )
write_msg( 'client', '789' )
where the write_msg function would be composed of nested if statements or subfunction calls using the parameters.
Based on my original data, how can I determine the 'most important' set of parameters, i.e. the ones that will let me recreate my data set using the smallest number of arguments?
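To make the question concrete, here is a minimal sketch (in Python, with hypothetical field names mirroring the example above) of the kind of analysis I mean: a brute-force search for the smallest subset of fields whose values determine every other field across all messages. This is exponential in the number of fields, so it is only illustrative for small schemas:

```python
from itertools import combinations

# Hypothetical sample data mirroring the test1 example above
test1 = [
    {'sender': 'client', 'msg': '123', 'arg': '900', 'foo': 'bar'},
    {'sender': 'server', 'msg': '456', 'arg': '800', 'foo': 'bar'},
    {'sender': 'client', 'msg': '789', 'arg': '900', 'foo': 'bar'},
]

def minimal_key_fields(messages):
    """Return the smallest set of fields whose values determine all
    remaining fields in every message (a brute-force search, checking
    subsets in increasing size order)."""
    fields = sorted(messages[0])
    for size in range(len(fields) + 1):
        for subset in combinations(fields, size):
            mapping = {}          # subset values -> remaining field values
            consistent = True
            for m in messages:
                key = tuple(m[f] for f in subset)
                rest = tuple(m[f] for f in fields if f not in subset)
                # If the same key maps to two different 'rest' tuples,
                # this subset does not determine the other fields.
                if mapping.setdefault(key, rest) != rest:
                    consistent = False
                    break
            if consistent:
                return list(subset)
    return fields

print(minimal_key_fields(test1))  # → ['msg']
```

Note that on this tiny sample, `msg` alone is enough (each message has a distinct `msg` value), which shows why the answer depends heavily on how representative the sample data is.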
The following papers describe algorithms for discovering functional dependencies:
Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. TANE: An efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42(2):100–111, 1999, doi:10.1093/comjnl/42.2.100.
I. Savnik and P. A. Flach. Bottom-up induction of functional dependencies from relations. In Proc. AAAI-93 Workshop: Knowledge Discovery in Databases, pages 174–185, Washington, DC, USA, 1993.
C. Wyss, C. Giannella, and E. Robertson. FastFDs: A Heuristic-Driven, Depth-First Algorithm for Mining Functional Dependencies from Relation Instances. In Proc. Data Warehousing and Knowledge Discovery, pages 101–110, Munich, Germany, 2001, doi:10.1007/3-540-44801-2.
H. Yao and H. J. Hamilton. Mining functional dependencies from data. Data Mining and Knowledge Discovery, 2008, doi:10.1007/s10618-007-0083-9.
There has also been some work on discovering multivalued dependencies:
I. Savnik and P. A. Flach. Discovery of Multivalued Dependencies from Relations. Intelligent Data Analysis Journal, 4(3):195–211, IOS Press, 2000.
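The core primitive underlying all of these algorithms is the check of whether a single candidate dependency X → A holds in a relation instance: rows that agree on the fields in X must also agree on A. A minimal sketch of that check (hypothetical field names, plain Python):

```python
def fd_holds(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds in rows:
    any two rows agreeing on all lhs fields must agree on the rhs field."""
    seen = {}  # lhs value tuple -> rhs value first observed for it
    for row in rows:
        key = tuple(row[f] for f in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True

rows = [
    {'sender': 'client', 'arg': '900', 'foo': 'bar'},
    {'sender': 'server', 'arg': '800', 'foo': 'bar'},
    {'sender': 'client', 'arg': '900', 'foo': 'bar'},
]
print(fd_holds(rows, ['sender'], 'arg'))  # → True: sender determines arg
print(fd_holds(rows, ['foo'], 'arg'))     # → False: foo does not
```

The papers above differ mainly in how they prune the exponential lattice of candidate left-hand sides (TANE works level-wise top-down; FastFDs is depth-first and heuristic-driven), not in this basic agreement test.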