 

Given 10 functions y = a + bx and thousands of (x, y) data points rounded to ints, how do I derive the 10 best (a, b) tuples?

We build software that audits fees charged by banks to merchants that accept credit and debit cards. Our customers want us to tell them if the card processor is overcharging them. Per-transaction credit card fees are calculated like this:

fee = fixed + variable*transaction_price

A "fee scheme" is the pair of (fixed, variable) used by a group of credit cards, e.g. "MasterCard business debit gold cards issued by First National Bank of Hollywood". We believe there are fewer than 10 different fee schemes in use at any time, but we aren't getting a complete nor current list of fee schemes from our partners. (yes, I know that some "fee schemes" are more complicated than the equation above because of caps and other gotchas, but our transactions are known to have only a + bx schemes in use).

Here's the problem we're trying to solve: we want to use per-transaction data about fees to derive the fee schemes in use. Then we can compare that list to the fee schemes that each customer should be using according to their bank.

The data we get about each transaction is a data tuple: (card_id, transaction_price, fee).

transaction_price and fee are in integer cents. The bank rolls over fractional cents for each transaction until the cumulative amount exceeds one cent, and then a "rounding cent" is attached to the fee of that transaction. We cannot predict which transaction the "rounding cent" will be attached to.
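To make the rollover concrete, here is a minimal C# sketch of one way it could work; the scheme values (5 cents fixed, 1.55% variable) and the floor-then-carry policy are assumptions for illustration only, not something the bank has confirmed.

// Illustration only: simulates one possible fractional-cent rollover policy.
// The fee scheme (5 cents fixed, 1.55% variable) is a made-up example.
using System;

class RoundingCentDemo
{
    static void Main()
    {
        double fixedCents = 5.0;       // hypothetical fixed fee, in cents
        double variableRate = 0.0155;  // hypothetical variable rate
        double carry = 0.0;            // fractional cents carried between transactions

        int[] prices = { 150, 150, 150, 200, 125 };  // transaction prices in cents
        foreach (int price in prices)
        {
            double exactFee = fixedCents + variableRate * price + carry;
            int chargedFee = (int)Math.Floor(exactFee);  // integer cents actually charged
            carry = exactFee - chargedFee;               // fraction rolls over to the next transaction
            Console.WriteLine("price={0}  charged fee={1}  carry={2:F4}", price, chargedFee, carry);
        }
    }
}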

card_id identifies a group of cards that share the same fee scheme. In a typical day of 10,000 transactions, there may be several hundred unique card_id's. Multiple card_id's will share a fee scheme.

The data we get looks like this, and what we want to figure out is the last two columns.

card_id    transaction_price       fee        fixed        variable
=======================================================================
12345      200                     22         ?            ?
67890      300                     21         ?            ?
56789      150                      8         ?            ?
34567      150                      8         ?            ?
34567      150  "rounding cent"->   9         ?            ?
34567      150                      8         ?            ?

The end result we want is a short list, with 10 or fewer entries, showing the fee schemes that best fit our data, like this:

fee_scheme_id       fixed     variable
======================================
1                      22            0
2                      21            0
3                       ?            ?
4                       ?            ?
...

The average fee is about 8 cents. This means the rounding cents have a huge impact and the derivation above requires a lot of data.

The average transaction is 125 cents. Transaction prices are always on 5-cent boundaries.

We want a short list of fee schemes that "fit" 98%+ of the 3,000+ transactions each customer gets each day. If that's not enough data to achieve 98% confidence, we can use multiple days' data.
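For reference, the acceptance test we have in mind looks roughly like the following C# sketch, where a transaction counts as "fitting" a scheme if the predicted fee is within one rounding cent; the type shapes and the ±1 cent tolerance are our own assumptions.

// Sketch: measure what fraction of transactions a candidate scheme list explains.
// A transaction "fits" a scheme if the predicted fee is within one rounding cent.
using System;
using System.Collections.Generic;
using System.Linq;

static class FitCheck
{
    public static double FractionExplained(
        IEnumerable<(long cardId, int price, int fee)> txns,
        IReadOnlyList<(double fixedCents, double variable)> schemes)
    {
        int total = 0, explained = 0;
        foreach (var t in txns)
        {
            total++;
            bool fits = schemes.Any(s =>
                Math.Abs(t.fee - (s.fixedCents + s.variable * t.price)) <= 1.0);
            if (fits) explained++;
        }
        return total == 0 ? 0.0 : (double)explained / total;
    }
}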

Because of the rounding cents applied somewhat arbitrarily to each transaction, this isn't a simple algebra problem. Instead, it's a kind of statistical clustering exercise that I'm not sure how to solve.

Any suggestions for how to approach this problem? The implementation can be in C# or T-SQL, whichever makes the most sense given the algorithm.

asked Dec 22 '11 by Justin Grant


3 Answers

Hough transform

Consider your problem in image terms: if you plot your input data on a diagram of price vs. fee, each scheme's entries form a straight line (with the rounding cents acting as noise). Treat the density map of your plot as an image, and the task reduces to finding straight lines in an image, which is exactly the job of the Hough transform.

You would essentially approach this by plotting, for each transaction, one line into a diagram of possible fixed fee versus possible variable fee, adding up the votes where lines cross. At the points corresponding to real fee schemes, many lines intersect and form large local maxima. By detecting these maxima, you find your fee schemes, and even a measure of how important each scheme is.
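A sketch of that voting step in C# (the asker's preferred language) could look like this; the grid ranges and bin counts are assumptions you would tune to your own data:

// Sketch of the voting step: each transaction (price, fee) votes for every
// (fixed, variable) pair consistent with fee = fixed + variable * price.
// The grid ranges and resolutions below are illustrative assumptions.
using System;
using System.Collections.Generic;

static class FeeSchemeHough
{
    public static int[,] Vote(IEnumerable<(int price, int fee)> txns,
                              double bMin = 0.0, double bMax = 0.05, int bBins = 500,
                              double aMin = 0.0, double aMax = 30.0, int aBins = 300)
    {
        var acc = new int[bBins, aBins];
        double bStep = (bMax - bMin) / bBins;
        double aStep = (aMax - aMin) / aBins;

        foreach (var (price, fee) in txns)
        {
            for (int bi = 0; bi < bBins; bi++)
            {
                double b = bMin + (bi + 0.5) * bStep;   // candidate variable rate
                double a = fee - b * price;             // fixed fee implied by this candidate
                int ai = (int)((a - aMin) / aStep);
                if (ai >= 0 && ai < aBins)
                    acc[bi, ai]++;                      // one vote along the line a = fee - b*price
            }
        }
        return acc;                                     // peaks = candidate fee schemes
    }
}

Scanning the accumulator for its largest local maxima (up to the expected 10) then yields candidate (fixed, variable) pairs; the rounding cents smear each transaction's votes across neighbouring fixed-fee bins, but the peaks should remain distinct.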

This approach will surely work, but it might take some time depending on the resolution you want to achieve. If computation time proves to be an issue, remember that a Voronoi diagram of a coarse Hough space can be used as a classifier, and once you have classified your points into fee schemes, simple linear regression solves your problem.
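For the refinement step, a possible C# sketch, assuming the coarse candidate schemes have already been extracted from the accumulator peaks (not shown here), is:

// Sketch of the refinement: assign each transaction to the nearest coarse candidate
// scheme, then refit each cluster with ordinary least squares.
using System;
using System.Collections.Generic;
using System.Linq;

static class SchemeRefinement
{
    public static List<(double fixedCents, double variable)> Refine(
        IList<(int price, int fee)> txns,
        IList<(double a, double b)> coarseSchemes)
    {
        // Classify: each transaction goes to the scheme whose prediction is closest to its fee.
        var clusters = txns.GroupBy(t =>
            coarseSchemes
                .Select((s, i) => (i, err: Math.Abs(t.fee - (s.a + s.b * t.price))))
                .OrderBy(x => x.err)
                .First().i);

        // Refit: simple linear regression (fee on price) within each cluster.
        var refined = new List<(double fixedCents, double variable)>();
        foreach (var cluster in clusters)
        {
            var pts = cluster.ToList();
            double n = pts.Count;
            double sx = pts.Sum(p => (double)p.price);
            double sy = pts.Sum(p => (double)p.fee);
            double sxx = pts.Sum(p => (double)p.price * p.price);
            double sxy = pts.Sum(p => (double)p.price * p.fee);
            double denom = n * sxx - sx * sx;
            if (Math.Abs(denom) < 1e-9) continue;   // degenerate cluster (single price point)
            double b = (n * sxy - sx * sy) / denom;
            double a = (sy - b * sx) / n;
            refined.Add((a, b));
        }
        return refined;
    }
}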

answered by thiton


Considering that a processing query's storage requirements are of the same order of magnitude as a day's worth of transaction data, I assume that such storage is not a problem, so:

  • First pass: Group the transactions for each card_id by transaction_price, keeping card_id, transaction_price and the average fee. This can easily be done in SQL. It assumes there are no outliers, but you can catch those after this stage if required. The resulting number of rows is guaranteed to be no higher than the number of raw data points.

  • Second pass: Per group, walk these new data points (with a cursor or in C#) and calculate the average value of b. Again, any outliers can be caught after this stage if desired.

  • Third pass: Per group, calculate the average value of a, now that b is known. This is basic SQL. Outliers can be handled as always.

If you decide to do the second step in a cursor you can stuff all that into a stored procedure.

Different card_id groups that use the same fee scheme can now be coalesced into fee schemes by rounding a and b to a sane precision and grouping again.
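A C#/LINQ sketch of these three passes plus the final coalescing could look like the following; the rounding precisions at the end (0.1 cent for a, four decimal places for b) are arbitrary choices:

// Sketch of the three passes above in C#/LINQ rather than T-SQL; names are illustrative.
using System;
using System.Collections.Generic;
using System.Linq;

static class ThreePassDerivation
{
    public static IEnumerable<(double fixedCents, double variable, int cardCount)> Derive(
        IEnumerable<(long cardId, int price, int fee)> txns)
    {
        // Pass 1: per (card_id, price), average the fee so the rounding cents average out.
        var byCardPrice = txns
            .GroupBy(t => (t.cardId, t.price))
            .Select(g => (g.Key.cardId, price: (double)g.Key.price, avgFee: g.Average(t => (double)t.fee)));

        var perCard = new List<(long cardId, double a, double b)>();
        foreach (var card in byCardPrice.GroupBy(p => p.cardId))
        {
            var pts = card.OrderBy(p => p.price).ToList();
            if (pts.Count < 2) continue;   // need at least two distinct prices to estimate a slope

            // Pass 2: average slope b between consecutive price points.
            double b = Enumerable.Range(1, pts.Count - 1)
                .Average(i => (pts[i].avgFee - pts[i - 1].avgFee) / (pts[i].price - pts[i - 1].price));

            // Pass 3: average intercept a, now that b is known.
            double a = pts.Average(p => p.avgFee - b * p.price);
            perCard.Add((card.Key, a, b));
        }

        // Coalesce card_ids whose rounded (a, b) agree into fee schemes.
        return perCard
            .GroupBy(c => (a: Math.Round(c.a, 1), b: Math.Round(c.b, 4)))
            .Select(g => (g.Key.a, g.Key.b, g.Count()));
    }
}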

answered by Eugen Rieck


The Hough transform is the most general answer, though I don't know how one would implement it in SQL (rather than pulling the data out and processing it in a general purpose language of your choice).

Alas, the naive version is known to be slow if you have a lot of input data (1000 points is kinda medium sized) and if you want high precision results (scales as size_of_the_input / (rho_precision * theta_precision)).

There is a faster approach based on 2^n-trees, but there are few implementations out on the web to just plug in. (I recently did one in C++ as a testbed for a project I'm involved in. Maybe I'll clean it up and post it somewhere.)


If there is some additional order to the data you may be able to do better (i.e. do the line segments form a piecewise function?).


Naive Hough transform

Define an accumulator in (theta, rho) space spanning [-pi, pi) and [0, max(hypotenuse(x, y))] as a 2D array.

Foreach point in the input data
   Foreach bin in theta
      find the distance rho of the altitude from the origin to
      a line through (x,y) and making angle theta with the horizontal
      rho = x cos(theta) + y sin(theta)
      and increment the bin (theta,rho) in the accumulator
Find the maximum bin in the accumulator; this
represents the most line-like structure in the data
if (theta != 0) { a = rho/sin(theta); b = -1/tan(theta); }
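A minimal C# rendering of that pseudocode (strongest line only, no smoothing) might look like:

// Minimal rendering of the pseudocode above: vote in (theta, rho) space,
// take the largest bin, and convert it back to the y = a + b*x form.
using System;
using System.Collections.Generic;
using System.Linq;

static class NaiveHough
{
    public static (double a, double b) StrongestLine(IList<(double x, double y)> points,
                                                     int thetaBins = 720, int rhoBins = 1000)
    {
        double rhoMax = points.Max(p => Math.Sqrt(p.x * p.x + p.y * p.y));
        var acc = new int[thetaBins, rhoBins];

        foreach (var (x, y) in points)
        {
            for (int ti = 0; ti < thetaBins; ti++)
            {
                double theta = -Math.PI + (ti + 0.5) * (2 * Math.PI / thetaBins);
                double rho = x * Math.Cos(theta) + y * Math.Sin(theta);
                if (rho < 0 || rho >= rhoMax) continue;   // keep rho within [0, rhoMax)
                int ri = (int)(rho / rhoMax * rhoBins);
                acc[ti, ri]++;
            }
        }

        // The maximum bin represents the most line-like structure in the data.
        int bestT = 0, bestR = 0, bestCount = -1;
        for (int ti = 0; ti < thetaBins; ti++)
            for (int ri = 0; ri < rhoBins; ri++)
                if (acc[ti, ri] > bestCount) { bestCount = acc[ti, ri]; bestT = ti; bestR = ri; }

        double bestTheta = -Math.PI + (bestT + 0.5) * (2 * Math.PI / thetaBins);
        double bestRho = (bestR + 0.5) * (rhoMax / rhoBins);
        // Convert rho = x cos(theta) + y sin(theta) back to y = a + b*x (theta != 0 assumed;
        // bin centers never land exactly on theta = 0 or +/-pi with these bin counts).
        double aIntercept = bestRho / Math.Sin(bestTheta);
        double bSlope = -1.0 / Math.Tan(bestTheta);
        return (aIntercept, bSlope);
    }
}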

Reliably getting multiple lines out of a single pass takes a little more bookkeeping, but it is not significantly harder.

You can improve the result a little by smoothing the data near the candidate peaks and fitting to get sub-bin precision, which should be faster than using smaller bins and should pick up the effect of the "rounding" cents fairly smoothly.

answered by dmckee --- ex-moderator kitten