We build software that audits fees charged by banks to merchants that accept credit and debit cards. Our customers want us to tell them if the card processor is overcharging them. Per-transaction credit card fees are calculated like this:
fee = fixed + variable*transaction_price
A "fee scheme" is the pair of (fixed, variable)
used by a group of credit cards, e.g. "MasterCard business debit gold cards issued by First National Bank of Hollywood". We believe there are fewer than 10 different fee schemes in use at any time, but we aren't getting a complete nor current list of fee schemes from our partners. (yes, I know that some "fee schemes" are more complicated than the equation above because of caps and other gotchas, but our transactions are known to have only a + bx
schemes in use).
Here's the problem we're trying to solve: we want to use per-transaction data about fees to derive the fee schemes in use. Then we can compare that list to the fee schemes that each customer should be using according to their bank.
The data we get about each transaction is a data tuple: (card_id, transaction_price, fee)
.
transaction_price
and fee
are in integer cents. The bank rolls over fractional cents for each transation until the cumulative is greater than one cent, and then a "rounding cent" will be attached to the fees of that transaction. We cannot predict which transaction the "rounding cent" will be attached to.
card_id
identifies a group of cards that share the same fee scheme. In a typical day of 10,000 transactions, there may be several hundred unique card_id
's. Multiple card_id
's will share a fee scheme.
The data we get looks like this, and what we want to figure out is the last two columns.
card_id transaction_price fee fixed variable
=======================================================================
12345 200 22 ? ?
67890 300 21 ? ?
56789 150 8 ? ?
34567 150 8 ? ?
34567 150 "rounding cent"-> 9 ? ?
34567 150 8 ? ?
The end result we want is a short list like this with 10 or fewer entries showing the fee schemes that best fit our data. Like this:
fee_scheme_id fixed variable
======================================
1 22 0
2 21 0
3 ? ?
4 ? ?
...
The average fee is about 8 cents. This means the rounding cents have a huge impact and the derivation above requires a lot of data.
The average transaction is 125 cents. Transaction prices are always on 5-cent boundaries.
We want a short list of fee schemes that "fit" 98%+ of the 3,000+ transactions each customer gets each day. If that's not enough data to achieve 98% confidence, we can use multiple days' of data.
Because of the rounding cents applied somewhat arbitrarily to each transaction, this isn't a simple algebra problem. Instead, it's a kind of statistical clustering exercise that I'm not sure how to solve.
Any suggestions for how to approach this problem? The implementation can be in C# or T-SQL, whichever makes the most sense given the algorithm.
Hough transform
Consider your problem in image terms: If you would plot your input data on a diagram of price vs. fee, each scheme's entries would form a straight line (with rounding cents being noise). Consider the density map of your plot as an image, and the task is reduced to finding straight lines in an image. Which is just the job of the Hough transform.
You would essentially approach this by plotting one line for each transaction into a diagram of possible fixed fee versus possible variable fee, adding the values of lines where they cross. At the points of real fee schemes, many lines will intersect and form a large local maximum. By detecting this maximum, you find your fee scheme, and even a degree of importance for the fee scheme.
This approach will surely work, but might take some time depending on the resolution you want to achieve. If computation time proves to be an issue, remember that a Voronoi diagram of a coarse Hough space can be used as a classificator - and once you have classified your points into fee schemes, simple linear regression solves your problem.
Considering, that a processing query's storage requirements are in the same power of 2 as a day's worth of transaction data, I assume that such storage is not a problem, so:
First pass: Group the transactions for each card_id by transaction_price, keeping card_id, transaction_price and average fee. This can easily be done in SQL. This assumes, there are not outliers - but you can catch those at after this stage if so required. The resulting number of rows is guaranteed to be no higher than the number of raw data points.
Second pass: Per group walk these new data points (with a cursor or in C#) and calculate the average value of b. Again any outliers can be caught if desired after this stage.
Third pass: Per group calculate the average value of a, now that b is known. This is basic SQL. Outliers as allways
If you decide to do the second step in a cursor you can stuff all that into a stored procedure.
Different card_id groups, that use the same fee scheme can now be coalesced (Sorry of this is the wrong word, non-english native) into fee schemes by rounding a and b with a sane precision and again grouping.
The Hough transform is the most general answer, though I don't know how one would implement it in SQL (rather than pulling the data out and processing it in a general purpose language of your choice).
Alas, the naive version is known to be slow if you have a lot of input data (1000 points is kinda medium sized) and if you want high precision results (scales as size_of_the_input / (rho_precision * theta_precision)
).
There is a faster approach based on 2^n-trees, but there are few implementations out on the web to just plug in. (I recently did one in C++ as a testbed for a project I'm involved in. Maybe I'll clean it up and post it somewhere.)
If there is some additional order to the data you may be able to do better (i.e. do the line segments form a piecewise function?).
Naive Hough transform
Define an accumulator in (theta,rho) space spanning [-pi,pi) and [0,max(hypotenuse(x,y)] as an 2D-array.
Foreach point in the input data
Foreach bin in theta
find the distance rho of the altitude from the origin to
a line through (a,y) and making angle theta with the horizontal
rho = x cos(theta) + y sin(theta)
and increment the bin (theta,rho) in the accumulator
Find the maximum bin in the accumulator, this
represents the most line-like structure in the data
if (theta !=0) {a = rho/sin(theta); b = -1/tan(theta);}
Reliably getting multiple lines out of a single pass takes a little more bookkeeping, but it is not significantly harder.
You can improve the result a little by smoothing the data near the candidate peaks and fitting to get sub-bin precision which should be faster than using smaller bins and should pickup the effect of the "rounding" cents fairly smoothly.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With