Thoughts on minimize code and maximize data philosophy

Question

I have heard of the concept of minimizing code and maximizing data, and was wondering what advice other people can give me on how/why I should do this when building my own systems?

sarnold · Accepted Answer

Typically data-driven code is easier to read and maintain. I know I've seen cases where the data-driven approach has been taken to the extreme and the result winds up nearly unusable (I'm thinking of some SAP deployments I've used), but coding your own "Domain Specific Languages" to help you build your software is typically a huge time saver.

The Pragmatic Programmers remain, to my mind, the most vivid advocates of writing little languages that I have read. Little state machines that run little input languages can get a lot accomplished in very little space, and they make modifications easy.

A specific example: consider a progressive income tax system, with tax brackets at $1,000, $10,000, and $100,000 USD. Income below $1,000 is untaxed. Income between $1,000 and $9,999 is taxed at 10%. Income between $10,000 and $99,999 is taxed at 20%. And income of $100,000 or more is taxed at 30%. If you were to write this all out in code, it'd look about as you suspect:

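/* Progressive tax: each rate applies only to the income inside its bracket. */
double total_tax_burden(double income) {
    if (income < 1000)
        return 0;
    if (income < 10000)
        return .1 * (income - 1000);
    if (income < 100000)
        return 900 + .2 * (income - 10000);    /* 900 = 10% of 9,000 */
    return 18900 + .3 * (income - 100000);     /* 18,900 = 900 + 20% of 90,000 */
}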

Adding new tax brackets, changing the existing brackets, or changing the tax burden in the brackets, would all require modifying the code and recompiling.

But if it were data-driven, you could store this table in a configuration file:

1000:0
10000:10
100000:20
inf:30

Write a little tool to parse this table and do the lookups (not very difficult, right?) and now anyone can easily maintain the tax rate tables. If Congress decides that 1000 brackets would be better, anyone could make the tables line up with the IRS tables and be done with it, with no recompiling necessary. The same generic code could be used for one bracket or hundreds of brackets.
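For what it's worth, that little tool could be sketched in Python roughly like this. It assumes each line of the table is upper_bound:percent, as in the file above, and that each rate applies only to the income that falls inside its bracket. The names parse_brackets and total_tax_burden are just illustrative.

```python
def parse_brackets(text):
    """Parse 'upper_bound:percent' lines into a sorted list of (upper_bound, rate)."""
    brackets = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        bound, percent = line.split(":")
        # float("inf") is valid, so the open-ended top bracket needs no special case.
        brackets.append((float(bound), float(percent) / 100.0))
    brackets.sort()
    return brackets


def total_tax_burden(income, brackets):
    """Sum the tax owed on the slice of income that falls inside each bracket."""
    tax = 0.0
    lower = 0.0
    for upper, rate in brackets:
        if income <= lower:
            break
        tax += (min(income, upper) - lower) * rate
        lower = upper
    return tax


TABLE = """\
1000:0
10000:10
100000:20
inf:30
"""

brackets = parse_brackets(TABLE)
print(total_tax_burden(50000, brackets))
```

Changing a rate or adding a bracket is now an edit to the table text, and the two functions never change.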

And now for something that is a little less obvious: testing. The AppArmor project has hundreds of tests for what system calls should do when various profiles are loaded. One sample test looks like this:

#! /bin/bash
# $Id$

#   Copyright (C) 2002-2007 Novell/SUSE
#
#   This program is free software; you can redistribute it and/or
#   modify it under the terms of the GNU General Public License as
#   published by the Free Software Foundation, version 2 of the
#   License.

#=NAME open
#=DESCRIPTION 
# Verify that the open syscall is correctly managed for confined profiles.  
#=END

pwd=`dirname $0`
pwd=`cd $pwd ; /bin/pwd`

bin=$pwd

. $bin/prologue.inc

file=$tmpdir/file
okperm=rw
badperm1=r
badperm2=w

# PASS UNCONFINED
runchecktest "OPEN unconfined RW (create) " pass $file

# PASS TEST (the file shouldn't exist, so open should create it
rm -f ${file}
genprofile $file:$okperm
runchecktest "OPEN RW (create) " pass $file

# PASS TEST
genprofile $file:$okperm
runchecktest "OPEN RW" pass $file

# FAILURE TEST (1)
genprofile $file:$badperm1
runchecktest "OPEN R" fail $file

# FAILURE TEST (2)
genprofile $file:$badperm2
runchecktest "OPEN W" fail $file

# FAILURE TEST (3)
genprofile $file:$badperm1 cap:dac_override
runchecktest "OPEN R+dac_override" fail $file

# FAILURE TEST (4)
# This is testing for bug: https://bugs.wirex.com/show_bug.cgi?id=2885
# When we open O_CREAT|O_RDWR, we are (were?) allowing only write access
# to be required.
rm -f ${file}
genprofile $file:$badperm2
runchecktest "OPEN W (create)" fail $file

It relies on some helper functions to generate and load profiles, test the results of the functions, and report back to users. It is far easier to extend these little test scripts than it is to write this sort of functionality without a little language. Yes, these are shell scripts, but they are so far removed from actual shell scripts ;) that they are practically data.
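The same idea in miniature, as a Python sketch: the test cases are rows of data and one small loop runs them. This does not touch AppArmor at all. The open_rw_allowed function is a made-up stand-in for "load a profile and try the syscall", and the rows only mirror a few of the cases in the script above.

```python
# Each row is one test case: (description, permissions granted, expected outcome).
CASES = [
    ("OPEN RW", "rw", "pass"),
    ("OPEN R", "r", "fail"),
    ("OPEN W", "w", "fail"),
]


def open_rw_allowed(granted):
    """Hypothetical stand-in: opening a file read-write needs both r and w."""
    return "r" in granted and "w" in granted


def run_cases(cases, check):
    """Run every row through check and return descriptions of unexpected results."""
    failures = []
    for description, granted, expected in cases:
        outcome = "pass" if check(granted) else "fail"
        if outcome != expected:
            failures.append(description)
    return failures


print(run_cases(CASES, open_rw_allowed))
```

Adding a test means adding a row, which is about as close to pure data as a test suite gets.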

I hope this helps motivate data-driven programming; I'm afraid I'm not as eloquent as others who have written about it, and I certainly haven't gotten good at it, but I try.

thkala · Answer

In modern software the line between code and data can become awfully thin and blurry, and it is not always easy to tell the two apart. After all, as far as the computer is concerned, everything is data, unless it is determined by existing code - normally the OS - to be otherwise. Even programs have to be loaded into memory as data, before the CPU can execute them.

For example, imagine an algorithm that computes the cost of an order, where larger orders get lower prices per item. It is part of a larger software system in a store, written in C.

The algorithm reads a file that contains an input table provided by the management, listing the various per-item prices and the corresponding order size thresholds. Most people would argue that a file with a simple input table is, of course, data.

Now, imagine that the store changes its policy to some sort of asymptotic function, rather than pre-selected thresholds, so that it can accommodate insanely large orders. They might also want to factor in exchange rates and inflation - or whatever else the management people come up with.

The store hires a competent programmer and she embeds a nice mathematical expression parser in the original C code. The input file now contains an expression with global variables, functions such as log() and tan(), as well as some simple stuff like the Planck constant and the rate of carbon-14 degradation.

cost = (base * ordered * exchange * ... + ... / ...)^13

Most people would still argue that the expression, even if not as simple as a table, is in fact data. After all it is probably provided as-is by the management.

The store receives a large number of complaints from clients who became brain-dead trying to estimate their expenses, and from the accounting people about the large amount of loose change. The store decides to go back to the table for small orders and use a Fibonacci sequence for larger orders.

The programmer gets tired of modifying and recompiling the C code, so she embeds a Python interpreter instead. The input file now contains a Python function that polls a roomful of Fib(n) monkeys for the cost of large orders.

Question: Is this input file data?

From a strictly technical point of view, nothing has changed. Both the table and the expression needed to be parsed before use. The mathematical expression parser probably supported branching and functions - it might not have been Turing-complete, but it still used a language of its own (e.g. MathML).

Yet now many people would argue that the input file just became code.
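To make the two ends of that story concrete, here is a rough Python sketch (not the store's C program). The prices, thresholds and function names are all made up for illustration. In both cases the host program just reads some text from an input file and gets a cost back.

```python
import bisect

# First version: the input file is a table of "minimum order size:price per item".
TABLE_INPUT = "1:10.0\n100:9.0\n1000:8.0"  # made-up numbers


def cost_from_table(text, ordered):
    rows = []
    for line in text.splitlines():
        size, price = line.split(":")
        rows.append((int(size), float(price)))
    rows.sort()
    sizes = [size for size, _ in rows]
    # Pick the last row whose minimum order size does not exceed the order.
    index = max(bisect.bisect_right(sizes, ordered) - 1, 0)
    return rows[index][1] * ordered


# Last version: the input file is a Python function.
SCRIPT_INPUT = """
def cost(ordered):
    per_item = 8.0 if ordered >= 1000 else 10.0
    return per_item * ordered
"""


def cost_from_script(text, ordered):
    namespace = {}
    exec(text, namespace)  # executes whatever the file contains
    return namespace["cost"](ordered)


print(cost_from_table(TABLE_INPUT, 150))
print(cost_from_script(SCRIPT_INPUT, 2000))
```

From the host program's side the two look almost identical. What differs is what someone has to know to edit the file safely, and the fact that exec will run anything the file contains, so the second file has to be trusted in a way the table does not.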

So what is the distinguishing feature that turns the input format from data into code?

Modifiability: Having to recompile the whole system to effect a change is a very good indication of a code-centric system. Yet I can easily imagine (well, more like I have actually seen) software that has been designed incompetently enough to have e.g. an input table built in at compile time. And let's not forget that many applications still have icons - that most people would deem data - built into their executables.
Input format: This is the most common factor that people consider, naively in my opinion: "If it is in a programming language then it is code". Fine, C is code - you have to compile it after all. I would also agree that Python is code - it is a full-blown language. So why isn't XML/XSL code? XSL is quite a complex language in its own right - hence the L in its name.

In my opinion, neither of these two criteria is the actual distinguishing feature. I think that people should consider something else:

Maintainability: In short, if the user of the system has to hire a third party to get the expertise needed to modify the behaviour of the system, then the system should be considered code-centric to a degree.

This, of course, means that whether a system is data-driven or not should be considered at least in relation to the target audience - if not in relation to the client on a case-by-case basis.

It also means that the distinction can be affected by the available toolset. The UML specification is a nightmare to go through, but these days we have all those graphical UML editors to help us. If there were some kind of third-party high-level AI tool that parsed natural language and produced XML/Python/whatever, then the system would become data-driven even for far more complex input.

A small store probably does not have the expertise or the resources to hire a third party. So, something that allows the workers to modify its behaviour with the knowledge that one would get in an average management course - mathematics, charts etc - could be considered sufficiently data-driven for this audience.

On the other hand, a multi-billion international corporation usually has a bunch of IT specialists and Web designers on its payroll. Therefore, XML/XSL, Javascript, or even Python and PHP are probably easy enough for it to handle. It also has complex enough requirements that something simpler might just not cut it.

I believe that when designing a software system, one should strive for that fine balance in its input formats where the target audience can do what they need to, without having to frequently call on third parties.

It should be noted that outsourcing blurs the lines even more. There are quite a few issues for which current technology simply does not allow the solution to be approachable by the layman. In that case the target audience of the solution should probably be considered to be the third party to which the operation would be outsourced. That third party can be expected to employ a fair number of experts.

kirakun · Answer

One of five maxims under the Unix Philosophy, as presented by Rob Pike, is this:

Data dominates. If you have chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.

It is often shortened to "write stupid code that uses smart data."
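A tiny illustration of that maxim in Python (my own example, not one from Pike): converting a number to Roman numerals. Once the table of values and symbols is written down, the code that uses it is almost too dull to get wrong.

```python
# The "smart data": every value/symbol pair, including the subtractive forms,
# listed from largest to smallest.
ROMAN = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def to_roman(number):
    """The "stupid code": walk the table and take as many of each symbol as fit."""
    parts = []
    for value, symbol in ROMAN:
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)


print(to_roman(1994))  # MCMXCIV
```

All the knowledge about subtractive forms like CM and IV lives in the table. The loop has no special cases.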
