Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Writing a Domain Specific Language for selecting rows from a table

I'm writing a server that I expect to be run by many different people, not all of whom I will have direct contact with. The servers will communicate with each other in a cluster. Part of the server's functionality involves selecting a small subset of rows from a potentially very large table. The exact choice of what rows are selected will need some tuning, and it's important that it's possible for the person running the cluster (eg, myself) to update the selection criteria without getting each and every server administrator to deploy a new version of the server.

Simply writing the function in Python isn't really an option, since nobody is going to want to install a server that downloads and executes arbitrary Python code at runtime.

What I need are suggestions on the simplest way to implement a Domain Specific Language to achieve this goal. The language needs to be capable of simple expression evaluation, as well as querying table indexes and iterating through the returned rows. Ease of writing and reading the language is secondary to ease of implementing it. I'd also prefer not to have to write an entire query optimiser, so something that explicitly specifies what indexes to query would be ideal.

The interface that this will have to compile against will be similar in capabilities to what the App Engine datastore exports: You can query for sequential ranges on any index on the table (eg, less-than, greater-than, range and equality queries), then filter the returned row by any boolean expression. You can also concatenate multiple independent result sets together.

I realise this question sounds a lot like I'm asking for SQL. However, I don't want to require that the datastore backing this data be a relational database, and I don't want the overhead of trying to reimplement SQL myself. I'm also dealing with only a single table with a known schema. Finally, no joins will be required. Something much simpler would be far preferable.

Edit: Expanded description to clear up some misconceptions.

like image 319
Nick Johnson Avatar asked Sep 26 '08 14:09

Nick Johnson


2 Answers

Building a DSL to be interpreted by Python.

Step 1. Build the run-time classes and objects. These classes will have all the cursor loops and SQL statements and all of that algorithmic processing tucked away in their methods. You'll make heavy use of the Command and Strategy design patterns to build these classes. Most things are a command, options and choices are plug-in strategies. Look at the design for Apache Ant's Task API -- it's a good example.

Step 2. Validate that this system of objects actually works. Be sure that the design is simple and complete. You're tests will construct the Command and Strategy objects, and then execute the top-level Command object. The Command objects will do the work.

At this point you're largely done. Your run-time is just a configuration of objects created from the above domain. [This isn't as easy as it sounds. It requires some care to define a set of classes that can be instantiated and then "talk among themselves" to do the work of your application.]

Note that what you'll have will require nothing more than declarations. What's wrong with procedural? One you start to write a DSL with procedural elements, you find that you need more and more features until you've written Python with different syntax. Not good.

Further, procedural language interpreters are simply hard to write. State of execution, and scope of references are simply hard to manage.

You can use native Python -- and stop worrying about "getting out of the sandbox". Indeed, that's how you'll unit test everything, using a short Python script to create your objects. Python will be the DSL.

["But wait", you say, "If I simply use Python as the DSL people can execute arbitrary things." Depends on what's on the PYTHONPATH, and sys.path. Look at the site module for ways to control what's available.]

A declarative DSL is simplest. It's entirely an exercise in representation. A block of Python that merely sets the values of some variables is nice. That's what Django uses.

You can use the ConfigParser as a language for representing your run-time configuration of objects.

You can use JSON or YAML as a language for representing your run-time configuration of objects. Ready-made parsers are totally available.

You can use XML, too. It's harder to design and parse, but it works fine. People love it. That's how Ant and Maven (and lots of other tools) use declarative syntax to describe procedures. I don't recommend it, because it's a wordy pain in the neck. I recommend simply using Python.

Or, you can go off the deep-end and invent your own syntax and write your own parser.

like image 172
S.Lott Avatar answered Sep 30 '22 11:09

S.Lott


I think we're going to need a bit more information here. Let me know if any of the following is based on incorrect assumptions.

First of all, as you pointed out yourself, there already exists a DSL for selecting rows from arbitrary tables-- it is called "SQL". Since you don't want to reinvent SQL, I'm assuming that you only need to query from a single table with a fixed format.

If this is the case, you probably don't need to implement a DSL (although that's certainly one way to go); it may be easier, if you are used to Object Orientation, to create a Filter object.

More specifically, a "Filter" collection that would hold one or more SelectionCriterion objects. You can implement these to inherit from one or more base classes representing types of selections (Range, LessThan, ExactMatch, Like, etc.) Once these base classes are in place, you can create column-specific inherited versions which are appropriate to that column. Finally, depending on the complexity of the queries you want to support, you'll want to implement some kind of connective glue to handle AND and OR and NOT linkages between the various criteria.

If you feel like it, you can create a simple GUI to load up the collection; I'd look at the filtering in Excel as a model, if you don't have anything else in mind.

Finally, it should be trivial to convert the contents of this Collection to the corresponding SQL, and pass that to the database.

However: if what you are after is simplicity, and your users understand SQL, you could simply ask them to type in the contents of a WHERE clause, and programmatically build up the rest of the query. From a security perspective, if your code has control over the columns selected and the FROM clause, and your database permissions are set properly, and you do some sanity checking on the string coming in from the users, this would be a relatively safe option.

like image 43
Michael Dorfman Avatar answered Sep 30 '22 11:09

Michael Dorfman