I'm building a management application to help manage my mobile auto detailing company (and hopefully others). I'm struggling to figure out how to model some of the data.
This question is related to a previous question that I've posted, but I've reproduced the relevant information below: Database design - google app engine
In this application, there are concepts of "Appointments" and "Line Items."
Appointments are a place and time where employees are expected to be in order to deliver a service.
Line Items are a service, fee or discount and its associated information. An example of line items that might go into an appointment:
Name: Price: Commission: Time estimate Full Detail, Regular Size: 160 75 3.5 hours $10 Off Full Detail Coupon: -10 0 0 hours Premium Detail: 220 110 4.5 hours Derived totals(not a line item): $370 $185 8.0 hours
In my previous implementation of this application, Line Items were contained by a single appointment. This worked fine most of the time, but caused problems sometimes. An example would be if an appointment got interrupted half-way through because of rain and the technician had to come back out the next day and finish up. This situation required two appointments for the same line item. In cases like this, I would just fudge the data a little by setting the "line item" on the second appointment to read something like "Finish Up" and then the cost would be $0.
In this next version, I am considering enabling Line Items to be matched with more than one appointment with a table structure that looks like this:
Appointment
start_time
etc...
Line_Item
appointment_Key_List
name
price
etc...
A general problem with this structure is that it is complicated and I'm not even sure if its appropriate to match one line item with multiple appointments. If Line Items can only be part of one Appointment, then I can actually just put a list of line items IN each Appointment, when I get Appointments, I'd already be getting Line Items.
A more specific problem is that I am using google app engine and if I want to query for a set of appointments and their associated line items, I'd have to first query for the set of appointments and then do a second query for the line items using the IN operator to test if any of the Line_Item's appointment keys fall into the set of appointment keys the were returned from the previous query. The second query will fail if I have more than 30 keys requiring me to shard the query. I could denormalize the data to avoid this complicated and extensive read query, and I will probably have to denormalize to some degree anyway, but I'd rather avoid complexity where appropriate.
My question is how is this type of situation usually modeled? Is it even appropriate for a Line Item to be paired with more than one appointment, or is it normal to simply split line items into separate ones for each appointment such as "1st half of 2 day job" and "2nd half of two day job." How do similar successful applications do this? What are the rules of thumb in this type of situation? What implementations have turned out to be less problematic?
Thanks!
The approach you're suggesting will work fine; you can model the line item's 'appointment_Key_list' as a list property and it will work as you expect. You don't have to use the IN operator - that's for matching a single value in the datastore against a list of keys you have (eg, "WHERE datastore_column IN ('a', 'b', 'c')), while you're doing the reverse - matching a single value against a list in the datastore.
I would suggest, though, that the reverse might be better suited to your task: Have each Appointment have a list of line item keys. This operates much the same way, but to retrieve all the data on an appointment, you instead first fetch the appointment, then do a bulk get on the line items, using the keys from the Appointment entity. If you know the key of the Appointment, you're thus avoiding the need to do any queries at all.
I've been trying to explain to Pindatjuh why querying a list property is no less efficient than a singly-valued one, but apparrently a more detailed description is required, so without any further ado, here is...
Although Python and Java provide various high level interfaces to the datastore, the datastore itself speaks a lower-level abstraction, called entities. An entity consist of the following:
The primary key is the Datastore key you're already familiar with. The list of (name, value) pairs is App Engine's representation for the data in your entity. So far so straightforward. An entity with the following values:
a_string = "Hello, world"
an_int = 123
would be serialized to something resembling this:
[('a_string', 'Hello, world'), ('an_int', 123)]
But how does this interact with lists? Well, lists are treated as 'multiply valued' properties. That is, a list with n items is stored as n separate properties. An example probably makes this clearer:
a_string = "Hello, world"
an_int = 123
a_list_of_ints = [42, 314, 9]
will be serialized as:
[('a_string', 'Hello, world'), ('an_int', 123), ('a_list_of_ints', 42), ('a_list_of_ints', 314), ('a_list_of_ints', 9)]
As you can see, the list gets represented a series of values, all with the same name. When you load data from the datastore, the SDK sees the repeated value and turns it into a list.
Where this gets important is when it interacts with indexing. Suppose you have an index on 'a_string' and 'an_int'. When you insert or modify a value, App Engine generates a set of index entries for it; for the above index and the above entity, it generates a single row in the index that looks something like this:
('Hello, world', 123, a_key)
('a_key' here is a placeholder for the key of the original entity.) When you do a query that uses this index, it just needs to do a lookup on the index to find rows with the appropriate prefix (Eg, 'SELECT * FROM Kind WHERE a_string = "Hello, world" ORDER BY an_int').
When you index a list, though, App Engine inserts multiple index rows. An index on 'an_int' and 'a_list_of_ints' would generate these rows for the above entity:
(123, 42, a_key)
(123, 314, a_key)
(123, 9, a_key)
Again, querying works the same as it did previously - App Engine just has to look up the row with the correct prefix in the index. The number of entries in the list has no impact on how fast the query is - only on how long it took to generate and write the index entries. In fact, the query planner has no idea that 'a_list_of_ints' is a multiply valued property - it just treats it like any other index entry.
So in a nutshell:
The usual solution for this kind of problems is normalizing the model, i.e. to the First Normal Form.
Your model, in normalized form, would have a third table, with references to the Appointment
and Line_Item
rows:
Appointment
start_time
...
Line_Item
name
price
...
Appointment_Line_Item
appointment_key
line_item_key
There is a problem however! Since you are using Google App Engine, and their Datastore is quite limited ("GQL cannot perform an SQL-like JOIN") and mostly requires denormalization.
You suggested using a list-like field. It is a possiblity to use this, but it is very hard to index it. Searching for a key (the appointment_key
) in a list per row in the database is not really performing. I propose two possiblities:
Duplicate Line_Item
.
Line_Item
appointment_key
name
price
finished
...
A Line_Item
should have a finished
state, when the item was finished or not by the employee. If an employee hadn't finished all line items, mark them as unfinished, create a new appointment and copy all items that were unfinished. You can index on the appointment_key
field on all Line_Items
, which is a Good Thing. However, the duplicated data may be a problem.
Dynamic fields for Line_Item
:
Line_Item
duplicate_key
appointment_key
name
price
finished
...
Create a new field, duplicate_key
, for Line_Item
which points to another Line_Item
or to null (reserve this key!). Null means that the Line_Item
is original, any other value means that this Line_Item
is a duplicate of the Line_Item
the field points to. All fields of Line_Item
marked as a duplicate inherit the fields of the original Line_Item
, except the appointment_key
: so it will take less storage. Also this solution should have appointment_key
indexed, to speed up lookup times. This requires one additional query per duplicated Line_Item
, which may be a problem.
Now, it's a clear choice: either better speed or better storage. I would go for the first, as it reduces complexity of your model, and storage is never a problem with modern systems. Less complexity generally means less bugs and less development/testing costs, which justifies the cost of the storage requirement.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With