Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

NoSQL Schemaless data and statically typed language

One of the key benefits of NoSQL data stores like MongoDB is that they're schemaless. With dynamically typed languages this seem to be a natural fit. You can receive some arbitrary JSON inputs, perform business logic on the known fields, and persist the whole thing without first having to define the object.

What if your choice of language is limited to the statically typed, say Java? How could I achieve the same level of flexibility?

A typical data flow like the following:

  1. JSON Input
  2. Serialize to Java Object to perform business logic
  3. Deserialize into BSON to persist in Mongo

where the serialization to object step is necessary since you want to perform business logic with POJOs, not JSON strings. However, before I can serialize the input into objects, I must define it first. What if the input contains additional fields undefined in the object? While they may not be used in the business logic, I may still want to be able to persist them. I have seem implementations where the undefined fields are put into a map, but am not sure if that's the best approach. For one, the undefined fields may be complex objects as well.

like image 432
ltfishie Avatar asked Nov 04 '11 05:11

ltfishie


2 Answers

Schemaless data doesn't necessarily mean structureless data; the fields are typically known in advance and some type-safe pattern can be applied on top of it to avoid the Magic Container anti-pattern But this is not always the case. Sometimes keys are entered by the user and cannot be known in advance.

I've used the Role Object Pattern several times to give coherence to a dynamic structure. I think it is well suited here for both cases.

The Role Object Pattern defines a way to access different views of an object. The canonical example being a User that can assume several roles such as Customer, Vendor, and Seller. Each of these views has different operations it can perform and can be accessed from any of the other views. Common fields are typically available at the interface level (especially userId(), or in your case toJson()).

Here's an example of using the pattern:

public void displayPage(User user) {
    display(user.getName());

    if (user.hasView(Customer.class))
       displayShoppingCart(user.getView(Customer.class);

    if (user.hasView(Seller.class))
       displayProducts(user.getView(Seller.class));
}

In the case of data with a known structure, you can have several views bringing different sets of keys into cohesive units. These different views can read the json data on construction.

In the case of data with a dynamic structure, an authoritative RawDataView can have the data in it's dynamic form (ie. a Magic Container like a HashMap<String, Object>). This can be used to query the dynamic data. At the same time, type-safe wrappers can be created lazily and can delegate to the RawDataView to assist in program readability/maintainability:

 public class Customer implements User {
     private final RawDataView data;
     public CustomerView(UserView source) {
         this.data = source.getView(RawDataView.class);
     }

     // All User views must specify this
     @Override
     public long id() {
         return data.getId();
     }

     @Override
     public <T extends UserView> T getView(Class<T> view) {
         // construct or look up view
     }

     @Override
     public Json toJson() {
         return data.toJson();
     }


     //
     // Specific to Customer
     //
     public List<Item> shoppingCart() {
         List<Item> items = (List<Item>) data.getValue("items", List.class); 
     }

     // etc....
 }

I've had success with both of these approaches. Here are some extra pointers that I've discovered along the way:

  • Have a static structure structure to your data as much as possible. This makes things a lot easier to maintain. I had to break this rule and use the RawDataView approach when working on a legacy system. You may also have to break it with dynamically-entered user data as mentioned above. In which case, use a convention for non-dynamic field names such as a leading underscore (_userId)
  • Have equals() and hashcode() implemented such that user.getView(A.class).equals(user.getView(B.class)) is always true for the same user.
  • Have a UserCore class that does all the heavy lifting of common code such as creating views; performing common operations (like toJson()) returning common fields (like userId()); and implementing equals() and hashcode(). Have all views delegate to this core object
  • Have an AbstractUserView that delegates to the UserCore and implements equals() and hashcode()
  • Use a type-safe heterogeneous container (like ClassToInstanceMap) constructing/caching views.
  • Allow the existence of a view to be queried. This can be done with either a hasView() method or by having getView return Optional<T>
like image 99
Michael Deardeuff Avatar answered Oct 13 '22 22:10

Michael Deardeuff


You can always have a class which provides both:

  • easy access to attributes you know about and optional fallback cases to older formats (for example it can return "name" if it exists, or older case of "name.first" + "name.last" if it doesn't (or some similar scenario))
  • easy access to unknown elements simulating the map interface

Whether you do a full validation or not, whether you allow extra undefined attributes or not depends on what you want to achieve. But I think that creating an abstraction which allows you either way of accessing the data is the best solution.

Hopefully over time, you'll get to the stage where your schema is pretty much stable and messing directly with the attributes is not needed anymore.

like image 34
viraptor Avatar answered Oct 13 '22 22:10

viraptor