
Efficient search in nested HBase entities

If I follow Ian Varley's HBase design practice and store a bunch of nested entities in the same HBase row (to benefit from HBase's single-row ACID properties), would it be possible to efficiently search, or even run MapReduce, over these nested entities to decide, based on certain criteria, whether the encapsulating (parent) entity should be selected?

For example, I have a customer entity with order entities nested in it

[Image: example customer row with CustomerInfo and Orders column families]

CustomerInfo and Orders are column families. For the Orders column family (the interesting one here), 1, 2 … 6 are column names (which are dynamic in HBase and can be added on the fly), and the text next to them is the order entity details (I serialized these details as text, but the serialization format does not matter, as HBase does not care).

If I have lots of entities like this customer entity (more details in 3. below):

  1. Would it be possible to select customer entities using a MapReduce (map-only?) approach, or any other efficient approach, that scans customer entities, reads the values of the customer orders inside them, and returns only those customer entities that contain orders matching specific criteria (e.g. Cost > 40)?

  2. Similarly, would it be possible to return the order entities that match the specified criteria (Cost > 40) along with the customer entities to display customers and their most expensive orders?

  3. Could this selection operation be made fast (less than a second?) if the number of orders per customer is very large (up to 100,000) and the number of customers is also large (up to 100,000)? Let's assume that I could build a very large HBase cluster (as needed) for that.

  4. Since I believe that 3) is not possible (as a single MapReduce worker would have to process those 100,000 serialized orders), what would be a better design for this problem (selecting customers quickly based on their order attributes)? Would de-normalizing customer entities into order entities that include customer information be a better approach?

user1234883 asked Jan 28 '26 20:01


1 Answer

  1. Would it be possible to select customer entities using a MapReduce (map-only?) approach, or any other efficient approach, that scans customer entities, reads the values of the customer orders inside them, and returns only those customer entities that contain orders matching specific criteria (e.g. Cost > 40)?
  2. Similarly, would it be possible to return the order entities that match the specified criteria (Cost > 40) along with the customer entities to display customers and their most expensive orders?

It is certainly possible to select entities with a MapReduce approach: each map call receives all the data for one row key, so you can parse the data, filter what you need, and write out only the customers and orders that match.
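A minimal sketch of that map-only selection, written in plain Python rather than against a real HBase or MapReduce API. The `ROWS` dictionary is a hypothetical in-memory stand-in for the table, and `parse_cost`, `map_row` and `select_customers` are illustrative names, not library functions. The `Item;Cost:$N` layout is taken from the example in the question.

```python
import re

# Hypothetical stand-in for the Orders column family:
# row key -> {column qualifier: serialized order text}.
ROWS = {
    "customer-1": {"1": "ItemA;Cost:$12", "2": "ItemB;Cost:$55"},
    "customer-2": {"1": "ItemC;Cost:$8"},
    "customer-3": {"1": "ItemD;Cost:$41", "2": "ItemE;Cost:$90"},
}

COST_RE = re.compile(r"Cost:\$(\d+(?:\.\d+)?)")


def parse_cost(serialized):
    """Extract the numeric cost from a serialized order, or None if absent."""
    match = COST_RE.search(serialized)
    return float(match.group(1)) if match else None


def map_row(row_key, orders, min_cost):
    """Map step for one row: emit (row_key, matching orders) or nothing."""
    matching = {}
    for qualifier, serialized in orders.items():
        cost = parse_cost(serialized)
        if cost is not None and cost > min_cost:
            matching[qualifier] = serialized
    if matching:
        yield row_key, matching


def select_customers(rows, min_cost):
    """Run the map step over every row; no reduce step is needed."""
    selected = {}
    for row_key, orders in rows.items():
        for key, matching in map_row(row_key, orders, min_cost):
            selected[key] = matching
    return selected


result = select_customers(ROWS, 40)
```

Because each row is handled independently, the same `map_row` logic covers both questions: the emitted key identifies the customer, and the emitted value carries the orders that satisfied the criterion.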

  3. Could this selection operation be made fast (less than a second?) if the number of orders per customer is very large (up to 100,000) and the number of customers is also large (up to 100,000)? Let's assume that I could build a very large HBase cluster (as needed) for that.

I don't think MapReduce is designed for on-the-fly processing; it is better suited to batch processing. You could try Spark for that.

  4. Since I believe that 3) is not possible (as a single MapReduce worker would have to process those 100,000 serialized orders), what would be a better design for this problem (selecting customers quickly based on their order attributes)? Would de-normalizing customer entities into order entities that include customer information be a better approach?

You could alter the design to take advantage of HBase scans and their filters. Instead of
1:"ItemA;Cost:$12"
you could try
1-ItemA:"12"
or
ItemA-1:"12"
You could also store the value as integer bytes rather than as a string, so that a scan with a value filter returns only the results you need (filters compare raw bytes, so numbers stored as text do not sort in numeric order).
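To illustrate why the value encoding matters for a byte-wise value filter, here is a small sketch in plain Python. It does not call HBase; `encode_cost` and `value_filter_greater` are hypothetical helpers that imitate a "greater than" comparison on raw cell bytes. The fixed-width big-endian encoding is an assumption and only orders correctly for non-negative costs.

```python
import struct


def encode_cost(cost):
    """Encode a non-negative integer cost as 4 big-endian bytes (assumed layout)."""
    return struct.pack(">I", cost)


def value_filter_greater(cells, threshold_bytes):
    """Keep cells whose raw value bytes compare greater than the threshold bytes."""
    return {q: v for q, v in cells.items() if v > threshold_bytes}


# Costs stored as text: byte-wise comparison is lexicographic, not numeric.
text_cells = {"1-ItemA": b"12", "2-ItemB": b"55", "3-ItemC": b"100"}
text_hits = value_filter_greater(text_cells, b"40")

# Costs stored as fixed-width integer bytes: byte order matches numeric order.
int_cells = {q: encode_cost(int(v)) for q, v in text_cells.items()}
int_hits = value_filter_greater(int_cells, encode_cost(40))

# Decode the surviving cells back to integers for display.
decoded_hits = {q: struct.unpack(">I", v)[0] for q, v in int_hits.items()}
```

With the text encoding, the order costing 100 is missed because the byte `"1"` sorts before `"4"`; with the integer encoding, both orders above 40 survive the filter.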

Or you could try a multi-layer architecture, with a data table for batch processing and an aggregated table for real-time access.
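One way to picture the two layers, as a rough sketch in plain Python with dictionaries standing in for the two tables. The choice of aggregate (the most expensive order per customer) is an assumption made for illustration, as are the names `build_aggregate` and `customers_with_order_above`; a real design would pick aggregates to match the queries it has to serve.

```python
# Hypothetical data table: row key -> {order qualifier: integer cost}.
DATA_TABLE = {
    "customer-1": {"1-ItemA": 12, "2-ItemB": 55},
    "customer-2": {"1-ItemC": 8},
    "customer-3": {"1-ItemD": 41, "2-ItemE": 90},
}


def build_aggregate(data_table):
    """Batch step: precompute one small summary row per customer."""
    aggregate = {}
    for row_key, orders in data_table.items():
        if not orders:
            continue  # customers without orders get no summary row
        top_order = max(orders, key=orders.get)
        aggregate[row_key] = {"max_order": top_order, "max_cost": orders[top_order]}
    return aggregate


def customers_with_order_above(aggregate, threshold):
    """Real-time step: read only the summaries, never the full order lists."""
    return sorted(k for k, s in aggregate.items() if s["max_cost"] > threshold)


AGGREGATE_TABLE = build_aggregate(DATA_TABLE)
fast_hits = customers_with_order_above(AGGREGATE_TABLE, 40)
```

A customer has some order above the threshold exactly when their maximum order cost is above it, so the real-time query touches one small summary per customer instead of up to 100,000 orders. The trade-off is that the aggregate is only as fresh as the last batch run.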

Averman answered Jan 31 '26 09:01


