I need to analyze 1 TB+ of web access logs, and in particular I need to analyze statistics relating to requested URLs and subsets of the URLs (child branches). If possible, I want the queries to be fast over small subsets of the data (e.g. 10 million requests). For example, given an access log with the following URLs being requested: <pre class="prettyprint"><code>/ocp/about_us.html /ocp/security/ed-209/patches/urgent.html /ocp/security/rc/ /ocp/food/ /weyland-yutani/products/ </code></pre> I want to do queries such as: <ul> <li>Count the number of requests for everything 'below' /ocp. </li> <li>Same as above, but only count requests for child nodes under /ocp/security</li> <li>Return the top 5 most frequently requested URLs.</li> <li>Same as above, except group by an arbitrary depth,</li> </ul> e.g. For the last query above, depth 2 for the data would return: <pre class="prettyprint"><code>2: /ocp/security/ 1: /ocp/ 1: /ocp/food/ 1: /weyland-yutani/products/ </code></pre> I think the ideal approach would probably be to use a column DB and tokenize the URLs such that there is a column for each element in the URL. However, I would really like to find a way to do this with open source apps if possible. HBase is a possibility, but query performance seems too slow to be useful for real-time queries (also, I don't really want to be in the business of re-implementing SQL) I'm aware there are commercial apps for doing this this type of analytics, but for various reasons I want to implement this myself.

Before investing too much time into designing a hierarchical data structure on top of a relational database, consider reading "Naive Trees" section (starting at slide 48) in the excellent presentation SQL Anti-Patterns Strike Back by Bill Karwin. Bill outlines the following methods for developing a hierarchy: <ol> <li>Path enumeration (slide 55) </li> <li>Nested sets (slide 58)</li> <li>Closure table (slide 68)</li> </ol>

Trees are generally not very efficient in databases. I mean: if you'd design the tree to be truly recursive, with items pointing to their parents, you'll get lots of queries to find all sub-nodes. But you can optimize the tree, according to your needs. Put any part of the url into a column is not a bad idea. You need to limit the depth to a certain number of sub-nodes. You could have indexes on any column, which makes it very fast. Queries on such a structure are very simple: <pre class="prettyprint"><code>Select count(*) From Hits where node1 = 'ocp' AND node2 = 'security'; </code></pre> Make a access statistic: <pre class="prettyprint"><code>SELECT node1, node2, count(*) as "number of hits" FROM hits GROUP BY node1, node2 ORDER BY count(*) DESC </code></pre> you'll get <pre class="prettyprint"><code>node1 node2 number of hits 'ocp' 23345 'ocp' 'security' 1020 'ocp' 'food' 234 'weyland-yutani' 'products' 22 </code></pre> You could also store the url as it is and filter using regex. This is more flexible, but slower, because you don't have indexes. You need only to limit the whole length of the url, not the number of sub-nodes. I think you could do this with any database good enough to store large amount of data. For instance MySql.

What is the most efficient way to store and query trees?

Tags:

database-design

I need to analyze 1 TB+ of web access logs, and in particular I need to analyze statistics relating to requested URLs and subsets of the URLs (child branches). If possible, I want the queries to be fast over small subsets of the data (e.g. 10 million requests).

For example, given an access log with the following URLs being requested:

/ocp/about_us.html
/ocp/security/ed-209/patches/urgent.html
/ocp/security/rc/
/ocp/food/
/weyland-yutani/products/

I want to do queries such as:

Count the number of requests for everything 'below' /ocp.
Same as above, but only count requests for child nodes under /ocp/security
Return the top 5 most frequently requested URLs.
Same as above, except group by an arbitrary depth,

e.g. For the last query above, depth 2 for the data would return:

2: /ocp/security/
1: /ocp/
1: /ocp/food/
1: /weyland-yutani/products/

I think the ideal approach would probably be to use a column DB and tokenize the URLs such that there is a column for each element in the URL. However, I would really like to find a way to do this with open source apps if possible. HBase is a possibility, but query performance seems too slow to be useful for real-time queries (also, I don't really want to be in the business of re-implementing SQL)

I'm aware there are commercial apps for doing this this type of analytics, but for various reasons I want to implement this myself.

420

asked May 07 '09 20:05

Rob

2 Answers

Before investing too much time into designing a hierarchical data structure on top of a relational database, consider reading "Naive Trees" section (starting at slide 48) in the excellent presentation SQL Anti-Patterns Strike Back by Bill Karwin. Bill outlines the following methods for developing a hierarchy:

Path enumeration (slide 55)
Nested sets (slide 58)
Closure table (slide 68)

166

answered Nov 16 '22 14:11

Jake McGraw

Trees are generally not very efficient in databases. I mean: if you'd design the tree to be truly recursive, with items pointing to their parents, you'll get lots of queries to find all sub-nodes.

But you can optimize the tree, according to your needs.

Put any part of the url into a column is not a bad idea. You need to limit the depth to a certain number of sub-nodes. You could have indexes on any column, which makes it very fast.

Queries on such a structure are very simple:

Select count(*) From Hits where node1 = 'ocp' AND node2 = 'security';

Make a access statistic:

SELECT node1, node2, count(*) as "number of hits"
FROM hits 
GROUP BY node1, node2
ORDER BY count(*) DESC

you'll get

node1            node2        number of hits
'ocp'                        23345
'ocp'            'security'   1020
'ocp'            'food'        234
'weyland-yutani' 'products'     22

You could also store the url as it is and filter using regex. This is more flexible, but slower, because you don't have indexes. You need only to limit the whole length of the url, not the number of sub-nodes.

I think you could do this with any database good enough to store large amount of data. For instance MySql.

answered Nov 16 '22 15:11

Stefan Steinegger

Related questions
                            
                                Why am I getting a an error when creating a generated column in PostgreSQL?
                            
                                Best way to save only day and month in database
                            
                                How would you structure your entity model for storing arbitrary key/value data with different data types?
                            
                                What's the best way to store different images in the database?
                            
                                How to design shopping basket using session?
                            
                                Indexes on join tables
                            
                                How to learn about designing highly transactional systems?
                            
                                Table with only one column or add a numeric primary key?
                            
                                postgresql: data type for md5 message digest?
                            
                                Agile development and database changes [closed]
                            
                                Dynamic or column-ized tsvector index?
                            
                                Database warehouse design: fact tables and dimension tables
                            
                                How to CASCADE a delete from a child table to the parent table?
                            
                                Database performance: filtering on column vs. separate table
                            
                                Is there a speed difference in ordering by int vs. float?
                            
                                Database design, variable number of columns
                            
                                What is the difference between an Information Model and an Ontology?
                            
                                Is it better to store redundant information or join tables when necessary in MySQL?
                            
                                How to choose the clustered index in SQL Server?
                            
                                SQL Server - Performance/Size Drawbacks of Null Columns

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With