 

Moving from Relational Database to Big Data

Currently I have an application hosted on Google Cloud Platform that offers web analytics: it records session activity (clicks, downloads, etc.) and ties that web activity to web registrations.

At the moment we store all of our click and session profile data in MySQL and use SQL queries to generate both aggregate and per-user reports. However, as the amount of data has grown, we are seeing a real slow-down in query responses, which in turn is slowing down page-load times.

While investigating ways to solve this problem, we have looked into tools available on Google Cloud Platform such as Dataproc and Dataflow, as well as NoSQL solutions. However, I am having a hard time understanding how our current setup could be mapped onto any of these tools.

Currently, a rough idea of our data schema is as follows:

User table
- id
- name
- email

Profile table (web browser/device)
- id
- user id
- user agent string

Session table
- id
- profile id
- session string

Action table
- id
- session id
- action type
- action details
- timestamp
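
Expressed as SQLAlchemy models purely for illustration (the column types, lengths, and table names below are approximations, not our exact definitions), the schema looks roughly like this:

    # Illustrative only: column types/lengths are assumptions, not the real schema.
    from sqlalchemy import Column, DateTime, ForeignKey, Integer, String, Text
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class User(Base):
        __tablename__ = "users"
        id = Column(Integer, primary_key=True)
        name = Column(String(255))
        email = Column(String(255))

    class Profile(Base):                      # one row per web browser/device
        __tablename__ = "profiles"
        id = Column(Integer, primary_key=True)
        user_id = Column(Integer, ForeignKey("users.id"))
        user_agent = Column(Text)             # user agent string

    class Session(Base):
        __tablename__ = "sessions"
        id = Column(Integer, primary_key=True)
        profile_id = Column(Integer, ForeignKey("profiles.id"))
        session_string = Column(String(255))

    class Action(Base):                       # clicks, downloads, etc.
        __tablename__ = "actions"
        id = Column(Integer, primary_key=True)
        session_id = Column(Integer, ForeignKey("sessions.id"))
        action_type = Column(String(64))
        action_details = Column(Text)
        occurred_at = Column(DateTime)        # the "timestamp" column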

Based on my research, my understanding is that the best approach would be to store action data in a NoSQL database such as Bigtable and feed that data into a tool like Dataproc or Dataflow to generate the reports. However, our current schema is a highly relational structure, which seems to rule out moving to a NoSQL solution, since all my research indicates that you shouldn't move relational data into a NoSQL database.

My question is: is my understanding of how to apply these tools correct, or are there better solutions? Is it even necessary to consider moving away from MySQL? And if not, what kind of solutions are available that would allow us to pre-process or generate reporting data in the background?

asked Jun 12 '17 by Damon Swayn



2 Answers

Assuming that rows in the sessions and actions tables are only ever inserted, never updated, the best approach would be to split the data store in two: keep the MySQL DB for the user and profile tables, and use BigQuery for the actions and sessions tables. (A rough sketch of what ingestion and reporting against BigQuery could look like follows the list below.)

This way you get the following:

  • you minimize the amount of change you have to make on either side (data ingestion and extraction)
  • you will significantly reduce the cost of data storage
  • query times will significantly improve
  • before you know it, you will be in big data territory, and BigQuery is built for exactly that
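
For example, the ingestion side could be as simple as streaming each action event into BigQuery and keeping the reports as plain SQL run there instead of on MySQL (a rough sketch; the project, dataset and table names are placeholders):

    # Rough sketch: stream action events into BigQuery and report with SQL.
    # "my-project.analytics.actions" is a placeholder table; adjust to your setup.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")

    # Ingestion: stream one row per click/download straight from the web app.
    rows = [{
        "session_id": 12345,
        "action_type": "click",
        "action_details": "download-report-button",
        "occurred_at": "2017-06-12T02:06:00Z",
    }]
    errors = client.insert_rows_json("my-project.analytics.actions", rows)
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")

    # Reporting: aggregate queries stay plain SQL, just run on BigQuery.
    query = """
        SELECT session_id, action_type, COUNT(*) AS actions
        FROM `my-project.analytics.actions`
        GROUP BY session_id, action_type
    """
    for row in client.query(query):
        print(row["session_id"], row["action_type"], row["actions"])

Joining against the user and profile data that stays in MySQL can then happen either in the application layer or by periodically copying those (small) tables into BigQuery.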

BigQuery is the best way. But if you have spare resources and time available, you can look into storing the data in a NoSQL database, then running a pipeline job on it with Dataflow to extract the analytics data, which you will again need to store in a database for querying purposes.
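
If you did go down the Dataflow route, the pipeline itself is not much code. Here is a minimal Apache Beam sketch of that kind of aggregation job; for simplicity it reads the raw actions from BigQuery rather than a NoSQL store, and all project, bucket and table names are placeholders:

    # Minimal Dataflow (Apache Beam) sketch: roll raw actions up into per-session counts.
    # Reads from BigQuery for simplicity; names below are placeholders.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",          # use "DirectRunner" to test locally
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
    )

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadActions" >> beam.io.ReadFromBigQuery(
                query="SELECT session_id FROM `my-project.analytics.actions`",
                use_standard_sql=True)
            | "KeyBySession" >> beam.Map(lambda row: (row["session_id"], 1))
            | "CountPerSession" >> beam.CombinePerKey(sum)
            | "ToRow" >> beam.Map(lambda kv: {"session_id": kv[0], "action_count": kv[1]})
            | "WriteReport" >> beam.io.WriteToBigQuery(
                "my-project:analytics.session_action_counts",
                schema="session_id:INTEGER,action_count:INTEGER",
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
        )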

answered by Vikram Tiwari


A couple of questions / potential solutions:

  1. Profile! If it's the same queries thrashing the database, then optimising your queries or caching some of the results for your most frequent pages can help offload processing. Ditto for database settings, RAM, etc.
  2. How big is your database? If it's less than 64GB, scaling up to a larger server where the database can fit into RAM could be a quick win.
  3. How is your data being used? If it's purely for historical data, you could potentially reduce your clicks down into a lookup table, eg. actions per session per week or per user per week (see the roll-up sketch after this list). If the data is only collated every 5 minutes / hour, downloading the raw data and processing it like this locally can work too.
  4. You can denormalise, eg. combine user agent|session|action type|details|timestamp into one row, but you potentially increase your storage requirements and lookup time.
  5. Alternatively, more normalisation can help too. Breaking out the user agent string into its own table will reduce that table's data requirements and might speed things up.
  6. It seems like your data could be split up / sharded by user, so that could be another option.
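
For option 3, the roll-up can be as simple as a scheduled job that rebuilds a summary table. A sketch of what that could look like (the summary table, connection details and column names here are only guesses based on the schema in the question):

    # Sketch of a scheduled roll-up job: actions per user per week into a summary table.
    # Table/column names are guesses based on the schema in the question.
    import mysql.connector

    ROLLUP_SQL = """
        INSERT INTO weekly_user_actions (user_id, year_week, action_count)
        SELECT u.id, YEARWEEK(a.occurred_at), COUNT(*)
        FROM actions a
        JOIN sessions s ON s.id = a.session_id
        JOIN profiles p ON p.id = s.profile_id
        JOIN users u    ON u.id = p.user_id
        GROUP BY u.id, YEARWEEK(a.occurred_at)
        ON DUPLICATE KEY UPDATE action_count = VALUES(action_count)
    """

    def rebuild_weekly_rollup():
        # Run this from cron (or Cloud Scheduler) so dashboards read the small
        # summary table instead of scanning the raw actions table on every page load.
        conn = mysql.connector.connect(
            host="localhost", user="analytics", password="secret", database="analytics")
        try:
            cur = conn.cursor()
            cur.execute(ROLLUP_SQL)
            conn.commit()
            cur.close()
        finally:
            conn.close()

    if __name__ == "__main__":
        rebuild_weekly_rollup()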

In general, the fastest way to answer these questions is to try things against your specific workloads, eg. see how many of your typical requests (or random dashboards) you can serve on a development machine with a reasonable amount of RAM (or spin up a server / create a separate test database).

Also, if you're mostly used to relational databases, there'll be some overhead in switching (particularly for bleeding-edge solutions), so you need to be fairly sure that the benefits outweigh the costs before you switch, or switch a little bit at a time so that you can switch back if it doesn't work out. Again, testing helps.

answered by Anthony Briggs