I currently have an application hosted on Google Cloud Platform that offers web analytics: it records session activity (clicks, downloads, etc.) and ties that activity to web registrations.
At the moment we store all of our click and session profile data in MySQL and use SQL queries to generate both aggregate and per-user reports. However, as the volume of data has grown, query response times have slowed noticeably, which in turn is slowing down page loads.
While investigating ways to solve this, I have looked at tools available on Google Cloud Platform such as Dataproc and Dataflow, as well as NoSQL solutions, but I am having a hard time understanding how our current setup would map onto any of them.
Currently, a rough idea of our data schema is as follows:
User table
- id
- name
- email
Profile table (web browser/device)
- id
- user id
- user agent string
Session table
- id
- profile id
- session string
Action table
- id
- session id
- action type
- action details
- timestamp
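For reference, here is roughly how that looks as MySQL DDL (the column types and keys are approximations, not our exact schema):

```sql
-- Rough sketch of the schema described above; types are assumptions.
CREATE TABLE user (
  id BIGINT PRIMARY KEY AUTO_INCREMENT,
  name VARCHAR(255),
  email VARCHAR(255)
);

CREATE TABLE profile (
  id BIGINT PRIMARY KEY AUTO_INCREMENT,
  user_id BIGINT NOT NULL,
  user_agent_string TEXT,
  FOREIGN KEY (user_id) REFERENCES user(id)
);

CREATE TABLE session (
  id BIGINT PRIMARY KEY AUTO_INCREMENT,
  profile_id BIGINT NOT NULL,
  session_string VARCHAR(255),
  FOREIGN KEY (profile_id) REFERENCES profile(id)
);

CREATE TABLE action (
  id BIGINT PRIMARY KEY AUTO_INCREMENT,
  session_id BIGINT NOT NULL,
  action_type VARCHAR(64),
  action_details TEXT,
  `timestamp` DATETIME NOT NULL,
  FOREIGN KEY (session_id) REFERENCES session(id)
);
```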
Based on my research, my understanding is that the best solution would be to store the action data in a NoSQL store such as Bigtable and feed it into something like Dataproc or Dataflow to generate the reports. However, our current schema is highly relational, which seems to rule out a NoSQL solution, since everything I have read indicates that you shouldn't move relational data into a NoSQL store.
My question is: is my understanding of how to apply these tools correct, or are there better solutions? Is it even necessary to move away from MySQL? And if not, what kind of solutions would allow us to pre-process/generate the reporting data in the background?
Sqoop is a command-line application that transfers data from an RDBMS into Hadoop. You create a single Sqoop import command that pulls data from a relational database such as MySQL into HDFS, and you run it on the command line of your cluster so the data is available for processing there.
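For example, a single import might look something like this (the host, database name, credentials, and paths are placeholders, and the exact flags you need will depend on your setup):

```
sqoop import \
  --connect jdbc:mysql://your-mysql-host/analytics \
  --username sqoop_user \
  --password-file /user/sqoop/.password \
  --table action \
  --target-dir /data/action \
  --num-mappers 4
```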
Assuming that the sessions and actions tables are insert-only (rows are never updated), the best approach would be to split the database into two parts: keep the user and profile tables in MySQL, and move the sessions and actions tables into BigQuery.
This way MySQL keeps serving the small, relational user/profile data, while BigQuery handles the large, append-only event data that your aggregate and per-user reports are built from.
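As a rough illustration (the dataset and table names here are made up), a daily aggregate report over the exported data then becomes a straightforward BigQuery query instead of a heavy MySQL query:

```sql
-- Hypothetical BigQuery query over exported action data;
-- `analytics.action` is a placeholder dataset.table name.
SELECT
  action_type,
  DATE(timestamp) AS action_date,
  COUNT(*) AS action_count
FROM `analytics.action`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY action_type, action_date
ORDER BY action_date, action_type;
```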
BigQuery is the best way here. But if you have extra resources and time available, you could look at storing the events in a NoSQL database and then running a Dataflow pipeline over it to extract the analytics data, which you would again need to store in a database for querying.
A couple of questions / potential solutions:
In general, the fastest way to work these questions out is to try it with your specific workloads, e.g. see how many of your typical requests (or random dashboards) you can serve on a development machine with a reasonable amount of RAM (or spin up a server / create a separate test database).
Also, if you're mostly used to relational databases, there will be some overhead in switching (particularly for bleeding-edge solutions), so you need to be fairly sure that the benefits outweigh the costs before you switch, or switch a little at a time so that you can switch back if it doesn't work out. Again, testing helps.
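One way to "switch a little at a time", and to pre-process reporting data in the background as you asked, is a summary table in MySQL that a scheduled job refreshes, so page loads read a small pre-aggregated table instead of scanning the raw action table. A minimal sketch (the table, columns, and grouping are only illustrative, not a drop-in solution):

```sql
-- Minimal sketch of a pre-aggregated reporting table in MySQL.
-- Names and columns are illustrative; adapt to your real reports.
CREATE TABLE action_daily_summary (
  action_date DATE NOT NULL,
  action_type VARCHAR(64) NOT NULL,
  action_count BIGINT NOT NULL,
  PRIMARY KEY (action_date, action_type)
);

-- Run periodically (e.g. from cron or a MySQL event) to refresh recent days.
REPLACE INTO action_daily_summary (action_date, action_type, action_count)
SELECT DATE(`timestamp`), action_type, COUNT(*)
FROM action
WHERE `timestamp` >= CURDATE() - INTERVAL 7 DAY
GROUP BY DATE(`timestamp`), action_type;
```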