I'm just getting started with Hadoop, and I'm wondering about the following. Suppose I have a bunch of large MySQL production tables that I want to analyze. <ol> <li>It seems like I have to dump all the tables into text files in order to bring them into the Hadoop filesystem. Is this correct, or is there some way that Hive or Pig or whatever can access the data from MySQL directly?</li> <li>If I'm dumping all the production tables into text files, do I need to worry about affecting production performance during the dump? (Does it depend on what storage engine the tables are using? What do I do if so?)</li> <li>Is it better to dump each table into a single file, or to split each table into 64 MB (or whatever my block size is) files?</li> </ol>


Pulling data from MySQL into Hadoop

2 Answers

Importing data from MySQL is straightforward. I recommend Cloudera's Hadoop distribution, which ships with a tool called Sqoop. It provides a simple interface for importing data straight from MySQL (other databases are supported too). Sqoop can pull the data either through mysqldump or through an ordinary MySQL query (SELECT * ...), so there is no need to manually partition tables into files. In any case, Hadoop works much better with one big file than with many small ones.
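As a rough illustration of what such an import looks like, here is a minimal Python sketch that only assembles the argument list for `sqoop import`; it does not run anything. The flags used (`--connect`, `--username`, `--table`, `--target-dir`, `--num-mappers`, `--password-file`, `--direct`) are from the Sqoop 1 user guide, where `--direct` is the mysqldump-based fast path. The host, database, user, table, and paths are made up for the example, and `build_sqoop_import` is a hypothetical helper, not part of Sqoop.

```python
from typing import List, Optional


def build_sqoop_import(jdbc_url: str, username: str, table: str,
                       target_dir: str, num_mappers: int = 4,
                       password_file: Optional[str] = None,
                       direct: bool = False) -> List[str]:
    """Build (but do not run) an argument list for `sqoop import`."""
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,        # JDBC URL of the MySQL server
        "--username", username,
        "--table", table,             # source table in MySQL
        "--target-dir", target_dir,   # destination directory in HDFS
        "--num-mappers", str(num_mappers),  # number of parallel map tasks
    ]
    if password_file:
        # Read the password from a file instead of the command line.
        args += ["--password-file", password_file]
    if direct:
        # Use the mysqldump-based fast path instead of plain JDBC queries.
        args.append("--direct")
    return args


# All values below are hypothetical placeholders.
cmd = build_sqoop_import(
    "jdbc:mysql://db-replica.example.com/shop",
    "etl_user", "orders", "/data/raw/orders", direct=True)
print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` on a machine where Sqoop is installed. Sqoop itself parallelizes the read across the mappers, which is why no manual splitting of the table is needed.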

Useful links:
Sqoop User Guide

103

answered Oct 27 '22 00:10

wlk

2)
Since I don't know your environment, I will err on the safe side: YES, worry about affecting production performance.

Depending on the frequency and quantity of data being written, you may find that the dump completes in an acceptable amount of time, particularly if you are only writing out new or changed data (subject to the complexity of your queries).

If you don't require real-time data, or your servers typically have periods when they are underutilized (overnight?), then you could create the files during those periods.

Depending on how your environment is set up, you could replicate or log ship to dedicated DB server(s) whose sole job is to create your data file(s).
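To make the "new/changed data from a replica" idea concrete, here is a minimal Python sketch that assembles a `mysqldump` command line without running it. It assumes the tables are InnoDB, since `--single-transaction` only gives a consistent dump without locking for transactional engines (MyISAM tables would still be locked during the dump). The `updated_at` column, the replica host name, and the `build_mysqldump` helper are hypothetical. Note that `mysqldump` emits SQL statements by default, so this only illustrates limiting the load on production, not the final file format.

```python
from typing import List, Optional


def build_mysqldump(host: str, database: str, table: str,
                    since: Optional[str] = None,
                    timestamp_column: str = "updated_at") -> List[str]:
    """Build (but do not run) a low-impact mysqldump argument list."""
    args = [
        "mysqldump",
        f"--host={host}",         # point at a replica, not the primary
        "--single-transaction",   # consistent snapshot without locks (InnoDB)
        "--quick",                # stream rows instead of buffering the table
    ]
    if since is not None:
        # Dump only rows changed after the last run (hypothetical column).
        args.append(f"--where={timestamp_column} > '{since}'")
    args += [database, table]
    return args


# Hypothetical host, schema, and watermark from the previous run.
cmd = build_mysqldump("db-replica.example.com", "shop", "orders",
                      since="2024-01-01 00:00:00")
print(cmd)
```

Scheduling this from cron during the overnight quiet period, against the replica, keeps both the read load and any locking away from the production primary.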

3)
No need to split the file yourself: HDFS takes care of partitioning the data file into blocks and replicating them across the cluster. By default it automatically splits files into 64 MB blocks (128 MB in Hadoop 2 and later).
See the Apache HDFS Architecture guide.
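The block arithmetic is simple enough to sketch in a few lines of Python. This is only a back-of-the-envelope model: it assumes the 64 MB block size mentioned above and a replication factor of 3 (the usual HDFS default), and the function names are illustrative rather than any Hadoop API.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def hdfs_block_count(file_size_bytes: int,
                     block_size_bytes: int = 64 * MIB) -> int:
    """Number of HDFS blocks a file of the given size occupies."""
    # Ceiling division: the last block may be only partially filled.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


def raw_storage_bytes(file_size_bytes: int, replication: int = 3) -> int:
    """Raw cluster storage used once every block is replicated."""
    # A partial last block only occupies its actual size on disk.
    return file_size_bytes * replication


dump_size = 10 * GIB  # a hypothetical 10 GiB table dump
print(hdfs_block_count(dump_size))             # with 64 MiB blocks
print(hdfs_block_count(dump_size, 128 * MIB))  # with 128 MiB blocks
print(raw_storage_bytes(dump_size) // GIB)     # GiB of raw storage used
```

A single 10 GiB dump therefore becomes many blocks that can be processed in parallel, which is why pre-splitting the table into block-sized files buys nothing.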

Re: Wojtek's answer, see Sqoop (posted here because links don't work in comments).

If you have more questions or can share specifics about your environment, let us know. HTH, Ralph

answered Oct 26 '22 23:10

Ralph Willgoss


Tags:

mysql

hadoop

grautur
