Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Performance issue while reading data from hive using python

I have a table in hive with 351 837 (110 MB size) records and i am reading this table using python and writing into sql server.

In this process while reading data from hive into pandas dataframe it is taking long time. When i load entire records (351k) it takes 90 minutes.

To improve i went with following approach like reading 10k rows once from hive and writing into sql server. But reading 10k rows once from hive and assigning it to Dataframe is alone taking 4-5 minutes of time.

def execute_hadoop_export():
       """
       This will run the steps required for a Hadoop Export.  
       Return Values is boolean for success fail
       """
       try:

           hql='select * from db.table '
           # Open Hive ODBC Connection
           src_conn = pyodbc.connect("DSN=****",autocommit=True)
           cursor=src_conn.cursor()
           #tgt_conn = pyodbc.connect(target_connection)

           # Using SQLAlchemy to dynamically generate query and leverage dataframe.to_sql to write to sql server...
           sql_conn_url = urllib.quote_plus('DRIVER={ODBC Driver 13 for SQL Server};SERVER=Xyz;DATABASE=Db2;UID=ee;PWD=*****')
           sql_conn_str = "mssql+pyodbc:///?odbc_connect={0}".format(sql_conn_url)
           engine = sqlalchemy.create_engine(sql_conn_str)
           # read source table.
           vstart=datetime.datetime.now()
           for df in pandas.read_sql(hql, src_conn,chunksize=10000):

               vfinish=datetime.datetime.now()

               print 'Finished 10k rows reading from hive and it took', (vfinish-vstart).seconds/60.0,' minutes'
           # Get connection string for target from Ctrl.Connnection

               df.to_sql(name='table', schema='dbo', con=engine, chunksize=10000, if_exists="append", index=False) 
               print 'Finished 10k rows writing into sql server and it took', (datetime.datetime.now()-vfinish).seconds/60.0, ' minutes'
               vstart=datetime.datetime.now()
           cursor.Close()


       except Exception, e:
           print str(e)

output:

Result

What is the fastest way to read hive table data in python?

Update hive table structure

CREATE TABLE `table1`(
  `policynumber` varchar(15), 
  `unitidentifier` int, 
  `unitvin` varchar(150), 
  `unitdescription` varchar(100), 
  `unitmodelyear` varchar(4), 
  `unitpremium` decimal(18,2), 
  `garagelocation` varchar(150), 
  `garagestate` varchar(50), 
  `bodilyinjuryoccurrence` decimal(18,2), 
  `bodilyinjuryaggregate` decimal(18,2), 
  `bodilyinjurypremium` decimal(18,2), 
  `propertydamagelimits` decimal(18,2), 
  `propertydamagepremium` decimal(18,2), 
  `medicallimits` decimal(18,2), 
  `medicalpremium` decimal(18,2), 
  `uninsuredmotoristoccurrence` decimal(18,2), 
  `uninsuredmotoristaggregate` decimal(18,2), 
  `uninsuredmotoristpremium` decimal(18,2), 
  `underinsuredmotoristoccurrence` decimal(18,2), 
  `underinsuredmotoristaggregate` decimal(18,2), 
  `underinsuredmotoristpremium` decimal(18,2), 
  `umpdoccurrence` decimal(18,2), 
  `umpddeductible` decimal(18,2), 
  `umpdpremium` decimal(18,2), 
  `comprehensivedeductible` decimal(18,2), 
  `comprehensivepremium` decimal(18,2), 
  `collisiondeductible` decimal(18,2), 
  `collisionpremium` decimal(18,2), 
  `emergencyroadservicepremium` decimal(18,2), 
  `autohomecredit` tinyint, 
  `lossfreecredit` tinyint, 
  `multipleautopoliciescredit` tinyint, 
  `hybridcredit` tinyint, 
  `goodstudentcredit` tinyint, 
  `multipleautocredit` tinyint, 
  `fortyfivepluscredit` tinyint, 
  `passiverestraintcredit` tinyint, 
  `defensivedrivercredit` tinyint, 
  `antitheftcredit` tinyint, 
  `antilockbrakescredit` tinyint, 
  `perkcredit` tinyint, 
  `plantype` varchar(100), 
  `costnew` decimal(18,2), 
  `isnocontinuousinsurancesurcharge` tinyint)
CLUSTERED BY ( 
  policynumber, 
  unitidentifier) 
INTO 50 BUCKETS

Note: I have also tried with sqoop export option but my hive table is already in bucketting format.

like image 870
Tharunkumar Reddy Avatar asked Jul 08 '19 09:07

Tharunkumar Reddy


Video Answer


1 Answers

What is the best way to read the output from disk with Pandas after using cmd.get_results ? (e.g. from a Hive command). For example, consider the following:

out_file = 'results.csv'
delimiter = chr(1)
....

Qubole.configure(qubole_key)
hc_params = ['--query', query]
hive_args = HiveCommand.parse(hc_params)
cmd = HiveCommand.run(**hive_args)
if (HiveCommand.is_success(cmd.status)):
    with open(out_file, 'wt') as writer:
        cmd.get_results(writer, delim=delimiter, inline=False)

If, after successfully running the query, I then inspect the first few bytes of results.csv, I see the following: $ head -c 300 results.csv b'flight_uid\twinning_price\tbid_price\timpressions_source_timestamp\n'b'0FY6ZsrnMy\x012000\x012270.0\x011427243278000\n0FamrXG9AW\x01710\x01747.0\x011427243733000\n0FY6ZsrnMy\x012000\x012270.0\x011427245266000\n0FY6ZsrnMy\x012000\x012270.0\x011427245088000\n0FamrXG9AW\x01330\x01747.0\x011427243407000\n0FamrXG9AW\x01710\x01747.0\x011427243981000\n0FamrXG9AW\x01490\x01747.0\x011427245289000\n When I try to open this in Pandas:

df = pd.read_csv('results.csv')

it obviously doesn't work (I get an empty DataFrame), since it isn't properly formatted as a csv file. While I could try to open results.csv and post-process it (to remove b', etc.) before I open it in Pandas, this would be a quite hacky way to load it. Am I using the interface correctly? This is using the very last version of qds_sdk: 1.4.2 from a three hours ago.

like image 175
Song Avatar answered Sep 22 '22 04:09

Song