 

Merge multiple small files into a few larger files in Spark

I am using Hive through Spark, and my Spark code contains an INSERT INTO query against a partitioned table. The input data is 200+ GB. When Spark writes to the partitioned table, it spits out very small files (a few KB each), so the output partitioned table folder now has 5000+ tiny files. I want to merge these into a few larger files of roughly 200 MB each. I tried using the Hive merge settings, but they don't seem to work.

val result7A = hiveContext.sql("set hive.exec.dynamic.partition=true")
val result7B = hiveContext.sql("set hive.exec.dynamic.partition.mode=nonstrict")
val result7C = hiveContext.sql("SET hive.merge.size.per.task=256000000")
val result7D = hiveContext.sql("SET hive.merge.mapfiles=true")
val result7E = hiveContext.sql("SET hive.merge.mapredfiles=true")
val result7F = hiveContext.sql("SET hive.merge.sparkfiles = true")
val result7G = hiveContext.sql("set hive.aux.jars.path=c:\\Applications\\json-serde-1.1.9.3-SNAPSHOT-jar-with-dependencies.jar")

val result8 = hiveContext.sql("INSERT INTO TABLE partition_table PARTITION (date) select a,b,c from partition_json_table")

The above Hive settings work in a MapReduce Hive execution and produce files of the specified size. Is there any option to do this in Spark or Scala?

dheee asked Jun 23 '15


People also ask

How do I combine small files in Hadoop?

The Hadoop -getmerge command merges multiple files in HDFS (Hadoop Distributed File System) into one single output file in the local file system. For example, it can merge two files present in HDFS, file1.txt and file2.txt, into a single output file.
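For illustration, a minimal sketch of invoking that command from Scala via sys.process; the HDFS directory and local file paths below are hypothetical:

import scala.sys.process._

// -getmerge concatenates every file under the HDFS directory into one file
// on the local file system; both paths are made-up example values.
val hdfsDir   = "/user/hive/warehouse/partition_table/date=2015-06-23"
val localFile = "/tmp/partition_merged"
Seq("hadoop", "fs", "-getmerge", hdfsDir, localFile).!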

What is coalesce in Spark?

What is coalesce? The coalesce method reduces the number of partitions in a DataFrame. Coalesce avoids a full shuffle: instead of creating new partitions, it redistributes the data into existing partitions, which means it can only decrease the number of partitions.
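As a hedged sketch of how coalesce could be applied to the question's insert (the target of 10 partitions is an arbitrary assumption; table and column names follow the question, and the partition column date is selected last, which a dynamic partition insert requires):

// Read the source table, reduce the number of partitions without a full
// shuffle, and expose the result to SQL so the partitioned insert reuses it.
val compacted = hiveContext.sql("SELECT a, b, c, date FROM partition_json_table").coalesce(10)
compacted.registerTempTable("partition_json_coalesced")

hiveContext.sql(
  "INSERT INTO TABLE partition_table PARTITION (date) " +
  "SELECT a, b, c, date FROM partition_json_coalesced")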

Can we merge parquet files?

If you only want to combine the files from a single partition, you can copy the data to a different table, drop the old partition, then insert into the new partition to produce a single compacted partition.
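A hedged sketch of that compaction pattern run through the same hiveContext; the staging table name and the partition value '2015-06-23' are hypothetical:

// 1. Copy the rows of one partition to a staging table.
hiveContext.sql(
  "CREATE TABLE staging_table AS " +
  "SELECT a, b, c FROM partition_table WHERE date = '2015-06-23'")

// 2. Drop the old, fragmented partition.
hiveContext.sql("ALTER TABLE partition_table DROP PARTITION (date = '2015-06-23')")

// 3. Re-insert the staged rows, producing a freshly written, compacted partition.
hiveContext.sql(
  "INSERT INTO TABLE partition_table PARTITION (date = '2015-06-23') " +
  "SELECT a, b, c FROM staging_table")

// 4. Clean up the staging table.
hiveContext.sql("DROP TABLE staging_table")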


1 Answer

I had the same issue. The solution was to add a DISTRIBUTE BY clause with the partition columns. This ensures that all data for one partition goes to a single reducer. Example in your case:

INSERT INTO TABLE partition_table PARTITION (date) SELECT a, b, c FROM partition_json_table DISTRIBUTE BY date
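Run through the question's hiveContext, this could look like the sketch below; the partition column date is added to the SELECT list on the assumption that the source table has such a column, since a dynamic partition insert needs it selected last:

// DISTRIBUTE BY date routes all rows for a given date to the same reducer,
// so each partition directory is written by one task and gets fewer,
// larger files.
val result8 = hiveContext.sql(
  "INSERT INTO TABLE partition_table PARTITION (date) " +
  "SELECT a, b, c, date FROM partition_json_table DISTRIBUTE BY date")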
Jussi Kujala answered Oct 12 '22