
Working with input splits (HADOOP)

I have a .txt file as follows:


This is xyz

This is my home

This is my PC

This is my room

This is ubuntu PC xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxxxxxxxxxxxxxxxxxxx


(ignoring the blank line after each record)

I have set the block size to 64 bytes. What I am trying to check is whether a single record can end up broken across two blocks.

Now logically, since the block size is 64 bytes, after uploading the file to HDFS it should create 3 blocks of 64, 64, and 27 bytes respectively, which it does. And since the size of the first block is 64 bytes, it should contain the following data only:


This is xyz

This is my home

This is my PC

This is my room

Th
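As a quick sanity check of that arithmetic (a sketch using my estimated sizes, not measured ones):

file_size = 155   # estimated: 64 + 64 + 27
block_size = 64
# HDFS cuts blocks at fixed byte offsets, regardless of record boundaries.
blocks = [min(block_size, file_size - off)
          for off in range(0, file_size, block_size)]
print(blocks)     # [64, 64, 27]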


Now I want to verify whether the first block really looks like this. If I browse HDFS via the web UI and download the file, it downloads the entire file, not a single block.

So I decided to run a MapReduce job that would display only the record values (setting reducers=0, emitting the mapper output as context.write(null, record_value), and changing the default output delimiter to "").

While running the job, the job counters show 3 splits, which is expected. But after completion, when I check the output directory, it shows 3 mapper output files, of which 2 are empty, and the first mapper output file has all of the content of the file as it is.

Can anyone help me with this? Is it possible that newer versions of Hadoop handle incomplete records automatically?

asked Mar 16 '17 by User9523




1 Answer

Steps followed to reproduce the scenario:
1) Created a file sample.txt with the following content, ~153B in total:

cat sample.txt

This is xyz
This is my home
This is my PC
This is my room
This is ubuntu PC xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxxxxxxxxxxxxxxxxxxx

2) Added this property to hdfs-site.xml, since the default minimum block size (dfs.namenode.fs-limits.min-block-size, 1MB) would otherwise reject a 64B block size:

<property>
    <name>dfs.namenode.fs-limits.min-block-size</name>
    <value>10</value>
</property>

and loaded the file into HDFS with a block size of 64B (dfs.bytes-per-checksum is also lowered to 16, since the block size must be a multiple of the checksum chunk size, and 64 is not a multiple of the default 512):

hdfs dfs -Ddfs.bytes-per-checksum=16 -Ddfs.blocksize=64 -put sample.txt /

This created three blocks of sizes 64B, 64B and 25B.

Content in Block0:

This is xyz
This is my home
This is my PC
This is my room
This i

Content in Block1:

s ubuntu PC xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xx

Content in Block2:

xx xxxxxxxxxxxxxxxxxxxxx
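As a quick check (a minimal sketch, not part of the original steps), slicing sample.txt into 64-byte chunks locally reproduces the block contents above, since HDFS cuts at exact byte offsets with no regard for line boundaries:

#!/usr/bin/env python
# Slice the local copy of sample.txt into 64-byte chunks and print them;
# each chunk should match the corresponding HDFS block shown above.
BLOCK_SIZE = 64

with open('sample.txt', 'rb') as f:
    data = f.read()

for i, start in enumerate(range(0, len(data), BLOCK_SIZE)):
    chunk = data[start:start + BLOCK_SIZE]
    print('Block%d (%dB): %r' % (i, len(chunk), chunk))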

3) A simple mapper.py:

#!/usr/bin/env python
# Identity mapper for Hadoop Streaming: echo each stdin line to stdout.
import sys

for line in sys.stdin:
    print(line)
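The mapper can be sanity-checked locally before submitting the job (a quick aside, not part of the original steps):

cat sample.txt | python mapper.py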

4) Hadoop Streaming with 0 reducers:

yarn jar hadoop-streaming-2.7.1.jar -Dmapreduce.job.reduces=0 -file mapper.py -mapper mapper.py -input /sample.txt -output /splittest

The job ran with 3 input splits, invoking 3 mappers, and generated 3 output files: one holding the entire content of sample.txt and the other two empty (0B).

hdfs dfs -ls /splittest

-rw-r--r--   3 user supergroup          0 2017-03-22 11:13 /splittest/_SUCCESS
-rw-r--r--   3 user supergroup        168 2017-03-22 11:13 /splittest/part-00000
-rw-r--r--   3 user supergroup          0 2017-03-22 11:13 /splittest/part-00001
-rw-r--r--   3 user supergroup          0 2017-03-22 11:13 /splittest/part-00002

The file sample.txt is divided into 3 splits, which are assigned to the mappers as follows:

mapper1: start=0, length=64B
mapper2: start=64, length=64B
mapper3: start=128, length=25B
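A minimal sketch of how these (start, length) pairs follow from the file size and split size (simplified; the real FileInputFormat also allows the last split to run slightly over before cutting a new one):

def compute_splits(file_size, split_size):
    # Walk the file in split_size steps; the final split gets the remainder.
    splits = []
    start = 0
    while start < file_size:
        splits.append((start, min(split_size, file_size - start)))
        start += split_size
    return splits

print(compute_splits(153, 64))   # [(0, 64), (64, 64), (128, 25)]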

The split assignment only determines which portion of the file each mapper has to read; it is not necessarily exact. What a mapper actually reads is determined by the InputFormat and its record boundaries, here TextInputFormat.

TextInputFormat uses LineRecordReader to read the content of each split, with \n as the delimiter (line boundary). For a file that isn't compressed, each mapper reads its lines as explained below.

For the mapper whose start index is 0, line reading starts at the beginning of the split. If the split ends with a \n, reading stops at the split boundary; otherwise the reader looks for the first \n past the end of the assigned split (here 64B), so that it never processes a partial line.

For all the other mappers (start index != 0), the reader checks whether the character immediately before its start index (start - 1) is a \n. If it is, it reads from the start of the split; otherwise it skips everything between its start index and the first \n it finds in the split (that content is handled by the previous mapper) and starts reading just after that \n.

Here, mapper1 (start index 0) starts with Block0, whose split ends in the middle of a line. It therefore keeps reading that line, which consumes the whole of Block1; and since Block1 contains no \n either, mapper1 continues until it finds one, consuming the whole of Block2 as well. That is how the entire content of sample.txt ended up in a single mapper's output.

For mapper2 (start index != 0), the character preceding its start index is not a \n, so it skips ahead to the next line boundary; since its whole split lies in the middle of a single line, it ends up with no content and an empty output. mapper3 hits the identical scenario.
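The behaviour described above can be simulated locally with a short sketch (my simplified model of LineRecordReader, ignoring compression and custom delimiters; not Hadoop's actual code):

#!/usr/bin/env python
# Simulate how each mapper's record reader decides what to read
# from its split, following the rules described above.
def read_split(data, start, length):
    end = start + length
    pos = start
    if start != 0 and data[start - 1:start] != b'\n':
        # Skip the partial first line; it belongs to the previous mapper.
        nl = data.find(b'\n', start)
        if nl == -1 or nl >= end:
            return []              # the whole split is mid-line
        pos = nl + 1
    lines = []
    while pos < end:
        nl = data.find(b'\n', pos)
        if nl == -1:
            lines.append(data[pos:])
            break
        # A line crossing the split boundary is still read to completion.
        lines.append(data[pos:nl + 1])
        pos = nl + 1
    return lines

with open('sample.txt', 'rb') as f:
    data = f.read()

for start, length in [(0, 64), (64, 64), (128, len(data) - 128)]:
    print('split@%d -> %d line(s)' % (start, len(read_split(data, start, length))))

For the sample.txt above this prints 5 lines for the first split and 0 for the other two, matching the job output.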


Try changing the content of sample.txt as follows to see a different result; the line breaks now fall nearer the split boundaries, so each mapper should end up with lines of its own to output:
This is xyz
This is my home
This is my PC
This is my room
This is ubuntu PC xxxx xxxx xxxx xxxx 
xxxx xxxx xxxx xxxx xxxx xxxx xxxx 
xxxxxxxxxxxxxxxxxxxxx
answered Oct 19 '22 by franklinsijo