
Difference between Text and String in Hadoop

What is the difference between org.apache.hadoop.io.Text and java.lang.String in the Hadoop framework?

Why couldn't they use String instead of introducing a new Text class?

I investigated the difference and found out it has to do with the encoding format; however I don't understand it yet.

Can someone explain the differences (with examples, if applicable)?

asked Nov 08 '13 04:11 by Lokesh




1 Answer

The binary representation of a Text object is a variable length integer containing the number of bytes in the UTF-8 representation of the string, followed by the UTF-8 bytes themselves.
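The layout described above can be sketched with the JDK alone. This is only an illustration, not Hadoop's code: the class and method names are hypothetical, and the length prefix is modelled on my understanding of Hadoop's WritableUtils.writeVLong for non-negative values, so treat the exact marker bytes as an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of "variable-length byte count, then UTF-8 bytes".
final class TextLayoutSketch {

    // Writes a non-negative length as a variable-length integer.
    // Values up to 127 take one byte; larger values get a marker byte
    // followed by the value's significant bytes, most significant first.
    static void writeLength(ByteArrayOutputStream out, int n) {
        if (n < 0) {
            throw new IllegalArgumentException("length must be non-negative");
        }
        if (n <= 127) {
            out.write(n);
            return;
        }
        int size = 0;
        for (int tmp = n; tmp != 0; tmp >>>= 8) {
            size++; // count the bytes needed to hold n
        }
        out.write(-112 - size); // marker byte (assumed to match Hadoop's scheme)
        for (int i = size - 1; i >= 0; i--) {
            out.write((n >>> (8 * i)) & 0xFF);
        }
    }

    // Serializes a string as: length prefix, then standard UTF-8 bytes.
    static byte[] serialize(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLength(out, utf8.length);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // "hadoop" is 6 UTF-8 bytes, so the prefix fits in a single byte.
        System.out.println(serialize("hadoop").length);
    }
}
```

Note that the prefix counts bytes, not characters, so a string containing non-ASCII characters has a prefix larger than its String.length().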

Text is a replacement for the UTF8 class, which was deprecated because it didn’t support strings whose encoding was over 32,767 bytes, and because it used Java’s modified UTF-8.

Because Text uses standard UTF-8, it is potentially easier to interoperate with other tools that understand UTF-8.
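To see how Java's modified UTF-8 differs from standard UTF-8, here is a small JDK-only comparison. It uses DataOutputStream.writeUTF as a stand-in for modified UTF-8 (it is not Hadoop's old UTF8 class), and the class name is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch comparing encoded sizes in the two UTF-8 variants.
final class Utf8VariantsSketch {

    // Number of bytes in the standard UTF-8 encoding.
    static int standardLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes in Java's modified UTF-8 encoding.
    static int modifiedLength(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try {
            new DataOutputStream(buf).writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size() - 2; // writeUTF prepends a 2-byte length field
    }

    public static void main(String[] args) {
        String nul = "\u0000";            // U+0000
        String astral = "\uD83D\uDE00";   // one code point outside the BMP
        System.out.println(standardLength(nul) + " vs " + modifiedLength(nul));
        System.out.println(standardLength(astral) + " vs " + modifiedLength(astral));
    }
}
```

The two variants disagree on U+0000 (one byte versus two) and on supplementary characters (four bytes versus six, because modified UTF-8 encodes each surrogate separately), which is why data written in modified UTF-8 can trip up tools that expect the standard form.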

In brief, Text differs from String in the following ways:

Indexing: Because of its emphasis on using standard UTF-8, there are some differences between Text and the Java String class. Indexing for the Text class is in terms of position in the encoded byte sequence, not the Unicode character in the string, or the Java char code unit (as it is for String).

For instance, charAt() returns an int representing a Unicode code point, unlike the String variant that returns a char.
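The gap between byte offsets and char indexes can be shown without Hadoop. The sketch below computes, for a plain String, the byte offset at which a substring would start in its UTF-8 encoding; that is the kind of position Text works with. The class and helper names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: char index (String) versus UTF-8 byte offset (Text-style).
final class ByteOffsetSketch {

    // Returns the UTF-8 byte offset of the first occurrence of needle,
    // or -1 if it does not occur.
    static int byteOffsetOf(String haystack, String needle) {
        int charIndex = haystack.indexOf(needle);
        if (charIndex < 0) {
            return -1;
        }
        // Bytes needed to encode everything before the match.
        return haystack.substring(0, charIndex)
                .getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Four code points that take 1, 2, 3 and 4 bytes in UTF-8.
        String s = "\u0041\u00DF\u6771\uD801\uDC00";
        System.out.println("chars: " + s.length());
        System.out.println("bytes: " + s.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("char index:  " + s.indexOf("\u6771"));
        System.out.println("byte offset: " + byteOffsetOf(s, "\u6771"));
    }
}
```

For this string, String reports 5 chars while the UTF-8 encoding is 10 bytes long, and the third character sits at char index 2 but byte offset 3.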

Iteration: Iterating over the Unicode characters in Text is complicated by the use of byte offsets for indexing, since you can’t just increment the index.
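A rough sketch of why the index cannot simply be incremented: each code point occupies one to four bytes, so the loop has to advance by the width of the sequence it just decoded. The decoder below is a hypothetical JDK-only helper that assumes well-formed UTF-8 input; with Hadoop itself you would more likely wrap the bytes in a ByteBuffer and use Text.bytesToCodePoint.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: walking code points over a UTF-8 byte array.
final class CodePointWalkSketch {

    // Width in bytes of the sequence that starts with this lead byte.
    static int sequenceLength(byte lead) {
        if ((lead & 0x80) == 0) return 1;     // 0xxxxxxx
        if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
        if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
        return 4;                             // 11110xxx
    }

    // Decodes the code point starting at offset (input assumed valid).
    static int decodeAt(byte[] utf8, int offset) {
        int len = sequenceLength(utf8[offset]);
        if (len == 1) {
            return utf8[offset];
        }
        int cp = utf8[offset] & (0xFF >> (len + 1)); // payload bits of lead byte
        for (int i = 1; i < len; i++) {
            cp = (cp << 6) | (utf8[offset + i] & 0x3F); // 6 bits per continuation
        }
        return cp;
    }

    // Byte offsets at which each code point starts.
    static List<Integer> offsets(byte[] utf8) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < utf8.length; i += sequenceLength(utf8[i])) {
            result.add(i);
        }
        return result;
    }

    // Code points in order of appearance.
    static List<Integer> codePoints(byte[] utf8) {
        List<Integer> result = new ArrayList<>();
        for (int offset : offsets(utf8)) {
            result.add(decodeAt(utf8, offset));
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u0041\u00DF\u6771\uD801\uDC00"
                .getBytes(StandardCharsets.UTF_8);
        for (int offset : offsets(utf8)) {
            System.out.println(offset + " -> U+"
                    + Integer.toHexString(decodeAt(utf8, offset)).toUpperCase());
        }
    }
}
```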

Mutable: Another difference from String is that Text is mutable (like all Writable implementations in Hadoop, except NullWritable, which is a singleton). You can reuse a Text instance by calling one of the set() methods on it.

Resorting to String:

Text doesn’t have as rich an API for manipulating strings as java.lang.String, so in many cases you need to convert the Text object to a String. This is done in the usual way, by calling its toString() method.

For more details, see Hadoop: The Definitive Guide.

answered Sep 22 '22 07:09 by SSaikia_JtheRocker