Would you have any hints on the best way to deal with files containing JSON entries in Hadoop?
JSON Records are files in which each line is its own JSON datum. The metadata is stored with the data, and the file is splittable, but again it does not support block compression. The only issue is that Hadoop has little built-in support for JSON files, but third-party tools help a lot; a minimal example is sketched below.
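For illustration, here is a minimal sketch of what such a file looks like and how it can be read line by line in Pig. The file path, sample records, and field names are placeholders assumed for the example, not taken from the original.

```pig
-- events.json (hypothetical sample): one complete JSON document per line, e.g.
--   {"user": "alice", "action": "login"}
--   {"user": "bob", "action": "logout"}

-- Pig's built-in TextLoader yields each line as a single chararray,
-- so every tuple carries exactly one JSON datum.
raw = LOAD '/data/events.json' USING TextLoader() AS (line:chararray);
```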
Since the JSON files are expected to be in HDFS, we can leverage the HdfsDataFragmenter and HdfsAnalyzer. These classes are very generic and will fragment and analyze all files stored in HDFS, regardless of the actual data format underneath.
This chapter on input file formats in Hadoop is the 7th in the HDFS Tutorial Series. There are mainly 7 file formats supported by Hadoop; we will look at each one in detail here:
1. Text/CSV Files
2. JSON Records
3. Avro Files
4. Sequence Files
5. RC Files
6. ORC Files
7. Parquet Files
There's a nice article on this from the Hadoop in Practice book:
Twitter's elephant-bird library has a JsonStringToMap class which you can use with Pig.
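Below is a minimal sketch of how that UDF might be wired into a Pig script, assuming the elephant-bird Pig jar and its JSON dependencies are available on the cluster. The jar name, input path, and the 'user' field are assumptions made for illustration.

```pig
-- Register elephant-bird's Pig jar (exact jar name/version is an assumption).
REGISTER 'elephant-bird-pig.jar';

-- JsonStringToMap turns a JSON string into a Pig map of its top-level fields.
DEFINE JsonStringToMap com.twitter.elephantbird.pig.piggybank.JsonStringToMap();

raw    = LOAD '/data/events.json' USING TextLoader() AS (line:chararray);
parsed = FOREACH raw GENERATE JsonStringToMap(line) AS fields;

-- Pull a single field out of each record by key.
users  = FOREACH parsed GENERATE fields#'user' AS user;
DUMP users;
```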