Given an HDFS path, how can you figure out what format a file is (text, sequence, or Parquet)?
Your answer: You can use the Hadoop filesystem commands to inspect any file. hadoop fs -cat prints a file's content, and hadoop fs -ls lists the files under a path.

Usage: hadoop fs -ls [-d] [-h] [-R] [-t] [-S] [-r] [-u] <args>
Options:
-d: Directories are listed as plain files.
-h: Format file sizes in a human-readable fashion (e.g. 64.0m instead of 67108864).
-R: Recursively list subdirectories encountered.
-t: Sort output by modification time (most recent first).
Standard Hadoop Storage File Formats: common file formats include text files (CSV, XML) and binary files (e.g. images). Text data comes in the form of CSV files or unstructured data such as tweets. CSV files are commonly used for exchanging data between Hadoop and external systems.
I think it's not easy to do this reliably, unless all your files inside HDFS follow a naming convention, e.g. .txt for text, .seq for sequence, and .parquet for Parquet files.

However, you can check a file manually using cat.

HDFS cat: hdfs dfs -cat /path/to/file | head
to check whether it's a text file.

Parquet head: parquet-tools head [option...] /path/to/file

Or write a program that tries to read the file in each format.
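As a sketch of the naming-convention approach, assuming your files actually carry their format as a suffix, you could tally the extensions under a path. count_extensions is a hypothetical helper, and the HDFS path in the usage comment is illustrative:

```shell
#!/bin/sh
# Hypothetical helper: tally file extensions from an `hdfs dfs -ls -R`
# listing read on stdin. Only meaningful if files follow naming conventions.
count_extensions() {
  awk '{print $NF}' |                       # last column is the file path
    sed -n 's/.*\.\([A-Za-z0-9]*\)$/\1/p' | # keep the extension, if any
    sort | uniq -c | sort -rn               # count occurrences, most common first
}

# Usage (illustrative path):
#   hdfs dfs -ls -R /user/data | count_extensions
```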
Use "hdfs dfs -cat /path/to/file | head":
1) For an ORC file, the command prints the "ORC" magic bytes on the first line.
2) For a Parquet file, the command prints the "PAR1" magic bytes on the first line.
3) For a text file, the command prints readable content.
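The checks above can be sketched as a small helper that classifies a stream by its leading magic bytes. detect_format is a hypothetical name; SequenceFile headers begin with "SEQ", Parquet files with "PAR1", and ORC files with "ORC":

```shell
#!/bin/sh
# Hypothetical sketch: classify a byte stream by its leading magic bytes.
# SequenceFiles start with "SEQ", Parquet files with "PAR1", ORC files
# with "ORC"; anything else may be plain text.
detect_format() {
  magic=$(head -c 4 | tr -d '\0')  # read the first four bytes
  case "$magic" in
    SEQ*) echo "sequence" ;;
    PAR1) echo "parquet"  ;;
    ORC*) echo "orc"      ;;
    *)    echo "unknown (possibly text)" ;;
  esac
}

# Usage (illustrative path):
#   hdfs dfs -cat /path/to/file | detect_format
```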
If your files follow naming conventions, you can also check the extension in code:
String extension = FilenameUtils.getExtension("hdfs://path-to-file");
Working with Hadoop 2.5.2.