
How does Spark read a file whose name begins with an underscore?

When I use Spark to parse log files, I notice that if the first character of the file name is _, the result is empty. Here is my test code:

SparkSession spark = SparkSession
  .builder()
  .appName("TestLog")
  .master("local")
  .getOrCreate();
JavaRDD<String> input = spark.read().text("D:\\_event_2.log").javaRDD();
System.out.println("size : " + input.count());

If I rename the file to event_2.log, the code runs correctly. I found that the text function is defined as:

@scala.annotation.varargs
def text(paths: String*): Dataset[String] = {
  format("text").load(paths : _*).as[String](sparkSession.implicits.newStringEncoder)
}

I think it could be because _ is Scala's placeholder. How can I avoid this problem?

asked Jul 20 '16 09:07 by iameven



1 Answer

This has nothing to do with Scala. Spark uses the Hadoop input API to read files, and that API ignores every file whose name starts with an underscore (_) or a dot (.).
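The rule can be sketched in plain Java to show why _event_2.log yields no rows while event_2.log is read normally. This is only an illustration of the filtering rule as described here, not Hadoop's or Spark's actual code; the class and method names (HiddenNameRule, isSkipped) are hypothetical.

```java
// Hypothetical illustration: file names starting with '_' or '.' are treated as hidden.
final class HiddenNameRule {

    // Takes a bare file name (no directory part) and reports whether
    // the rule described above would skip it.
    static boolean isSkipped(String fileName) {
        return fileName.startsWith("_") || fileName.startsWith(".");
    }

    public static void main(String[] args) {
        System.out.println(isSkipped("_event_2.log")); // true: leading underscore
        System.out.println(isSkipped("event_2.log"));  // false: underscore is not the first character
        System.out.println(isSkipped(".event_2.log")); // true: leading dot
    }
}
```

Only the first character of the file name matters, so an underscore elsewhere in the name (as in event_2.log) has no effect.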

I don't know how to disable this in Spark, though.
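One way to sidestep the filter rather than disable it, given that the question already notes the renamed file reads fine, is to copy the file to a name without the leading underscore before handing it to Spark. The following is a minimal sketch under that assumption; UnderscoreWorkaround and copyToVisibleName are hypothetical names, and the helper uses only the standard java.nio.file API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: makes a copy of the file under a name that does not
// start with '_' or '.', so a reader that skips such names will see it.
final class UnderscoreWorkaround {

    static Path copyToVisibleName(Path source) throws IOException {
        String name = source.getFileName().toString();
        // Strip every leading '_' or '.' from the file name.
        String visible = name.replaceFirst("^[_.]+", "");
        if (visible.equals(name)) {
            return source; // name is already visible, nothing to do
        }
        if (visible.isEmpty()) {
            throw new IllegalArgumentException("No usable name left after stripping: " + name);
        }
        // The copy is placed next to the original. Files.copy throws
        // FileAlreadyExistsException if the target already exists.
        Path target = source.resolveSibling(visible);
        return Files.copy(source, target);
    }
}
```

The returned path can then be passed to spark.read().text(...) in place of the original one. The original file is left untouched, at the cost of a second copy on disk.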

answered Oct 03 '22 02:10 by Kien Truong