When I use Spark to parse log files, I notice that if the first character of the filename is an underscore (_), the result is empty. Here is my test code:
SparkSession spark = SparkSession
.builder()
.appName("TestLog")
.master("local")
.getOrCreate();
JavaRDD<String> input = spark.read().text("D:\\_event_2.log").javaRDD();
System.out.println("size : " + input.count());
If I rename the file to event_2.log, the code runs correctly.
I found that the text function is defined as:
@scala.annotation.varargs
def text(paths: String*): Dataset[String] = {
format("text").load(paths : _*).as[String](sparkSession.implicits.newStringEncoder)
}
I think it could be because _ is Scala's placeholder. How can I avoid this problem?
text("file_name") reads a file or directory of text files into a Spark DataFrame, and dataframe.write().text("path") writes a DataFrame out as a text file. When reading a text file, each line becomes a row in a single string column named "value" by default.
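As a rough plain-Java analogy (no Spark involved; the file contents here are made up), reading the file line by line is exactly what populates that single "value" column, one row per line:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class TextReadSketch {
    // Each line of the file becomes one "row"; Spark's text source exposes
    // the same lines as a single string column named "value".
    static List<String> readRows(Path file) throws IOException {
        return Files.readAllLines(file);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file and contents, just to mirror input.count() above.
        Path tmp = Files.createTempFile("event_2", ".log");
        Files.write(tmp, Arrays.asList("line one", "line two"));
        System.out.println("size : " + readRows(tmp).size());
    }
}
```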
Also, as with any other file system, Spark can read and write TEXT, CSV, Avro, Parquet, and JSON files in HDFS. Spark RDDs natively support reading text files; later, with DataFrames, Spark added data sources such as CSV, JSON, Avro, and Parquet.
This has nothing to do with Scala. Spark uses the Hadoop input API to read files, and that API ignores every file whose name starts with an underscore (_) or a dot (.).
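That filter amounts to a simple predicate on the file name. A plain-Java sketch of the rule (mirroring Hadoop's default hidden-file filter, which is also why _SUCCESS marker files are skipped on read):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class HiddenFileFilterDemo {
    // Mirrors the predicate Hadoop's default hidden-file filter applies to
    // each candidate input path: names starting with "_" or "." are skipped.
    static boolean isVisible(String fileName) {
        return !fileName.startsWith("_") && !fileName.startsWith(".");
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("_event_2.log", ".hidden.log", "event_2.log", "_SUCCESS");
        List<String> read = files.stream()
                .filter(HiddenFileFilterDemo::isVisible)
                .collect(Collectors.toList());
        System.out.println(read); // only event_2.log survives the filter
    }
}
```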
I don't know how to disable this in Spark though.
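The simplest workaround is to copy (or rename) such a file to a visible name before handing it to Spark. A minimal sketch with java.nio.file; the name-stripping rule and paths here are my own assumptions, not Spark or Hadoop API:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class RenameBeforeRead {
    // Copies a file whose name starts with "_" or "." to a sibling file with
    // those leading characters stripped, so Hadoop's input filter no longer
    // skips it. Returns the path of the visible copy.
    static Path makeVisible(Path hidden) throws IOException {
        String name = hidden.getFileName().toString();
        Path visible = hidden.resolveSibling(name.replaceFirst("^[_.]+", ""));
        return Files.copy(hidden, visible, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

After the copy, pointing spark.read().text(...) at the visible path (e.g. D:\\event_2.log instead of D:\\_event_2.log) reads the lines as expected.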