I'm using Apache Spark 2.3.0. When I load a CSV file and then call df.show, the table is displayed with all null values, and I would like to know why, because everything looks fine in the CSV.
import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("Rank", StringType, true),
  StructField("Grade", StringType, true),
  StructField("Channelname", StringType, true),
  StructField("Video Uploads", IntegerType, true),
  StructField("Suscribers", IntegerType, true),
  StructField("Videoviews", IntegerType, true)))

val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").schema(schema).load("data.csv")

Here is the CSV file (data.csv):
Rank,Grade,Channelname,VideoUploads,Subscribers,Videoviews
1st,A++ ,Zee TV,82757,18752951,20869786591
2nd,A++ ,T-Series,12661,61196302,47548839843
3rd,A++ ,Cocomelon - Nursery Rhymes,373,19238251,9793305082
4th,A++ ,SET India,27323,31180559,22675948293
5th,A++ ,WWE,36756,32852346,26273668433
6th,A++ ,Movieclips,30243,17149705,16618094724
7th,A++ ,netd müzik,8500,11373567,23898730764
8th,A++ ,ABS-CBN Entertainment,100147,12149206,17202609850
9th,A++ ,Ryan ToysReview,1140,16082927,24518098041
10th,A++ ,Zee Marathi,74607,2841811,2591830307
11th,A+ ,5-Minute Crafts,2085,33492951,8587520379
12th,A+ ,Canal KondZilla,822,39409726,19291034467
13th,A+ ,Like Nastya Vlog,150,7662886,2540099931
14th,A+ ,Ozuna,50,18824912,8727783225
15th,A+ ,Wave Music,16119,15899764,10989179147
16th,A+ ,Ch3Thailand,49239,11569723,9388600275
17th,A+ ,WORLDSTARHIPHOP,4778,15830098,11102158475
18th,A+ ,Vlad and Nikita,53,-- ,1428274554
So if we load the file without a schema, we see the following:
scala> val df = spark.read.format("com.databricks.spark.csv").option("header","true").load("data.csv")
df: org.apache.spark.sql.DataFrame = [Rank: string, Grade: string ... 4 more fields]
scala> df.show
+----+-----+--------------------+------------+-----------+-----------+
|Rank|Grade| Channelname|VideoUploads|Subscribers| Videoviews|
+----+-----+--------------------+------------+-----------+-----------+
| 1st| A++ | Zee TV| 82757| 18752951|20869786591|
| 2nd| A++ | T-Series| 12661| 61196302|47548839843|
| 3rd| A++ |Cocomelon - Nurse...| 373| 19238251| 9793305082|
| 4th| A++ | SET India| 27323| 31180559|22675948293|
| 5th| A++ | WWE| 36756| 32852346|26273668433|
| 6th| A++ | Movieclips| 30243| 17149705|16618094724|
| 7th| A++ | netd müzik| 8500| 11373567|23898730764|
| 8th| A++ |ABS-CBN Entertain...| 100147| 12149206|17202609850|
| 9th| A++ | Ryan ToysReview| 1140| 16082927|24518098041|
|10th| A++ | Zee Marathi| 74607| 2841811| 2591830307|
|11th| A+ | 5-Minute Crafts| 2085| 33492951| 8587520379|
|12th| A+ | Canal KondZilla| 822| 39409726|19291034467|
|13th| A+ | Like Nastya Vlog| 150| 7662886| 2540099931|
|14th| A+ | Ozuna| 50| 18824912| 8727783225|
|15th| A+ | Wave Music| 16119| 15899764|10989179147|
|16th| A+ | Ch3Thailand| 49239| 11569723| 9388600275|
|17th| A+ | WORLDSTARHIPHOP| 4778| 15830098|11102158475|
|18th| A+ | Vlad and Nikita| 53| -- | 1428274554|
+----+-----+--------------------+------------+-----------+-----------+
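Note that without an explicit schema (and with inferSchema left at its default of false), every column is simply read as a string, so nothing fails to parse. Checking with printSchema should show something like:

scala> df.printSchema
root
 |-- Rank: string (nullable = true)
 |-- Grade: string (nullable = true)
 |-- Channelname: string (nullable = true)
 |-- VideoUploads: string (nullable = true)
 |-- Subscribers: string (nullable = true)
 |-- Videoviews: string (nullable = true)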
If we apply your schema, we see this:
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> val schema = StructType(Array(StructField("Rank",StringType,true),StructField("Grade", StringType, true),StructField("Channelname",StringType,true),StructField("Video Uploads",IntegerType,true), StructField("Suscribers",IntegerType,true),StructField("Videoviews",IntegerType,true)))
scala> val df = spark.read.format("com.databricks.spark.csv").option("header","true").schema(schema).load("data.csv")
df: org.apache.spark.sql.DataFrame = [Rank: string, Grade: string ... 4 more fields]
scala> df.show
+----+-----+-----------+-------------+----------+----------+
|Rank|Grade|Channelname|Video Uploads|Suscribers|Videoviews|
+----+-----+-----------+-------------+----------+----------+
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
+----+-----+-----------+-------------+----------+----------+
Now if we look at your data, we see that Subscribers contains non-integer values ("--") and Videoviews contains values which exceed the maximum Integer value (2,147,483,647).
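You can verify the overflow quickly in the REPL; Scala's toInt delegates to Integer.parseInt, which rejects anything outside the Int range (the exact REPL echo may differ):

scala> "20869786591".toInt
java.lang.NumberFormatException: For input string: "20869786591"

scala> "20869786591".toLong
res1: Long = 20869786591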
So if we change the schema to conform to the data:
scala> val schema = StructType(Array(StructField("Rank",StringType,true),StructField("Grade", StringType, true),StructField("Channelname",StringType,true),StructField("Video Uploads",IntegerType,true), StructField("Suscribers",StringType,true),StructField("Videoviews",LongType,true)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(Rank,StringType,true), StructField(Grade,StringType,true), StructField(Channelname,StringType,true), StructField(Video Uploads,IntegerType,true), StructField(Suscribers,StringType,true), StructField(Videoviews,LongType,true))
scala> val df = spark.read.format("com.databricks.spark.csv").option("header","true").schema(schema).load("data.csv")
df: org.apache.spark.sql.DataFrame = [Rank: string, Grade: string ... 4 more fields]
scala> df.show
+----+-----+--------------------+-------------+----------+-----------+
|Rank|Grade| Channelname|Video Uploads|Suscribers| Videoviews|
+----+-----+--------------------+-------------+----------+-----------+
| 1st| A++ | Zee TV| 82757| 18752951|20869786591|
| 2nd| A++ | T-Series| 12661| 61196302|47548839843|
| 3rd| A++ |Cocomelon - Nurse...| 373| 19238251| 9793305082|
| 4th| A++ | SET India| 27323| 31180559|22675948293|
| 5th| A++ | WWE| 36756| 32852346|26273668433|
| 6th| A++ | Movieclips| 30243| 17149705|16618094724|
| 7th| A++ | netd müzik| 8500| 11373567|23898730764|
| 8th| A++ |ABS-CBN Entertain...| 100147| 12149206|17202609850|
| 9th| A++ | Ryan ToysReview| 1140| 16082927|24518098041|
|10th| A++ | Zee Marathi| 74607| 2841811| 2591830307|
|11th| A+ | 5-Minute Crafts| 2085| 33492951| 8587520379|
|12th| A+ | Canal KondZilla| 822| 39409726|19291034467|
|13th| A+ | Like Nastya Vlog| 150| 7662886| 2540099931|
|14th| A+ | Ozuna| 50| 18824912| 8727783225|
|15th| A+ | Wave Music| 16119| 15899764|10989179147|
|16th| A+ | Ch3Thailand| 49239| 11569723| 9388600275|
|17th| A+ | WORLDSTARHIPHOP| 4778| 15830098|11102158475|
|18th| A+ | Vlad and Nikita| 53| -- | 1428274554|
+----+-----+--------------------+-------------+----------+-----------+
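From here, if you want Suscribers to be numeric as well, one option is to treat the "--" placeholder as null and cast the rest to Long. This is just a sketch using the standard DataFrame functions:

import org.apache.spark.sql.functions._

// Treat "--" as null and cast everything else to Long.
// (trim handles the trailing spaces in values like "-- ".)
val cleaned = df.withColumn("Suscribers",
  when(trim(col("Suscribers")) === "--", lit(null))
    .otherwise(trim(col("Suscribers")))
    .cast("long"))

This leaves a null in the affected row (Vlad and Nikita), which you can then handle with na.fill or a filter as you see fit.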
The reason for the null values is that the default "mode" for the csv API is PERMISSIVE:
mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing. It supports the following case-insensitive modes:
- PERMISSIVE: sets other fields to null when it meets a corrupted record, and puts the malformed string into a field configured by columnNameOfCorruptRecord (see the sketch after this list). To keep corrupt records, a user can set a string-type field named columnNameOfCorruptRecord in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When the number of parsed CSV tokens is shorter than the length the schema expects, it sets null for the extra fields.
- DROPMALFORMED: ignores the whole corrupted record.
- FAILFAST: throws an exception when it meets corrupted records.
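This is why your original schema produced rows that were entirely null: every row failed to parse either Suscribers or Videoviews, so PERMISSIVE treated the whole record as corrupt and nulled out all of its fields. If you would rather see which rows were malformed, here is a minimal sketch (the _corrupt_record column name is the conventional default for columnNameOfCorruptRecord):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Add a string field to receive the raw text of any record that fails to parse.
val schemaWithCorrupt = schema.add(StructField("_corrupt_record", StringType, true))

val dfWithCorrupt = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schemaWithCorrupt)
  .load("data.csv")

// Cleanly parsed rows have _corrupt_record = null; malformed rows keep their raw line.
dfWithCorrupt.filter(col("_corrupt_record").isNotNull).show(false)

Alternatively, .option("mode", "FAILFAST") makes the read throw on the first bad record instead of silently nulling it out.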
See the csv API documentation for the full list of options.