Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can Spark SQL not count correctly or can I not write SQL correctly?

Tags:

In a Python notebook on Databricks "Community Edition", I'm experimenting with the City of San Francisco open data about emergency calls to 911 requesting firefighters. (The old 2016 copy of the data used in "Using Apache Spark 2.0 to Analyze the City of San Francisco's Open Data" (YouTube) and made available on S3 for that tutorial.)

After mounting the data and reading it with the explicitly defined schema into a DataFrame fire_service_calls_df, I aliased that DataFrame as an SQL table:

sqlContext.registerDataFrameAsTable(fire_service_calls_df, "fireServiceCalls")

With that and the DataFrame API, I can count the call types that occurred:

fire_service_calls_df.select('CallType').distinct().count()
Out[n]: 34

... or with SQL in Python:

spark.sql("""
SELECT count(DISTINCT CallType)
FROM fireServiceCalls
""").show()
+------------------------+
|count(DISTINCT CallType)|
+------------------------+
|                      33|
+------------------------+

... or with an SQL cell:

%sql

SELECT count(DISTINCT CallType)
FROM fireServiceCalls

Databricks table output with column "count(DISTINCT CallType)" and single value "33".

Why do I get two different count results? (It seems like 34 is the correct one, even though the talk in the video and the accompanying tutorial notebook mention "35".)

like image 992
das-g Avatar asked Mar 13 '18 20:03

das-g


People also ask

What is the difference between Spark SQL and SQL?

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. It enables unmodified Hadoop Hive queries to run up to 100x faster on existing deployments and data.

Can Spark use regular SQL statements?

Spark SQL allows us to query structured data inside Spark programs, using SQL or a DataFrame API which can be used in Java, Scala, Python and R.

Which of the following is true for Spark SQL?

15. Which of the following is true for Spark core? It enables users to run SQL / HQL queries on the top of Spark.

Is Spark SQL more efficient?

These facts lead us to the conclusion that Data Professional using Big SQL are 3x more productive than those using Spark SQL.


1 Answers

To answer the question

Can Spark SQL not count correctly or can I not write SQL correctly?

from the title: I can't write SQL correctly.

Rule <insert number> of writing SQL: Think about NULL and UNDEFINED.

%sql
SELECT count(*)
FROM (
  SELECT DISTINCT CallType
  FROM fireServiceCalls 
)

34

Also, i apparently can't read:

pault suggested in a comment

With only 30 something values, you could just sort and print all the distinct items to see where the difference is.

Well, I actually thought of that myself. (Minus the sorting.) Except, there wasn't any difference, there were always 34 call types in the output, whether I generated it with SQL or DataFrame queries. I simply didn't notice that one of them was ominously named null:

+--------------------------------------------+
|CallType                                    |
+--------------------------------------------+
|Elevator / Escalator Rescue                 |
|Marine Fire                                 |
|Aircraft Emergency                          |
|Confined Space / Structure Collapse         |
|Administrative                              |
|Alarms                                      |
|Odor (Strange / Unknown)                    |
|Lightning Strike (Investigation)            |
|null                                        |
|Citizen Assist / Service Call               |
|HazMat                                      |
|Watercraft in Distress                      |
|Explosion                                   |
|Oil Spill                                   |
|Vehicle Fire                                |
|Suspicious Package                          |
|Train / Rail Fire                           |
|Extrication / Entrapped (Machinery, Vehicle)|
|Other                                       |
|Transfer                                    |
|Outside Fire                                |
|Traffic Collision                           |
|Assist Police                               |
|Gas Leak (Natural and LP Gases)             |
|Water Rescue                                |
|Electrical Hazard                           |
|High Angle Rescue                           |
|Structure Fire                              |
|Industrial Accidents                        |
|Medical Incident                            |
|Mutual Aid / Assist Outside Agency          |
|Fuel Spill                                  |
|Smoke Investigation (Outside)               |
|Train / Rail Incident                       |
+--------------------------------------------+
like image 176
das-g Avatar answered Sep 22 '22 13:09

das-g