
Why the types are all string while load csv to pyspark dataframe?

I have a CSV file that contains only numbers (no strings), of int and float types. But when I read it in PySpark like this:

df = spark.read.csv("s3://s3-cdp-prod-hive/novaya/instacart/data.csv",header=False)

all of the dataframe's columns are typed as string.

How can I read it in with int and float types automatically?

Some columns contain NaN values; in the file they appear as the lowercase string nan. Here is a sample row:

0.18277,-0.188931,0.0893389,0.119931,0.318853,-0.132933,-0.0288816,0.136137,0.12939,-0.245342,0.0608182,0.0802028,-0.00625962,0.271222,0.187855,0.132606,-0.0451533,0.140501,0.0704631,0.0229986,-0.0533376,-0.319643,-0.029321,-0.160937,0.608359,0.0513554,-0.246744,0.0817331,-0.410682,0.210652,0.375154,0.021617,0.119288,0.0674939,0.190642,0.161885,0.0385196,-0.341168,0.138659,-0.236908,0.230963,0.23714,-0.277465,0.242136,0.0165013,0.0462388,0.259744,-0.397228,-0.0143719,0.0891644,0.222225,0.0987765,0.24049,0.357596,-0.106266,-0.216665,0.191123,-0.0164234,0.370766,0.279462,0.46796,-0.0835098,0.112693,0.231951,-0.0942302,-0.178815,0.259096,-0.129323,1165491,175882,16.5708805975,6,0,2.80890261184,4.42114773551,0,23,0,13.4645462866,18.0359037455,11,30.0,0.0,11.4435397208,84.7504967125,30.0,5370,136.0,1.0,9.61508192633,62.2006926209,1,0,0,22340,9676,322.71241867,17.7282900627,1,100,4.24701125287,2.72260519248,0,6,17.9743048247,13.3241271262,0,23,82.4988407009,11.4021333588,0.0,30.0,45.1319021862,7.76284691137,1.0,66.0,9.40127026245,2.30880529144,1,73,0.113021725659,0.264843289305,0.0,0.986301369863,1,30450,0
asked Jun 19 '17 by yanachen


1 Answer

As you can see in the pyspark.sql.DataFrameReader.csv documentation:

inferSchema – infers the input schema automatically from data. It requires one extra pass over the data. If None is set, it uses the default value, false.

For NaN values, refer to the same docs above:

nanValue – sets the string representation of a non-number value. If None is set, it uses the default value, NaN

By setting inferSchema to True, you will obtain a dataframe with inferred types.

Here I put an example:

CSV file:

12,5,8,9
1.0,3,46,NaN

By default, inferSchema is False and all columns are read as string:

from pyspark.sql.types import *

>>> df = spark.read.csv("prova.csv",header=False) 
>>> df.dtypes
[('_c0', 'string'), ('_c1', 'string'), ('_c2', 'string'), ('_c3', 'string')]

>>> df.show()
+---+---+---+---+
|_c0|_c1|_c2|_c3|
+---+---+---+---+
| 12|  5|  8|  9|
|1.0|  3| 46|NaN|
+---+---+---+---+

If you set inferSchema to True:

>>> df = spark.read.csv("prova.csv", inferSchema=True, header=False)
>>> df.dtypes
[('_c0', 'double'), ('_c1', 'int'), ('_c2', 'int'), ('_c3', 'double')]


>>> df.show()
+----+---+---+---+
| _c0|_c1|_c2|_c3|
+----+---+---+---+
|12.0|  5|  8|9.0|
| 1.0|  3| 46|NaN|
+----+---+---+---+
answered Nov 17 '22 by titiro89