
Why the types are all string while load csv to pyspark dataframe?

I have a CSV file that contains only numbers (no strings), of int and float types. But when I read it in PySpark like this:

df = spark.read.csv("s3://s3-cdp-prod-hive/novaya/instacart/data.csv",header=False)

all of the dataframe's columns are typed as string.

How can I read it in with int and float types automatically?

Some columns contain NaN values; in the file they appear as the lowercase string nan. Here is a sample row:

0.18277,-0.188931,0.0893389,0.119931,0.318853,-0.132933,-0.0288816,0.136137,0.12939,-0.245342,0.0608182,0.0802028,-0.00625962,0.271222,0.187855,0.132606,-0.0451533,0.140501,0.0704631,0.0229986,-0.0533376,-0.319643,-0.029321,-0.160937,0.608359,0.0513554,-0.246744,0.0817331,-0.410682,0.210652,0.375154,0.021617,0.119288,0.0674939,0.190642,0.161885,0.0385196,-0.341168,0.138659,-0.236908,0.230963,0.23714,-0.277465,0.242136,0.0165013,0.0462388,0.259744,-0.397228,-0.0143719,0.0891644,0.222225,0.0987765,0.24049,0.357596,-0.106266,-0.216665,0.191123,-0.0164234,0.370766,0.279462,0.46796,-0.0835098,0.112693,0.231951,-0.0942302,-0.178815,0.259096,-0.129323,1165491,175882,16.5708805975,6,0,2.80890261184,4.42114773551,0,23,0,13.4645462866,18.0359037455,11,30.0,0.0,11.4435397208,84.7504967125,30.0,5370,136.0,1.0,9.61508192633,62.2006926209,1,0,0,22340,9676,322.71241867,17.7282900627,1,100,4.24701125287,2.72260519248,0,6,17.9743048247,13.3241271262,0,23,82.4988407009,11.4021333588,0.0,30.0,45.1319021862,7.76284691137,1.0,66.0,9.40127026245,2.30880529144,1,73,0.113021725659,0.264843289305,0.0,0.986301369863,1,30450,0
asked Jun 19 '17 by yanachen


1 Answer

As you can see in the pyspark.sql.DataFrameReader.csv documentation:

inferSchema – infers the input schema automatically from data. It requires one extra pass over the data. If None is set, it uses the default value, false.

For NaN values, refer to the same docs above:

nanValue – sets the string representation of a non-number value. If None is set, it uses the default value, NaN

By setting inferSchema to True, you will obtain a dataframe with inferred types.

Here I put an example:

CSV file:

12,5,8,9
1.0,3,46,NaN

By default, inferSchema is False and all columns are read as string:

from pyspark.sql.types import *

>>> df = spark.read.csv("prova.csv",header=False) 
>>> df.dtypes
[('_c0', 'string'), ('_c1', 'string'), ('_c2', 'string'), ('_c3', 'string')]

>>> df.show()
+---+---+---+---+
|_c0|_c1|_c2|_c3|
+---+---+---+---+
| 12|  5|  8|  9|
|1.0|  3| 46|NaN|
+---+---+---+---+

If you set inferSchema to True:

>>> df = spark.read.csv("prova.csv", inferSchema=True, header=False)
>>> df.dtypes
[('_c0', 'double'), ('_c1', 'int'), ('_c2', 'int'), ('_c3', 'double')]


>>> df.show()
+----+---+---+---+
| _c0|_c1|_c2|_c3|
+----+---+---+---+
|12.0|  5|  8|9.0|
| 1.0|  3| 46|NaN|
+----+---+---+---+
answered Nov 17 '22 by titiro89