I have a CSV file that contains only numbers (no strings), a mix of int and float values. But when I read it in PySpark like this:
df = spark.read.csv("s3://s3-cdp-prod-hive/novaya/instacart/data.csv",header=False)
every column of the DataFrame has type string.
How can I read the columns as int and float automatically?
Some columns contain NaN values; in the file they are written as nan.
0.18277,-0.188931,0.0893389,0.119931,0.318853,-0.132933,-0.0288816,0.136137,0.12939,-0.245342,0.0608182,0.0802028,-0.00625962,0.271222,0.187855,0.132606,-0.0451533,0.140501,0.0704631,0.0229986,-0.0533376,-0.319643,-0.029321,-0.160937,0.608359,0.0513554,-0.246744,0.0817331,-0.410682,0.210652,0.375154,0.021617,0.119288,0.0674939,0.190642,0.161885,0.0385196,-0.341168,0.138659,-0.236908,0.230963,0.23714,-0.277465,0.242136,0.0165013,0.0462388,0.259744,-0.397228,-0.0143719,0.0891644,0.222225,0.0987765,0.24049,0.357596,-0.106266,-0.216665,0.191123,-0.0164234,0.370766,0.279462,0.46796,-0.0835098,0.112693,0.231951,-0.0942302,-0.178815,0.259096,-0.129323,1165491,175882,16.5708805975,6,0,2.80890261184,4.42114773551,0,23,0,13.4645462866,18.0359037455,11,30.0,0.0,11.4435397208,84.7504967125,30.0,5370,136.0,1.0,9.61508192633,62.2006926209,1,0,0,22340,9676,322.71241867,17.7282900627,1,100,4.24701125287,2.72260519248,0,6,17.9743048247,13.3241271262,0,23,82.4988407009,11.4021333588,0.0,30.0,45.1319021862,7.76284691137,1.0,66.0,9.40127026245,2.30880529144,1,73,0.113021725659,0.264843289305,0.0,0.986301369863,1,30450,0
As the documentation for spark.read.csv explains:
inferSchema – infers the input schema automatically from data. It requires one extra pass over the data. If None is set, it uses the default value, false.
For NaN values, refer to the same docs above:
nanValue – sets the string representation of a non-number value. If None is set, it uses the default value, NaN
By setting inferSchema to True, you get a DataFrame with inferred column types.
Here is an example:
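Applied to the file in the question, the two options above could be combined roughly as follows. This is only a sketch: `READ_OPTIONS` and `read_numeric_csv` are hypothetical names introduced for illustration, and `spark` is assumed to be an existing SparkSession. Because the file writes missing values as lowercase nan while the default nanValue is NaN, the option is set explicitly.

```python
# Options for a header-less, all-numeric CSV whose missing values are
# written as lowercase "nan" (as described in the question).
READ_OPTIONS = {
    "header": False,      # the first line is data, not column names
    "inferSchema": True,  # costs one extra pass over the data
    "nanValue": "nan",    # the default representation is "NaN"
}


def read_numeric_csv(spark, path):
    """Read `path` with the options above.

    `spark` is assumed to be an already-created SparkSession;
    this helper is hypothetical and not part of PySpark.
    """
    return spark.read.csv(path, **READ_OPTIONS)
```

After calling `df = read_numeric_csv(spark, path)`, it is worth checking `df.dtypes`: how a custom nanValue interacts with schema inference may depend on the Spark version, so a column that still comes back as string would need an explicit cast.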
CSV file:
12,5,8,9
1.0,3,46,NaN
By default, inferSchema is False and all values are String:
>>> from pyspark.sql.types import *
>>> df = spark.read.csv("prova.csv", header=False)
>>> df.dtypes
[('_c0', 'string'), ('_c1', 'string'), ('_c2', 'string'), ('_c3', 'string')]
>>> df.show()
+---+---+---+---+
|_c0|_c1|_c2|_c3|
+---+---+---+---+
| 12|  5|  8|  9|
|1.0|  3| 46|NaN|
+---+---+---+---+
If you set inferSchema as True:
>>> df = spark.read.csv("prova.csv", inferSchema=True, header=False)
>>> df.dtypes
[('_c0', 'double'), ('_c1', 'int'), ('_c2', 'int'), ('_c3', 'double')]
>>> df.show()
+----+---+---+---+
| _c0|_c1|_c2|_c3|
+----+---+---+---+
|12.0|  5|  8|9.0|
| 1.0|  3| 46|NaN|
+----+---+---+---+
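To see why _c0 and _c3 come out as double while _c1 and _c2 stay int, it may help to picture the extra pass as scanning every value and keeping the widest type seen per column. The following is a plain-Python illustration of that idea, not Spark's actual implementation; `classify` and `infer_column_types` are hypothetical names, and only int, double and string are modelled.

```python
import csv
import io
import math

TYPE_ORDER = ["int", "double", "string"]  # narrowest to widest


def classify(token, nan_value="NaN"):
    """Return the narrowest type name that can hold a single CSV token."""
    if token == nan_value:
        return "double"
    try:
        int(token)
        return "int"
    except ValueError:
        pass
    try:
        value = float(token)
    except ValueError:
        return "string"
    # Python's float() also accepts spellings such as "nan" or "inf";
    # in this sketch only the configured nan_value counts as a number.
    return "double" if math.isfinite(value) else "string"


def infer_column_types(csv_text, nan_value="NaN"):
    """Scan all rows once and keep the widest type seen in each column."""
    widest = []
    for row in csv.reader(io.StringIO(csv_text)):
        for i, token in enumerate(row):
            rank = TYPE_ORDER.index(classify(token, nan_value))
            if i == len(widest):
                widest.append(rank)
            else:
                widest[i] = max(widest[i], rank)
    return [TYPE_ORDER[rank] for rank in widest]


print(infer_column_types("12,5,8,9\n1.0,3,46,NaN\n"))
```

In this sketch the first column mixes 12 and 1.0, so it widens to double, and the last column mixes 9 and NaN, so it widens to double as well. A token that matches neither a number nor the configured NaN representation pushes its whole column to string.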