I'm reading a CSV file into a DataFrame:
dataframe = spark.read.csv(fileName, header=True)
but every column in the DataFrame comes back as string, and I want to change the type to float. Is there an efficient way to do this?
In PySpark SQL, the cast() function of the Column class converts a DataFrame column from StringType to DoubleType or FloatType. It takes either a string naming the target type (for example "double") or any subclass of DataType. The cast can be applied through withColumn(), selectExpr(), or a SQL expression, and works the same way for other conversions such as string to integer or string to boolean; a short sketch of these variants follows below. Relevant type classes include DoubleType (a floating-point double value), IntegerType (an integer value), LongType (a long integer value), ShortType (a short integer value), and NullType (a null value).
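To make those variants concrete, here is a minimal sketch, assuming a SparkSession named spark and a hypothetical string column named "column":

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: one string column named "column"
df = spark.createDataFrame([("1.5",), ("2.5",)], ["column"])

# 1) withColumn + cast(), passing either a type name or a DataType instance
df1 = df.withColumn("column", col("column").cast("double"))
df1 = df.withColumn("column", col("column").cast(DoubleType()))

# 2) selectExpr with a SQL CAST expression
df2 = df.selectExpr("CAST(column AS DOUBLE) AS column")

# 3) plain SQL against a temporary view
df.createOrReplaceTempView("tbl")
df3 = spark.sql("SELECT CAST(column AS DOUBLE) AS column FROM tbl")

print(df1.dtypes)  # [('column', 'double')]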
The most straightforward way to achieve this is by casting.
from pyspark.sql.functions import col
dataframe = dataframe.withColumn("float", col("column").cast("double"))
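Since your file has several columns, you will probably want to cast them all at once. A minimal sketch, assuming dataframe is the result of your spark.read.csv call and every column should become a double:

from pyspark.sql.functions import col

# Cast every column to double in a single select, keeping the original names
dataframe = dataframe.select(
    [col(c).cast("double").alias(c) for c in dataframe.columns]
)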
If you want to do the casting when reading the CSV, you can use the inferSchema argument when reading the data. Let's try it with a small test CSV file:
$ cat ../data/test.csv
a,b,c,d
5.0, 1.0, 1.0, 3.0
2.0, 0.0, 3.0, 4.0
4.0, 0.0, 0.0, 6.0
Now, if we read it as you did, we will have string values:
>>> df_csv = spark.read.csv("../data/test.csv", header=True)
>>> print(df_csv.dtypes)
[('a', 'string'), ('b', 'string'), ('c', 'string'), ('d', 'string')]
However, if we set inferSchema to True, it will correctly identify them as doubles:
>>> df_csv2 = spark.read.csv("../data/test.csv", header=True, inferSchema=True)
>>> print(df_csv2.dtypes)
[('a', 'double'), ('b', 'double'), ('c', 'double'), ('d', 'double')]
Note, however, that this approach requires an additional pass over the data to infer the types. You can find more information in the DataFrameReader CSV documentation.
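If you want to avoid that extra pass, another common option is to supply an explicit schema to the reader. A minimal sketch, assuming the same test.csv layout as above:

from pyspark.sql.types import StructType, StructField, DoubleType

# Declare the column types up front so Spark does not need to scan the file
schema = StructType([
    StructField("a", DoubleType(), True),
    StructField("b", DoubleType(), True),
    StructField("c", DoubleType(), True),
    StructField("d", DoubleType(), True),
])

df_csv3 = spark.read.csv("../data/test.csv", header=True, schema=schema)
print(df_csv3.dtypes)
# [('a', 'double'), ('b', 'double'), ('c', 'double'), ('d', 'double')]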