How to convert a column that has been read as a string into a column of arrays? i.e. convert from the schema below:
scala> test.printSchema
root
 |-- a: long (nullable = true)
 |-- b: string (nullable = true)

+---+---+
|  a|  b|
+---+---+
|  1|2,3|
|  2|4,5|
+---+---+
To:
scala> test1.printSchema
root
 |-- a: long (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: long (containsNull = true)

+---+-----+
|  a|    b|
+---+-----+
|  1|[2,3]|
|  2|[4,5]|
+---+-----+
Please share both Scala and Python implementations if possible. On a related note, how do I take care of this while reading from the file itself? I have data with ~450 columns, and a few of them I want to specify in this format. Currently I am reading in pyspark as below:
df = spark.read.format('com.databricks.spark.csv').options(
    header='true', inferschema='true', delimiter='|').load(input_file)
Thanks.
To convert an array to a string, PySpark SQL provides the built-in function concat_ws(), which takes a delimiter of your choice as the first argument and the array column (type Column) as the second argument. To use concat_ws(), you need to import it from pyspark.sql.functions.
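As an illustration of that direction (array back to string), here is a minimal sketch, assuming a SparkSession named spark and a small hypothetical DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame whose column b is an array of strings
df = spark.createDataFrame([(1, ["2", "3"]), (2, ["4", "5"])], ["a", "b"])

# Join the array elements with a comma; b becomes the plain string "2,3" / "4,5"
df.withColumn("b", concat_ws(",", "b")).show()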
To change a Spark SQL DataFrame column from one data type to another, use the cast() function of the Column class; it can be used with withColumn(), select(), selectExpr(), and SQL expressions.
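For example, a minimal sketch of cast() with withColumn() and selectExpr(), again assuming a SparkSession named spark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame where column a arrives as a string
df = spark.createDataFrame([("1", "2,3"), ("2", "4,5")], ["a", "b"])

# cast() on the Column class, applied via withColumn()
df.withColumn("a", col("a").cast("long")).printSchema()

# The same cast expressed as a SQL expression via selectExpr()
df.selectExpr("CAST(a AS LONG) AS a", "b").printSchema()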
In PySpark SQL, the split() function converts a delimiter-separated string into an array. It does this by splitting the string on a delimiter such as a space, comma, or pipe and stacking the pieces into an array.
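A minimal sketch of split() on the question's data, assuming a SparkSession named spark (note that split() alone yields array<string>; casting to array<long> is covered further down):

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "2,3"), (2, "4,5")], ["a", "b"])

# Split the comma-separated string; b becomes an array<string> column
df.withColumn("b", split(col("b"), ",")).printSchema()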
The Scala API offers the same built-in concat_ws() function for turning an array back into a string, taking the delimiter of your choice as the first argument and the array column (type Column) as the second argument: concat_ws(sep: scala.Predef.String, exprs: org.apache.spark.sql.Column*): org.apache.spark.sql.Column
A related situation arises with JSON data: after JSON is read into a data frame through sqlContext, a column such as attr_2 can come in as a string holding a JSON array, when the schema you actually want for it is an array of struct.
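A minimal sketch of handling that case with from_json(), where attr_2 and its fields name and value are hypothetical stand-ins for whatever the real JSON elements contain:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json
from pyspark.sql.types import ArrayType, StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: attr_2 holds a JSON array as a plain string
df = spark.createDataFrame(
    [(1, '[{"name": "x", "value": 2}, {"name": "y", "value": 3}]')],
    ["attr_1", "attr_2"])

# Assumed element schema; replace with the actual fields of your JSON
element = StructType([
    StructField("name", StringType()),
    StructField("value", LongType())])

# Parse the JSON array string into an array of structs
df.withColumn("attr_2", from_json("attr_2", ArrayType(element))).printSchema()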
PySpark Convert String to Array Column. PySpark SQL provides the split() function to convert a delimiter-separated string into an array (StringType to ArrayType) column on a DataFrame. This is done by splitting a string column on a delimiter like a space, comma, or pipe, etc., and converting it into ArrayType.
Spark DataFrame columns support arrays, which are great for data sets that have an arbitrary length. This blog post will demonstrate Spark methods that return ArrayType columns, describe how to create your own ArrayType columns, and explain when to use arrays in your analyses.
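As a small illustration of why the array form is useful, here is a sketch (assuming a SparkSession named spark) that builds an array column with split() and then applies array-oriented functions to it:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col, explode, array_contains

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "2,3"), (2, "4,5")], ["a", "b"])
arrays = df.withColumn("b", split(col("b"), ","))

# Array columns unlock array-specific operations:
arrays.select("a", explode("b").alias("element")).show()            # one row per element
arrays.select("a", array_contains("b", "4").alias("has_4")).show()  # membership test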
There are various methods; the best way is to use the split function and cast to array<long>:
import org.apache.spark.sql.functions.{col, split, udf}

// split on "," and cast the resulting array<string> to array<long>
data.withColumn("b", split(col("b"), ",").cast("array<long>"))
You can also create a simple UDF to convert the values:
val tolong = udf((value: String) => value.split(",").map(_.toLong))
data.withColumn("newB", tolong(data("b"))).show
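Since the question also asks for Python, here is a rough PySpark equivalent of the same two approaches (a sketch; data and b mirror the names used above):

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col, udf
from pyspark.sql.types import ArrayType, LongType

spark = SparkSession.builder.getOrCreate()
data = spark.createDataFrame([(1, "2,3"), (2, "4,5")], ["a", "b"])

# 1) split + cast, mirroring the Scala one-liner above
data.withColumn("b", split(col("b"), ",").cast("array<long>")).printSchema()

# 2) a simple UDF, mirroring the Scala tolong UDF
to_long = udf(lambda value: [int(x) for x in value.split(",")], ArrayType(LongType()))
data.withColumn("newB", to_long(data["b"])).show()

As for handling this at read time: the CSV source does not support array columns in the schema directly, so even with ~450 columns the usual pattern is to read those few columns as strings and apply the split/cast right after load().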
Hope this helps!