I'm working on a spark mllib algorithm. The dataset I have is in this form
Company":"XXXX","CurrentTitle":"XYZ","Edu_Title":"ABC","Exp_mnth":.(there are more values similar to these)
Im trying to raw code String values to Numeric values. So, I tried using zipwithuniqueID for unique value for each of the string values.For some reason I'm not able to save the modified dataset to the disk. Can I do this in any way using spark SQL? or what would be the better approach for this?
The row_number() is a window function in Spark SQL that assigns a row number (sequence number) to each row in the result Dataset. This function is used with Window. partitionBy() which partitions the data into windows frames and orderBy() clause to sort the rows in each partition.
Use monotonically_increasing_id() for unique, but not consecutive numbers. The monotonically_increasing_id() function generates monotonically increasing 64-bit integers. The generated id numbers are guaranteed to be increasing and unique, but they are not guaranteed to be consecutive.
In PySpark, to add a new column to DataFrame use lit() function by importing from pyspark. sql. functions import lit , lit() function takes a constant value you wanted to add and returns a Column type, if you wanted to add a NULL / None use lit(None) .
Update the column value Spark withColumn() function of the DataFrame is used to update the value of a column. withColumn() function takes 2 arguments; first the column you wanted to update and the second the value you wanted to update with.
Scala
val dataFrame1 = dataFrame0.withColumn("index",monotonically_increasing_id())
Java
Import org.apache.spark.sql.functions;
Dataset<Row> dataFrame1 = dataFrame0.withColumn("index",functions.monotonically_increasing_id());
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With