
How can you update values in a dataset?

As far as I know, Apache Spark doesn't have functionality that imitates the SQL UPDATE command, i.e. changing a single value in a column given a certain condition. The only way around that is to use the following command I was instructed to use (here on Stack Overflow): withColumn(columnName, where('condition', value));

However, the condition should be of Column type, meaning I have to use the built-in column filtering functions Spark has (equalTo, isin, lt, gt, etc.). Is there a way I can use an SQL statement instead of those built-in functions?

The problem is that I'm given a text file with SQL conditions, like WHERE ID > 5 or WHERE AGE != 50, etc. Then I have to label values based on those conditions. I thought of following the withColumn() approach, but I can't plug an SQL statement into that function. Any idea how I can get around this?
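For illustration, conditions of the simple form found in such a text file can be evaluated outside Spark as well. Here is a minimal pure-Python sketch (the parsing, the supported operators, and the column names are assumptions for the sketch, not part of any Spark API):

```python
import operator
import re

# Minimal evaluator for conditions like "ID > 5" or "AGE != 50".
# Assumes the form "<column> <op> <number>"; real SQL is far richer.
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge,
       "<=": operator.le, "=": operator.eq, "!=": operator.ne}

def matches(row, condition):
    # Longer operators come first in the alternation so ">=" isn't read as ">".
    column, op, value = re.match(
        r"(\w+)\s*(>=|<=|!=|>|<|=)\s*(\S+)", condition).groups()
    return OPS[op](row[column], float(value))

rows = [{"ID": 3, "AGE": 50}, {"ID": 7, "AGE": 42}]
# Label each row according to one condition string.
labeled = [{**r, "LABEL": matches(r, "ID > 5")} for r in rows]
```

This only shows that the condition strings are mechanically applicable to rows; with Spark you would instead pass such a string to a DataFrame/Dataset operation that accepts SQL expressions.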

Guillermo Herrera asked Aug 08 '17


1 Answer

I found a way to go around this:

You want to split your dataset into two sets: the values you want to update and the values you don't.

Dataset<Row> valuesToUpdate = dataset.filter("conditionToFilterValues");
Dataset<Row> valuesNotToUpdate = dataset.except(valuesToUpdate);

valuesToUpdate = valuesToUpdate.withColumn("updatedColumn", lit("updateValue"));

Dataset<Row> updatedDataset = valuesNotToUpdate.union(valuesToUpdate);

This, however, doesn't keep the same order of records as the original dataset, so if order matters to you, this approach won't meet your needs.
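The three steps above can be mimicked without a cluster. Here is a plain-Python sketch of the same split / update / union idea (list comprehensions standing in for filter, except, and union; the column names and condition are made up for the sketch), which also shows why the row order changes:

```python
# Stand-in for a Dataset of rows; STATUS is the column we "update".
dataset = [{"ID": 1, "STATUS": "old"}, {"ID": 6, "STATUS": "old"},
           {"ID": 2, "STATUS": "old"}, {"ID": 9, "STATUS": "old"}]

# 1. filter: rows matching the condition (here: ID > 5)
values_to_update = [r for r in dataset if r["ID"] > 5]
# 2. except / subtract: everything else
values_not_to_update = [r for r in dataset if r not in values_to_update]
# 3. withColumn + union: overwrite the column, then recombine
values_to_update = [{**r, "STATUS": "updated"} for r in values_to_update]
updated_dataset = values_not_to_update + values_to_update
# Note the reordering: IDs come out as 1, 2, 6, 9 instead of 1, 6, 2, 9.
```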

In PySpark, you have to use .subtract() instead of .except().

Guillermo Herrera answered Nov 01 '22