So as far as I know, Apache Spark doesn't have functionality that imitates SQL's UPDATE command, i.e. changing a single value in a column given a certain condition. The only workaround is the command I was instructed to use (here on Stack Overflow): withColumn(columnName, where('condition', value));
However, the condition has to be of Column type, meaning I have to use Spark's built-in column filtering functions (equalTo, isin, lt, gt, etc.). Is there a way to use an SQL statement instead of those built-in functions?
The problem is I'm given a text file with SQL statements, like WHERE ID > 5 or WHERE AGE != 50, etc. Then I have to label values based on those conditions. I thought of following the withColumn() approach, but I can't plug an SQL statement into that function. Any idea how I can get around this?
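Note: Spark can actually parse a raw SQL predicate for you: Dataset.filter(String) accepts a condition string, and functions.expr(String) turns one into a Column you can hand to when(). A minimal sketch of labeling rows that way, after stripping the leading WHERE from the file's condition (the condition, column, and label names here are assumptions, not from the question):

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

String condition = "ID > 5"; // hypothetical predicate read from the text file
// Rows matching the predicate get one label, everything else another.
Dataset<Row> labeled = dataset.withColumn("label",
        when(expr(condition), lit("matched")).otherwise(lit("unmatched")));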
I found a way to get around this:
Split your dataset into two sets: the values you want to update and the values you don't want to update.
import static org.apache.spark.sql.functions.lit;

// Split the dataset on the condition, update only the matching rows, then recombine.
Dataset<Row> valuesToUpdate = dataset.filter("conditionToFilterValues");
Dataset<Row> valuesNotToUpdate = dataset.except(valuesToUpdate);
valuesToUpdate = valuesToUpdate.withColumn("updatedColumn", lit("updateValue"));
Dataset<Row> updatedDataset = valuesNotToUpdate.union(valuesToUpdate);
This, however, doesn't keep the records in the same order as the original dataset, so if order matters to you, this approach won't suffice on its own.
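If order does matter, one possible workaround (a sketch, assuming the dataset has no existing _row_id column and the condition shown is hypothetical) is to tag each row with monotonically_increasing_id() before splitting, then sort on that tag after the union:

import static org.apache.spark.sql.functions.*;

// Tag rows first; the generated ids grow with partition and row position,
// so sorting by them afterwards should restore the original order.
Dataset<Row> indexed = dataset.withColumn("_row_id", monotonically_increasing_id());
Dataset<Row> toUpdate = indexed.filter("ID > 5"); // hypothetical condition
Dataset<Row> untouched = indexed.except(toUpdate);
toUpdate = toUpdate.withColumn("updatedColumn", lit("updateValue"));
Dataset<Row> restored = untouched.union(toUpdate).orderBy("_row_id").drop("_row_id");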
In PySpark, you have to use .subtract() instead of .except().