I'm struggling a bit with understanding Spark and writing DataFrames to a MySQL database. I have the following code:
import datetime

forecastDict = {'uuid': u'8df34d5a-ce02-4d02-b282-e10363690122', 'created_at': datetime.datetime(2014, 12, 31, 23, 0)}
forecastFrame = sqlContext.createDataFrame([forecastDict])
forecastFrame.write.jdbc(url="jdbc:mysql://example.com/example_db?user=bla&password=blabal123", table="example_table", mode="append")
The last line in the code throws the following error:
Incorrect datetime value: '8df34d5a-ce02-4d02-b282-e10363690122' for column 'created_at' at row 1
I can post the entire stack trace if necessary, but basically what's happening here is that PySpark is mapping the uuid field to the wrong column in MySQL. Here's the MySQL table definition:
mysql> show create table example_table;
...
CREATE TABLE `example_table` (
`uuid` varchar(36) NOT NULL,
`created_at` datetime NOT NULL,
PRIMARY KEY (`uuid`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
...
If we change the mysql definition to the following (notice that only the order of the columns is different):
CREATE TABLE `example_table` (
`created_at` datetime NOT NULL,
`uuid` varchar(36) NOT NULL,
PRIMARY KEY (`uuid`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The insert works fine. Is there a way to implement this without depending on the order of the columns, or what's the preferred way of saving data from Spark to an external relational database?
Thanks!
--chris
I would simply force the expected column order on write:
url = ...
table = ...
# Read back the target table's column order and reorder the DataFrame to match
columns = (sqlContext.read.format('jdbc')
           .options(url=url, dbtable=table)
           .load()
           .columns)  # DataFrame.columns is a property, not a method

forecastFrame.select(*columns).write.jdbc(url=url, table=table, mode='append')
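If the target schema is known up front (as it is in your table definition), a simpler variant is to hard-code the column order and skip the extra read; a minimal sketch using the column names from your question:

forecastFrame.select('uuid', 'created_at').write.jdbc(url=url, table=table, mode='append')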
Also, be careful with schema inference on dictionaries: passing a list of dicts to createDataFrame is not only deprecated but also rather unstable; Spark's deprecation warning suggests using pyspark.sql.Row or an explicit schema instead.
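For example, declaring the schema explicitly removes inference entirely and fixes the column order to whatever you declare; a minimal sketch reusing the column names and types from your table:

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Explicit schema: names, types, and column order are fixed up front,
# so nothing depends on dictionary key ordering or inference.
schema = StructType([
    StructField('uuid', StringType(), False),
    StructField('created_at', TimestampType(), False),
])

forecastFrame = sqlContext.createDataFrame(
    [(forecastDict['uuid'], forecastDict['created_at'])],
    schema,
)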