I have written a program that writes items into a DynamoDB table. Now I would like to read all items from that table using PySpark. Are there any libraries available to do this in Spark?
The simplest way for Spark to interact with DynamoDB is through a connector that implements Hadoop's InputFormat/OutputFormat interfaces. Amazon EMR provides such a connector as part of emr-ddb-hadoop (open-sourced as awslabs/emr-dynamodb-connector).
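On an EMR cluster with the connector jar on the classpath (e.g. passed via `--jars /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar`), a minimal PySpark read might look like the sketch below. The table name, region, and endpoint are placeholder assumptions you would replace with your own:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Placeholder settings -- adjust the table name, endpoint, and region to yours.
conf = {
    "dynamodb.servicename": "dynamodb",
    "dynamodb.input.tableName": "MyTable",
    "dynamodb.endpoint": "dynamodb.us-east-1.amazonaws.com",
    "dynamodb.regionid": "us-east-1",
}

# Each record is a (Text, DynamoDBItemWritable) pair produced by the connector.
rdd = sc.hadoopRDD(
    inputFormatClass="org.apache.hadoop.dynamodb.read.DynamoDBInputFormat",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.dynamodb.DynamoDBItemWritable",
    conf=conf,
)

print(rdd.count())
```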
You can use the parallel scans available in the DynamoDB API through boto3, together with a scheme like the parallel S3 file processing application written for PySpark described here. Basically, instead of reading all the keys a priori, just create a list of segment numbers and hard-code the maximum number of scan segments in the map_func function for Spark.
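As a rough illustration of that idea, here is a minimal sketch in which `scan_segment` plays the role of the `map_func` mentioned above. The table name, region, and segment count are placeholder assumptions, and boto3 must be installed on the executors:

```python
import boto3
from pyspark.sql import SparkSession

TOTAL_SEGMENTS = 8      # hard-coded max number of scan segments (assumption; tune it)
TABLE_NAME = "MyTable"  # placeholder table name
REGION = "us-east-1"    # placeholder region

def scan_segment(segment):
    """Runs on an executor: scans one segment of the table, following pagination."""
    table = boto3.resource("dynamodb", region_name=REGION).Table(TABLE_NAME)
    kwargs = {"Segment": segment, "TotalSegments": TOTAL_SEGMENTS}
    while True:
        resp = table.scan(**kwargs)
        for item in resp["Items"]:
            yield item
        if "LastEvaluatedKey" not in resp:
            break
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]

spark = SparkSession.builder.appName("ddb-parallel-scan").getOrCreate()

# One Spark task per DynamoDB scan segment; flatMap flattens the per-segment items.
items = (spark.sparkContext
         .parallelize(range(TOTAL_SEGMENTS), TOTAL_SEGMENTS)
         .flatMap(scan_segment))

print(items.count())
```

Creating the boto3 resource inside `scan_segment` is deliberate: boto3 clients are not serializable, so each Spark task builds its own connection.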