Connecting DynamoDB from Spark program to load all items from one table using Python?

I have written a program to write items into DynamoDB table. Now I would like to read all items from the DynamoDB table using PySpark. Are there any libraries available to do this in Spark?

asked Feb 04 '16 by sms_1190

People also ask

Can Spark read from DynamoDB?

The simplest way for Spark to interact with DynamoDB is to build a connector that talks to DynamoDB by implementing the simple Hadoop interfaces. Amazon EMR provides an implementation of this connector as part of emr-hadoop-ddb.
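On an EMR cluster with that connector on the classpath, the input format can be used from PySpark through the Hadoop RDD API. The sketch below is based on AWS's published examples and is not from the original post; the table name and region are placeholders, and it only runs on EMR where the connector jar is available.

```python
# Hedged sketch: reading a DynamoDB table from PySpark on EMR via the
# emr-dynamodb-connector Hadoop input format. "my-table" and "us-east-1"
# are placeholders, not values from the original question.
conf = {
    "dynamodb.servicename": "dynamodb",
    "dynamodb.input.tableName": "my-table",   # placeholder table name
    "dynamodb.regionid": "us-east-1",         # placeholder region
    "mapred.input.format.class":
        "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat",
}

def read_table(sc):
    """Return an RDD of (Text, DynamoDBItemWritable) pairs for the table."""
    return sc.hadoopRDD(
        inputFormatClass="org.apache.hadoop.dynamodb.read.DynamoDBInputFormat",
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.dynamodb.DynamoDBItemWritable",
        conf=conf,
    )

if __name__ == "__main__":
    # Only meaningful on an EMR cluster with the connector installed.
    from pyspark import SparkContext
    sc = SparkContext(appName="ddb-emr-read")
    print(read_table(sc).count())
```

Each RDD value is a `DynamoDBItemWritable` whose `getItem()` returns the raw attribute map, so a `map` step is usually needed to convert items into plain Python dicts before further processing.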

Can we perform join in DynamoDB?

No. Joins are resource-intensive queries that do not scale well as your database grows in size. Therefore, DynamoDB does not allow "join" queries. However, it is possible to perform joins on DynamoDB tables via external services such as Apache Hive and Amazon EMR.


1 Answer

You can use parallel scans, available in the DynamoDB API through boto3, together with a scheme like the parallel S3 file processing application for PySpark described here. Instead of reading all the keys a priori, create a list of segment numbers and hard-code the maximum number of scan segments in the map_func function for Spark.
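The approach above can be sketched as follows. This is an illustrative example, not the answerer's code: the table name, region, and segment count are assumptions, and it requires AWS credentials plus a running Spark context.

```python
# Hedged sketch of a parallel DynamoDB scan in PySpark: parallelize the
# segment numbers and let each Spark task scan its own slice of the table.
# "my-table" and "us-east-1" are placeholders.
TOTAL_SEGMENTS = 4  # hard-coded maximum number of scan segments

def scan_segment(segment):
    """Scan one DynamoDB segment and return its items (runs on an executor)."""
    import boto3  # imported inside the function so it resolves on the executor
    table = boto3.resource("dynamodb", region_name="us-east-1").Table("my-table")
    items = []
    kwargs = {"Segment": segment, "TotalSegments": TOTAL_SEGMENTS}
    while True:
        response = table.scan(**kwargs)
        items.extend(response.get("Items", []))
        # Keep paginating until DynamoDB stops returning a continuation key.
        if "LastEvaluatedKey" not in response:
            break
        kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
    return items

if __name__ == "__main__":
    from pyspark import SparkContext
    sc = SparkContext(appName="ddb-parallel-scan")
    # One partition per segment, so the segments are scanned concurrently.
    rdd = sc.parallelize(range(TOTAL_SEGMENTS), TOTAL_SEGMENTS) \
            .flatMap(scan_segment)
    print(rdd.count())
```

Raising `TOTAL_SEGMENTS` (and the matching partition count) increases scan parallelism, but also the read-capacity consumed, so size it against the table's provisioned throughput.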

answered Sep 24 '22 by Alexander Patrikalakis