
How to run python code on AWS (EC2/Lambda)

This post is a request for basic information and links to help me understand how to run Python code on either Lambda or EC2.

My code structure is pretty simple:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# load more packages

input_data = pd.read_csv(...)

def do_stuff(input, parameters):
    action1
    action2
    output.to_csv(...)
    plt.savefig(...)

do_stuff(input_data, input_parameter)

I need to run this code on AWS, but I am not sure which to use: Lambda or EC2. Also, the input file is on my local PC, and the output gets saved to a specific folder. Do I need to save it to S3? If so, what does the path look like? Do I still use import os?

I'm sorry for this noob-like question. I need some starting guidance on what I should read to get started. The AWS documentation gets technical quickly, and I couldn't understand much from the "Hello World" example on Lambda. Due to the lockdown, I'm unable to use my office desktop, and my personal Mac cannot handle the load. The input and output files are pretty small: cumulatively less than 5 MB (there are multiple input files).

Hemanshu Das asked Oct 26 '22 20:10


2 Answers

If this is something that needs to be done often, you could potentially create a workflow where you:

  1. Upload the input .csv file into an S3 bucket

  2. Your AWS Lambda function listens for changes in the S3 bucket and your code is triggered to run when a new file is uploaded.

  3. Your code saves the output .csv to a second S3 bucket.

The code might look very roughly like this (modified from this example):

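import boto3
import os
import sys
from urllib.parse import unquote_plus

s3_client = boto3.client('s3')

def handle_csv(original_csv_path, output_csv_path):
    # Read the CSV at original_csv_path, process it, and write the result to output_csv_path.
    <process csv code>

def lambda_handler(event, context):
    # One record per uploaded object that triggered this invocation.
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # Object keys are URL-encoded in the event, so decode them first.
        key = unquote_plus(record['s3']['object']['key'])
        # Flatten the key so it can serve as a single local file name.
        tmpkey = key.replace('/', '')
        # In Lambda only /tmp is writable, so both local paths should live there.
        download_path = <insert path here>
        upload_path = <insert path here>
        s3_client.download_file(bucket, key, download_path)
        handle_csv(download_path, upload_path)
        # The second argument names the destination (output) bucket; '<>' is a placeholder.
        s3_client.upload_file(upload_path, '<>'.format(bucket), key)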
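
To try the handler's event parsing without deploying anything, it can help to see roughly what an S3 notification event looks like. The sketch below is only an illustration: `extract_s3_objects` and the sample bucket and key names are hypothetical, and the event is trimmed down to the fields the handler above actually reads.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 notification event.

    Hypothetical helper; it reads the same fields as lambda_handler above.
    """
    pairs = []
    for record in event.get('Records', []):
        bucket = record['s3']['bucket']['name']
        # Object keys arrive URL-encoded in the event (spaces become '+').
        key = unquote_plus(record['s3']['object']['key'])
        pairs.append((bucket, key))
    return pairs


# Trimmed-down sample event; the bucket and key names are made up.
sample_event = {
    'Records': [
        {'s3': {'bucket': {'name': 'my-input-bucket'},
                'object': {'key': 'input_csvs/my+data+1.csv'}}}
    ]
}

if __name__ == '__main__':
    print(extract_s3_objects(sample_event))
```

Passing a dictionary like `sample_event` straight to `lambda_handler` on a local machine is one way to check the parsing logic before wiring up the real S3 trigger.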

A path might look like '<bucket>/<object>', with an object key such as 'input_csvs/1.csv'.
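
On the question of what an S3 path looks like: S3 has no real folders, only a bucket name plus an object key, and the key may contain slashes. Here is a minimal sketch, assuming boto3 is installed and AWS credentials are configured; `split_s3_uri`, `upload_input`, and the bucket name are hypothetical.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/some/key.csv' into ('bucket', 'some/key.csv')."""
    prefix = 's3://'
    if not uri.startswith(prefix):
        raise ValueError('expected a URI starting with s3://')
    bucket, _, key = uri[len(prefix):].partition('/')
    if not bucket or not key:
        raise ValueError('URI needs both a bucket and a key')
    return bucket, key


def upload_input(local_path, uri):
    """Upload a local file to the bucket and key named by an s3:// URI."""
    import boto3  # imported here so split_s3_uri works without boto3
    bucket, key = split_s3_uri(uri)
    boto3.client('s3').upload_file(local_path, bucket, key)


if __name__ == '__main__':
    # Hypothetical bucket name, used only to show the split.
    print(split_s3_uri('s3://my-input-bucket/input_csvs/1.csv'))
```

`os` is still fine for local paths on the machine running the code (for example, a scratch file before upload); S3 itself is addressed by bucket and key rather than by a filesystem path.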

Adi Dembak answered Nov 15 '22 06:11


This is the OP here, and though I don't have a good answer, I can summarize what I've figured out.

Lambda: I found a helpful YouTube video to understand how to get Lambda working. Also, to use Python packages such as numpy and pandas, you'll need to add a Lambda layer. I was able to do it by going through this Medium post. But I hadn't completely figured out how to connect my input CSV files and export my output CSV file. I stopped dead in my tracks when I realized Lambda can run for a maximum of 15 minutes per invocation. My Markov simulation code takes 24 hours, so Lambda was out of the question, and I didn't pursue it further. (P.S.: I read later that there are some "complicated" ways to make it work, but nah, I wasn't even clear on how Lambda services get charged.)
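
As a back-of-the-envelope check of why the time limit rules Lambda out here, the sketch below counts the minimum number of back-to-back invocations a long job would need. The 900-second figure is Lambda's maximum configurable timeout; the function name is hypothetical, and the count ignores the extra work of saving and restoring state between invocations.

```python
import math

LAMBDA_MAX_TIMEOUT_S = 15 * 60  # 900 seconds


def min_invocations(total_runtime_s, timeout_s=LAMBDA_MAX_TIMEOUT_S):
    """Lower bound on the invocations needed to cover total_runtime_s."""
    return math.ceil(total_runtime_s / timeout_s)


if __name__ == '__main__':
    # A 24-hour simulation, expressed in seconds.
    print(min_invocations(24 * 60 * 60))
```

For a 24-hour run this comes to at least 96 chained invocations, which is why a single long-lived EC2 instance is the simpler fit.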

EC2: A couple of resources helped a lot with running my code on an EC2 AWS Linux server. A Medium post on running a Jupyter server was the most helpful, and then I switched to using python and conda in the terminal itself, thanks to another helpful Medium post. Further, I'm using the Dropbox API and Python package to push my output files to the cloud from the code running on EC2.
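
The Dropbox step mentioned above might look roughly like this. It is a sketch rather than my actual code: it assumes the official `dropbox` Python package is installed and that an access token is available in an environment variable. The variable name `DROPBOX_TOKEN`, the helper names, and the folder names are all hypothetical.

```python
import os
import posixpath


def dropbox_destination(local_path, remote_folder):
    """Build the Dropbox path for a local file; Dropbox paths start with '/'."""
    filename = os.path.basename(local_path)
    return posixpath.join('/', remote_folder.strip('/'), filename)


def push_to_dropbox(local_path, remote_folder):
    """Upload one output file, overwriting any existing copy."""
    import dropbox  # imported here so the path helper works without the package
    dbx = dropbox.Dropbox(os.environ['DROPBOX_TOKEN'])  # hypothetical variable name
    with open(local_path, 'rb') as f:
        dbx.files_upload(f.read(),
                         dropbox_destination(local_path, remote_folder),
                         mode=dropbox.files.WriteMode.overwrite)


if __name__ == '__main__':
    # Hypothetical file and folder names.
    print(dropbox_destination('results/output.csv', 'ec2_runs'))
```

Calling `push_to_dropbox` at the end of the simulation script means the results leave the instance automatically, so the EC2 server can be stopped once the run finishes.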

TL;DR: Lambda won't work for me, and EC2 worked, largely thanks to a Medium post. Also, I need to understand how CLI code works to get a better grasp of how things work.

Hemanshu Das answered Nov 15 '22 06:11