
AWS lambda memory usage with temporary files in python code

Does data written to temporary files contribute to memory usage in AWS lambda? In a lambda function, I'm streaming a file to a temporary file. In the lambda logs, I see that the max memory used is larger than the file that was downloaded. Strangely, if the lambda is invoked multiple times in quick succession the invocations that downloaded smaller files still report the max memory used from the invocation that downloaded the larger file. I have the concurrency limit set to 2.

When I run the code locally, my memory usage is as expected, around 20MB. On Lambda it is 180MB, which is about the size of the file that is streamed. The code simply uses the Python requests library to stream a file download, shutil.copyfileobj() to write it to a tempfile.TemporaryFile(), and then pipes that file to the Postgres COPY FROM STDIN command.

This makes it seem like /tmp storage counts towards memory usage, but I haven't found any mention of this. The only mention of /tmp in the Lambda documentation is that there is a 512 MB limit.

Example code:

import sys
import json
import os
import io
import re
import traceback
import shutil
import tempfile

import boto3
import psycopg2
import requests


def handler(event, context):
    try:
        import_data(event["report_id"])
    except Exception as e:
        notify_failed(e, event)
        raise

def import_data(report_id):
    token = get_token()
    conn = psycopg2.connect(POSTGRES_DSN, connect_timeout=30)
    cur = conn.cursor()

    metadata = load_metadata(report_id, token)
    table = ensure_table(metadata, cur, REPLACE_TABLE)
    conn.commit()
    print(f"report {report_id}: downloading")
    with download_report(report_id, token) as f:
        print(f"report {report_id}: importing data")
        with conn, cur:
            cur.copy_expert(f"COPY {table} FROM STDIN WITH CSV HEADER", f)
        print(f"report {report_id}: data import complete")
    conn.close()


def download_report(report_id, token):
    url = f"https://some_url"
    params = {"includeHeader": True}
    headers = {"authorization": f"Bearer {token['access_token']}"}

    with requests.get(url, params=params, headers=headers, stream=True) as r:
        r.raise_for_status()
        tmp = tempfile.TemporaryFile()
        print("streaming contents to temporary file")
        shutil.copyfileobj(r.raw, tmp)
        tmp.seek(0)
        return tmp


if __name__ == "__main__":
    if len(sys.argv) > 1:
        handler({"report_id": sys.argv[1]}, None)

UPDATE: After changing the code to skip the temporary file and instead stream the download directly into the Postgres COPY command, the memory usage was back to normal. This makes me think the /tmp directory is contributing to the logged memory usage.
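
For reference, here is a minimal sketch of that change, assuming the same imports and connection setup as the example code above (the function name and arguments are illustrative, not the original code):

def import_report_streaming(report_id, token, conn, table):
    url = f"https://some_url"
    params = {"includeHeader": True}
    headers = {"authorization": f"Bearer {token['access_token']}"}
    with requests.get(url, params=params, headers=headers, stream=True) as r:
        r.raise_for_status()
        r.raw.decode_content = True  # transparently handle gzip/deflate encodings
        with conn, conn.cursor() as cur:
            # copy_expert() only needs a readable file-like object, so the raw
            # response body can be handed to it directly -- nothing touches /tmp.
            cur.copy_expert(f"COPY {table} FROM STDIN WITH CSV HEADER", r.raw)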


2 Answers

Update

Note: To answer this question, I used Lambdash, although I had to change the Lambda runtime it uses to node8.10. Lambdash is a small library that lets you run shell commands on a Lambda from your local terminal.

The /tmp directory on AWS Lambda is mounted as a loop device. You can verify this (after following the setup instructions for lambdash) by running the following command:

./lambdash df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       30G  4.0G   26G  14% /
/dev/loop0      526M  872K  514M   1% /tmp
/dev/loop1      6.5M  6.5M     0 100% /var/task
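
If you would rather not set up lambdash, a similar check can be done from a plain Python handler by shelling out to df; a rough sketch (the handler name and return shape are assumptions, not part of lambdash):

import subprocess

def handler(event, context):
    # Run df inside the Lambda execution environment to see how /tmp is
    # mounted (it shows up as a /dev/loopN device, matching the output above).
    result = subprocess.run(["df", "-h"], capture_output=True, text=True)
    print(result.stdout)
    return {"df": result.stdout}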

According to https://unix.stackexchange.com/questions/278647/overhead-of-using-loop-mounted-images-under-linux,

data accessed through the loop device has to go through two filesystem layers, each doing its own caching so data ends up cached twice, wasting much memory (the infamous "double cache" issue)

However, my guess is that /tmp is actually kept in memory. To test this, I ran the following commands:

./lambdash df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       30G  4.0G   26G  14% /
/dev/loop0      526M  1.9M  513M   1% /tmp
/dev/loop1      6.5M  6.5M     0 100% /var/task

./lambdash dd if=/dev/zero of=/tmp/file.txt count=409600 bs=1024
409600+0 records in
409600+0 records out
419430400 bytes (419 MB) copied, 1.39277 s, 301 MB/s

./lambdash df -h
 Filesystem      Size  Used Avail Use% Mounted on
 /dev/xvda1       30G  4.8G   25G  17% /
 /dev/loop2      526M  401M  114M  78% /tmp
 /dev/loop3      6.5M  6.5M     0 100% /var/task

./lambdash df -h
 Filesystem      Size  Used Avail Use% Mounted on
 /dev/xvda1       30G  4.8G   25G  17% /
 /dev/loop2      526M  401M  114M  78% /tmp
 /dev/loop3      6.5M  6.5M     0 100% /var/task

Keep in mind, each time I ran it, the lambda was executed. Below is the output from the Lambda's Cloudwatch logs:

07:06:30 START RequestId: 4143f502-14a6-11e9-bce4-eff8b92bf218 Version: $LATEST
07:06:30 END RequestId: 4143f502-14a6-11e9-bce4-eff8b92bf218
07:06:30 REPORT RequestId: 4143f502-14a6-11e9-bce4-eff8b92bf218 Duration: 3.60 ms Billed Duration: 100 ms Memory Size: 1536 MB Max Memory Used: 30 MB

07:06:32 START RequestId: 429eca30-14a6-11e9-9b0b-edfabd15c79f Version: $LATEST
07:06:34 END RequestId: 429eca30-14a6-11e9-9b0b-edfabd15c79f
07:06:34 REPORT RequestId: 429eca30-14a6-11e9-9b0b-edfabd15c79f Duration: 1396.29 ms Billed Duration: 1400 ms Memory Size: 1536 MB Max Memory Used: 430 MB

07:06:36 START RequestId: 44a03f03-14a6-11e9-83cf-f375e336ed87 Version: $LATEST
07:06:36 END RequestId: 44a03f03-14a6-11e9-83cf-f375e336ed87
07:06:36 REPORT RequestId: 44a03f03-14a6-11e9-83cf-f375e336ed87 Duration: 3.69 ms Billed Duration: 100 ms Memory Size: 1536 MB Max Memory Used: 431 MB

07:06:38 START RequestId: 4606381a-14a6-11e9-a32d-2956620824ab Version: $LATEST
07:06:38 END RequestId: 4606381a-14a6-11e9-a32d-2956620824ab
07:06:38 REPORT RequestId: 4606381a-14a6-11e9-a32d-2956620824ab Duration: 3.63 ms Billed Duration: 100 ms Memory Size: 1536 MB Max Memory Used: 431 MB

What happened and what does this mean?

The Lambda was executed 4 times. On the first execution, I displayed mounted devices. On the second execution, I populated a file in the /tmp directory, using 401MB of the roughly 512MB available. On the subsequent executions, I listed mounted devices, displaying their available space.

The memory utilization on the first execution was 30MB. The memory utilization for the subsequent executions was in the 400MB range.

This confirms that /tmp utilization does in fact contribute to memory utilization.
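
A lambdash-free way to reproduce the same experiment is a small Python handler that fills /tmp and lets CloudWatch report the effect (the file name and event key are made up for illustration):

import os

def handler(event, context):
    size_mb = int(event.get("size_mb", 400))   # how much to write to /tmp
    chunk = b"\0" * (1024 * 1024)              # 1 MiB of zeros
    with open("/tmp/file.bin", "wb") as f:
        for _ in range(size_mb):
            f.write(chunk)
    # Invoke once, then invoke again with size_mb=0 on the same warm container
    # and compare the "Max Memory Used" values in the two REPORT lines.
    return {"tmp_bytes": os.path.getsize("/tmp/file.bin")}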

Original Answer

My guess is that what you are observing is Python, or the Lambda container itself, buffering the file in memory during write operations.

According to https://docs.python.org/3/library/functions.html#open,

buffering is an optional integer used to set the buffering policy. Pass 0 to switch buffering off (only allowed in binary mode), 1 to select line buffering (only usable in text mode), and an integer > 1 to indicate the size in bytes of a fixed-size chunk buffer. When no buffering argument is given, the default buffering policy works as follows:

Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device’s “block size” and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long. “Interactive” text files (files for which isatty() returns True) use line buffering. Other text files use the policy described above for binary files.

The tempfile.TemporaryFile() function has a keyword parameter, buffering, which is basically passed directly into the open call described above.

So my guess is that the tempfile.TemporaryFile() function uses the default open() function's buffering setting. You might try something like tempfile.TemporaryFile(buffering=0) to disable buffering, or tempfile.TemporaryFile(buffering=512) to explicitly set the maximum amount of memory that will be utilized while writing data to a file.
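
As a rough sketch of that suggestion applied to the question's download helper (the URL and chunk size are arbitrary, and this is untested on Lambda):

import shutil
import tempfile
import requests

def download_to_tmp(url):
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        # buffering=0 disables Python's write buffer; it is only allowed here
        # because TemporaryFile opens in binary mode ("w+b") by default.
        tmp = tempfile.TemporaryFile(buffering=0)
        shutil.copyfileobj(r.raw, tmp, length=64 * 1024)  # copy in 64 KiB chunks
        tmp.seek(0)
        return tmp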


Usage of /tmp does not count towards memory usage. The only case where the two could be correlated is when you read the file's contents into memory.
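
A tiny illustration of the distinction this answer draws, with made-up sizes and paths: a file sitting in /tmp is disk usage, but reading its contents into a Python object is what shows up as process memory.

import os
import tempfile

fd, path = tempfile.mkstemp(dir="/tmp")
with os.fdopen(fd, "wb") as f:
    f.write(b"x" * (10 * 1024 * 1024))   # a 10 MB file written to /tmp

with open(path, "rb") as f:
    data = f.read()                      # now 10 MB also lives in process memory

os.remove(path)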
