 

download file using s3fs

I am trying to download a CSV file from an S3 bucket using the s3fs library. I have noticed that reading the file with pandas and writing it back out as a new CSV alters the data in some way, so I want to download the file directly, in its raw state.

The documentation has a download function but I do not understand how to use it:

download(self, rpath, lpath[, recursive]): Alias of FilesystemSpec.get.

Here's what I tried:

import pandas as pd
import datetime
import os
import s3fs
import numpy as np

#Creds for s3
fs = s3fs.S3FileSystem(key=mykey, secret=mysecretkey)
bucket = "s3://mys3bucket/mys3bucket"
files = fs.ls(bucket)[-3:]


#download files:
for file in files:
    with fs.open(file) as f:
        fs.download(f,"test.csv")

This fails with: AttributeError: 'S3File' object has no attribute 'rstrip'
Jacky asked Jul 21 '20

People also ask

How do I download files from AWS?

In the web client, do one of the following: select the check boxes next to the files or folders that you want to download, then open the Actions menu and choose Download; or open the file or folder, then open the Actions menu and choose Download.

What is s3fs used for?

s3fs is a FUSE filesystem that allows you to mount an Amazon S3 bucket as a local filesystem. It stores files natively and transparently in S3 (i.e., you can use other programs to access the same files). Maximum file size=64GB (limited by s3fs, not Amazon).

How do I download from S3 bucket?

You can download an object from an S3 bucket in any of the following ways: select the object and choose Download, or choose Download as from the Actions menu if you want to download the object to a specific folder. If you want to download a specific version of the object, select the Show versions button.

What is Python s3fs?

S3Fs is a Pythonic file interface to S3. It builds on top of botocore. The top-level class S3FileSystem holds connection information and allows typical file-system style operations like cp , mv , ls , du , glob , etc., as well as put/get of local files to/from S3.


2 Answers

Pass the S3 key (the path string returned by fs.ls) to fs.download instead of an open file object:

for file in files:
    fs.download(file, 'test.csv')
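The AttributeError in the question comes from handing fs.download an open S3File object where it expects a key string: fsspec normalizes its rpath argument with string methods such as rstrip, which file objects don't have. A minimal sketch of the mismatch (FakeS3File is a hypothetical stand-in, not the real s3fs class):

```python
# fs.download(rpath, lpath) expects rpath to be a key string;
# fsspec normalizes it with string methods like rstrip("/").
class FakeS3File:
    """Hypothetical stand-in for an open s3fs.S3File object."""
    pass

key = "mys3bucket/mys3bucket/data.csv"

print(key.rstrip("/"))                   # strings support rstrip
print(hasattr(FakeS3File(), "rstrip"))   # file objects do not -> AttributeError
```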

Modified to download all files in the directory:

import pandas as pd
import datetime
import os
import s3fs
import numpy as np

#Creds for s3
fs = s3fs.S3FileSystem(key=mykey, secret=mysecretkey)
bucket = "s3://mys3bucket/mys3bucket"

#files references the entire bucket.
files = fs.ls(bucket)

for file in files:
    # use each object's basename so successive downloads don't overwrite one another
    fs.download(file, os.path.basename(file))
Jacky answered Oct 04 '22

I'm going to copy my answer here as well since I used this in a more general case:

# Access Pando
import os
import s3fs

#Blocked out url as "enter url here" for security reasons
fs = s3fs.S3FileSystem(anon=True, client_kwargs={'endpoint_url': "enter url here"})

# List objects in a path and import to array
# [-3:] limits output for testing purposes to prevent memory overload
files = fs.ls('hrrr/sfc/20190101')[-3:]

# Make a staging directory that can hold data as a medium
os.makedirs("Staging", exist_ok=True)

# Copy files into that directory, keeping only each object's filename
for file in files:
    name = str(file).split("/")[-1]
    path = os.path.join("Staging", name)
    print(path)
    fs.download(file, path)
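Since S3 keys always use forward slashes regardless of platform, the filename can also be pulled out with posixpath.basename from the standard library instead of splitting the string by hand (the key below is an assumed example, not a real object from that bucket):

```python
import posixpath

# Assumed example of an HRRR-style S3 key
key = "hrrr/sfc/20190101/hrrr.t00z.wrfsfcf00.grib2"

# basename on a POSIX-style path is equivalent to key.split("/")[-1]
name = posixpath.basename(key)
print(name)
```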

Note that the documentation is fairly barren for this particular Python package. I was able to find some documentation on the arguments s3fs takes here (https://readthedocs.org/projects/s3fs/downloads/pdf/latest/). The full argument list is toward the end, though the parameters aren't described. Here's a general guide for fs.download:

-arg1 (rpath) is the remote source path, i.e. where the files come from. As in both answers above, the easiest way to obtain it is to run fs.ls on your S3 bucket and save the result to a variable.

-arg2 (lpath) is the local destination directory and file name. Note that without a valid output path (a string), this raises the AttributeError the OP got. I have this defined as the path variable.

-arg3 (recursive) is an optional flag to perform the download recursively.
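Because fs.download is just an alias of fsspec's get, the three arguments can be exercised without S3 credentials against fsspec's in-memory backend, which implements the same interface s3fs does. A sketch under that assumption (the /bucket paths and file names are made up):

```python
import os
import tempfile

import fsspec

# In-memory stand-in for S3; s3fs implements this same fsspec interface.
fs = fsspec.filesystem("memory")
fs.pipe("/bucket/a.csv", b"1,2\n")   # write two small objects
fs.pipe("/bucket/b.csv", b"3,4\n")

local = tempfile.mkdtemp()

# rpath = remote source, lpath = local destination,
# recursive=True copies everything under the prefix in one call
fs.download("/bucket", local, recursive=True)

for root, _, names in os.walk(local):
    for n in sorted(names):
        print(n)
```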

Zach Rieck answered Oct 04 '22