 

Python download large csv file from a url line by line for only 10 entries

Tags:

python

csv

I have a large CSV file from a client, shared via a URL for download, and I want to download it line by line (or in chunks of bytes), limited to only the first 10 entries.

I have the following code which downloads the file, but I only want the first 10 entries from it, not the whole file.

#!/usr/bin/env python
import requests
from contextlib import closing
import csv

url = "https://example.com.au/catalog/food-catalog.csv"

with closing(requests.get(url, stream=True)) as r:
    f = (line.decode('utf-8') for line in r.iter_lines())
    reader = csv.reader(f, delimiter=',', quotechar='"')
    for row in reader:
        print(row)

I don't know much about contextlib or how it works with the with statement in Python.

Can anyone help me here? It would be really helpful. Thanks in advance.

asked Dec 17 '18 by chethi


People also ask

How do I read a 10 GB CSV file in Python?

Use read_csv(chunksize=...). One way to process large files is to read the entries in chunks of a reasonable size: each chunk is read into memory and processed before the next chunk is read. The chunksize parameter specifies the size of each chunk as a number of lines.
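A minimal sketch of that idea, assuming a local file named large-file.csv (the file name and chunk size here are placeholders):

import pandas as pd

# chunksize returns an iterator of DataFrames, each holding up to 100,000 rows,
# so only one chunk is in memory at a time
for chunk in pd.read_csv("large-file.csv", chunksize=100_000):
    print(chunk.shape)  # each chunk is an ordinary DataFrame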


3 Answers

The issue is not so much with contextlib as with generators. When your with block ends, the connection will be closed, fairly straightforwardly.

The part that actually does the download is for row in reader:, since reader is wrapped around f, which is a lazy generator. Each iteration of the loop will actually read a line from the stream, possibly with some internal buffering by Python.

The key then is to stop the loop after 10 lines. There are a couple of simple ways of doing that:

for count, row in enumerate(reader, start=1):
    print(row)

    if count == 10:
        break

Or

from itertools import islice

...

for row in islice(reader, 0, 10):
    print(row)
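Putting that together with the streaming code from the question, a minimal sketch might look like this (using the same example URL):

#!/usr/bin/env python
import csv
from contextlib import closing
from itertools import islice

import requests

url = "https://example.com.au/catalog/food-catalog.csv"

with closing(requests.get(url, stream=True)) as r:
    # Decode the raw byte stream lazily, one line at a time
    lines = (line.decode('utf-8') for line in r.iter_lines())
    reader = csv.reader(lines, delimiter=',', quotechar='"')
    # Only the first 10 rows are ever pulled from the stream
    for row in islice(reader, 10):
        print(row)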
answered Oct 21 '22 by Mad Physicist


Pandas is another possible approach:

import pandas as pd

# Create a DataFrame from your original CSV, with "," as the separator,
# limiting the read to the first 10 rows and decoding it as UTF-8
your_csv = pd.read_csv("https://example.com.au/catalog/food-catalog.csv", sep=',', nrows=10, encoding='utf-8')

# You can now print it:
print(your_csv)

# And even save it (filePath is your output path):
your_csv.to_csv(filePath, sep=',', encoding='utf-8')
answered Oct 21 '22 by Pedro Martins de Souza


You can generalize the idea by making a generator that yields the next n lines on every call. The grouper recipe from the itertools documentation is useful for things like this.

import requests
import itertools
import csv
import contextlib

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)

def stream_csv_download(chunk_size):
    url = 'https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2017-financial-year-provisional/Download-data/annual-enterprise-survey-2017-financial-year-provisional-csv.csv'
    with contextlib.closing(requests.get(url, stream=True)) as stream:
        # Lazily decode the response one line at a time
        lines = (line.decode('utf-8') for line in stream.iter_lines())
        reader = csv.reader(lines, delimiter=',', quotechar='"')
        chunker = grouper(reader, chunk_size, None)
        while True:
            try:
                # Drop the None padding that grouper adds to the final, shorter chunk
                yield [row for row in next(chunker) if row is not None]
            except StopIteration:
                return

csv_file = stream_csv_download(10)
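For example, to pull just the first chunk of 10 parsed rows from the generator above (first_ten_rows is a name introduced here for illustration):

# Nothing is downloaded until a chunk is actually requested from the generator
first_ten_rows = next(csv_file)
print(first_ten_rows)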

This does buffer some amount of data, since the calls return quickly, but I don't think it downloads the entire file. I'll have to test with a large file.

answered Oct 21 '22 by Austin Mackillop