I have a large CSV file from a client, shared via a URL for download. I want to download it line by line (or in bytes), limited to only the first 10 entries.
I have the following code which downloads the file, but I only want the first 10 entries from the file; I don't want the full file.
#!/usr/bin/env python
import requests
from contextlib import closing
import csv

url = "https://example.com.au/catalog/food-catalog.csv"

with closing(requests.get(url, stream=True)) as r:
    f = (line.decode('utf-8') for line in r.iter_lines())
    reader = csv.reader(f, delimiter=',', quotechar='"')

    for row in reader:
        print(row)
I don't know much about contextlib or how it works with with in Python.
Can anyone help me here? It would be really helpful; thanks in advance.
read_csv(chunksize)

One way to process large files is to read the entries in chunks of a reasonable size, so each chunk is read into memory and processed before the next one is read. The chunksize parameter specifies the size of each chunk as a number of lines.
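A minimal sketch of that idea, reusing the example URL from the question (whether the remote file ends up fully downloaded still depends on how the underlying stream is buffered):

import pandas as pd

url = "https://example.com.au/catalog/food-catalog.csv"

# chunksize makes read_csv return an iterator of DataFrames,
# each holding at most 10 rows; taking only the first chunk
# gives the first 10 entries without parsing the whole file
chunks = pd.read_csv(url, chunksize=10)
first_10 = next(chunks)
print(first_10)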
The issue is not so much with contextlib as with generators. When your with block ends, the connection will be closed, fairly straightforwardly.

The part that actually does the download is for row in reader:, since reader is wrapped around f, which is a lazy generator. Each iteration of the loop will actually read a line from the stream, possibly with some internal buffering by Python.
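For what it's worth, contextlib.closing simply arranges for .close() to be called on the response when the with block exits. With a reasonably recent requests (an assumption, since the question doesn't state a version), the response object is itself a context manager, so an equivalent form of the question's code is:

import csv
import requests

url = "https://example.com.au/catalog/food-catalog.csv"

# The response closes itself when the block ends, which is exactly
# what contextlib.closing(...) arranges in the question's code
with requests.get(url, stream=True) as r:
    lines = (line.decode('utf-8') for line in r.iter_lines())
    reader = csv.reader(lines, delimiter=',', quotechar='"')
    for row in reader:
        print(row)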
The key then is to stop the loop after 10 lines. There are a couple of simple ways of doing that:
for count, row in enumerate(reader, start=1):
    print(row)
    if count == 10:
        break
Or
from itertools import islice

...

for row in islice(reader, 0, 10):
    print(row)
Pandas can also be an approach:
import pandas as pd

# Create a DataFrame from your original CSV, with "," as the separator,
# limiting the read to the first 10 rows and reading it as UTF-8 encoded
your_csv = pd.read_csv("https://example.com.au/catalog/food-catalog.csv",
                       sep=',', nrows=10, encoding='utf-8')

# You can now print it:
print(your_csv)

# And even save it (filePath being wherever you want the copy written):
your_csv.to_csv(filePath, sep=',', encoding='utf-8')
You can generalize the idea by making a generator that will yield the next n lines on every call. The grouper recipe from the itertools module is useful for things like this.
import requests
import itertools
import csv
import contextlib

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)

def stream_csv_download(chunk_size):
    url = 'https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2017-financial-year-provisional/Download-data/annual-enterprise-survey-2017-financial-year-provisional-csv.csv'
    with contextlib.closing(requests.get(url, stream=True)) as stream:
        lines = (line.decode('utf-8') for line in stream.iter_lines(chunk_size))
        reader = csv.reader(lines, delimiter=',', quotechar='"')
        chunker = grouper(reader, chunk_size, None)
        while True:
            try:
                yield [line for line in next(chunker)]
            except StopIteration:
                return
csv_file = stream_csv_download(10)
This definitely does buffer some amount of data, since the calls are quick, but I don't think that it is downloading the entire file. I'll have to test with a large file.
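As a usage sketch (not part of the original answer), you can pull just the first chunk of rows from the generator created above:

# csv_file was created above with stream_csv_download(10);
# asking for only the first chunk keeps the read to roughly the
# first 10 rows plus whatever the stream buffers internally
first_chunk = next(csv_file)
for row in first_chunk:
    print(row)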