Fast extraction of chunks of lines from large CSV file

I have a large CSV file full of stock-related data formatted as such:

Ticker Symbol, Date, [some variables...]

So each line starts off with the symbol (like "AMZN"), then the date, then 12 variables related to price or volume on that date. There are about 10,000 different securities represented in this file, with one line for each day each stock has been publicly traded. The file is ordered first alphabetically by ticker symbol and second chronologically by date. The entire file is about 3.3 GB.

The task I want to solve is extracting the most recent n lines of data for a given ticker symbol, relative to the current date. I have code that does this, but based on my observations it seems to take around 8-10 seconds per retrieval on average (all tests have been extracting 100 lines).

I have functions I'd like to run that require me to grab such chunks for hundreds or thousands of symbols, and I would really like to reduce the time. My code is inefficient, but I am not sure how to make it run faster.

First, I have a function called getData:

def getData(symbol, filename):
  out = ["Symbol","Date","Open","High","Low","Close","Volume","Dividend",
         "Split","Adj_Open","Adj_High","Adj_Low","Adj_Close","Adj_Volume"]
  l = len(symbol)
  beforeMatch = True
  with open(filename, 'r') as f:
    for line in f:
        match = checkMatch(symbol, l, line)
        if beforeMatch and match:
            beforeMatch = False
            out.append(formatLineData(line[:-1].split(",")))
        elif not beforeMatch and match:
            out.append(formatLineData(line[:-1].split(",")))
        elif not beforeMatch and not match:
            break
  return out

(This code has a couple of helper functions, checkMatch and formatLineData, which I will show below.) Then, there is another function called getDataColumn that gets the column I want with the correct number of days represented:

def getDataColumn(symbol, filename, col=12, numDays=100, changeRateTransform=False):
  dataset = getData(symbol, filename)
  if not changeRateTransform:
    column = [day[col] for day in dataset[-numDays:]]
  else:
    n = len(dataset)
    column = [(dataset[i][col] - dataset[i-1][col])/dataset[i-1][col] for i in range(n - numDays, n)]
  return column

(changeRateTransform converts raw numbers into daily change rate numbers if True.) The helper functions:

def checkMatch(symbol, symbolLength, line):
  out = False
  if line[:symbolLength+1] == symbol + ",":
    out = True
  return out

from datetime import datetime

def formatLineData(lineData):
  out = [lineData[0]]
  out.append(datetime.strptime(lineData[1], '%Y-%m-%d').date())
  out += [float(d) for d in lineData[2:6]]
  out += [int(float(d)) for d in lineData[6:9]]
  out += [float(d) for d in lineData[9:13]]
  out.append(int(float(lineData[13])))
  return out

Does anyone have any insight into which parts of my code run slowly and how I can make it perform better? I can't do the sort of analysis I want to do without speeding this up.


EDIT: In response to the comments, I made some changes to the code in order to utilize the existing methods in the csv module:

import csv

def getData(symbol, database):
  out = ["Symbol","Date","Open","High","Low","Close","Volume","Dividend",
         "Split","Adj_Open","Adj_High","Adj_Low","Adj_Close","Adj_Volume"]
  l = len(symbol)
  beforeMatch = True
  with open(database, 'r') as f:
    databaseReader = csv.reader(f, delimiter=",")
    for row in databaseReader:
        match = (row[0] == symbol)
        if beforeMatch and match:
            beforeMatch = False
            out.append(formatLineData(row))
        elif not beforeMatch and match:
            out.append(formatLineData(row))
        elif not beforeMatch and not match:
            break
  return out

def getDataColumn(dataset, col=12, numDays=100, changeRateTransform=False):
  if not changeRateTransform:
    out = [day[col] for day in dataset[-numDays:]]
  else:
    n = len(dataset)
    out = [(dataset[i][col] - dataset[i-1][col])/dataset[i-1][col] for i in range(n - numDays, n)]
  return out

Performance was worse using the csv.reader class. I tested on two stocks, AMZN (near top of file) and ZNGA (near bottom of file). With the original method, the run times were 0.99 seconds and 18.37 seconds, respectively. With the new method leveraging the csv module, the run times were 3.04 seconds and 64.94 seconds, respectively. Both return the correct results.

My thought is that the time is being taken up more from finding the stock than from the parsing. If I try these methods on the first stock in the file, A, the methods both run in about 0.12 seconds.
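
For what it's worth, one way to check that hypothesis is to time the raw scan separately from the parsing. A rough sketch (reusing formatLineData from above; the timing helper itself is not part of the original code):

import time

def timeScanVsParse(symbol, filename):
    # Time the plain prefix scan separately from parsing the matched lines.
    prefix = symbol + ","
    t0 = time.perf_counter()
    matched = []
    with open(filename, 'r') as f:
        for line in f:
            if line.startswith(prefix):
                matched.append(line)
            elif matched:
                break  # past the symbol's block; the file is sorted by symbol
    t1 = time.perf_counter()
    parsed = [formatLineData(line.rstrip("\n").split(",")) for line in matched]
    t2 = time.perf_counter()
    print("scan: %.2fs, parse: %.2fs, rows: %d" % (t1 - t0, t2 - t1, len(parsed)))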

2 Answers

When you're going to do lots of analysis on the same dataset, the pragmatic approach would be to read it all into a database. It is made for fast querying; CSV isn't. Use the sqlite command line tools, for example, which can directly import from CSV. Then add a single index on (Symbol, Date) and lookups will be practically instantaneous.
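
For illustration, here is a minimal sketch of that approach using Python's built-in sqlite3 module rather than the sqlite command line tools (the table and column names here are assumptions, not taken from the actual file):

import csv
import sqlite3

# One-time import: load the CSV into SQLite and index it.
conn = sqlite3.connect("prices.db")
conn.execute("""CREATE TABLE IF NOT EXISTS prices (
    symbol TEXT, date TEXT, open REAL, high REAL, low REAL, close REAL,
    volume INTEGER, dividend REAL, split REAL, adj_open REAL, adj_high REAL,
    adj_low REAL, adj_close REAL, adj_volume INTEGER)""")
with open("prices.csv") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    conn.executemany(
        "INSERT INTO prices VALUES (" + ",".join(["?"] * 14) + ")", reader)
conn.execute("CREATE INDEX IF NOT EXISTS idx_symbol_date ON prices (symbol, date)")
conn.commit()

# Per lookup: the (symbol, date) index makes this practically instantaneous.
rows = conn.execute(
    "SELECT * FROM prices WHERE symbol = ? ORDER BY date DESC LIMIT 100",
    ("AMZN",)).fetchall()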

If for some reason that is not feasible, for example because new files can come in at any moment and you cannot afford the preparation time before starting your analysis of them, you'll have to make the best of dealing with CSV directly, which is what the rest of my answer will focus on. Remember that it's a balancing act, though. Either you pay a lot upfront, or a bit extra for every lookup. Eventually, for some amount of lookups it would have been cheaper to pay upfront.

Optimization is about maximizing the amount of work not done. Using generators and the built-in csv module aren't going to help much with that in this case. You'd still be reading the whole file and parsing all of it, at least for line breaks. With that amount of data, it's a no-go.

Parsing requires reading, so you'll have to find a way around it first. Best practices of leaving all intricacies of the CSV format to the specialized module bear no meaning when they can't give you the performance you want. Some cheating must be done, but as little as possible. In this case, I suppose it is safe to assume that the start of a new line can be identified as b'\n"AMZN",' (sticking with your example). Yes, binary here, because remember: no parsing yet. You could scan the file as binary from the beginning until you find the first line. From there read the amount of lines you need, decode and parse them the proper way, etc. No need for optimization there, because 100 lines are nothing to worry about compared to the hundreds of thousands of irrelevant lines you're not doing that work for.

Dropping all that parsing buys you a lot, but the reading needs to be optimized as well. Don't load the whole file into memory first and skip as many layers of Python as you can. Using mmap lets the OS decide what to load into memory transparently and lets you work with the data directly.

Still you're potentially reading the whole file, if the symbol is near the end. It's a linear search, which means the time it takes is linearly proportional to the number of lines in the file. You can do better though. Because the file is sorted, you could improve the function to instead perform a kind of binary search. The number of steps that takes (where a step is reading a line) is close to the binary logarithm of the number of lines - in other words, the number of times you can divide your file into two (almost) equally sized parts. With a million lines, that's roughly 20 steps instead of a million reads: a difference of almost five orders of magnitude!

Here's what I came up with, based on Python's own bisect_left with some measures to account for the fact that your "values" span more than one index:

import csv
from itertools import islice
import mmap

def iter_symbol_lines(f, symbol):
    # How to recognize the start of a line of interest
    ident = b'"' + symbol.encode() + b'",'
    # The memory-mapped file
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Skip the header
    mm.readline()
    # The inclusive lower bound of the byte range we're still interested in
    lo = mm.tell()
    # The exclusive upper bound of the byte range we're still interested in
    hi = mm.size()
    # As long as the range isn't empty
    while lo < hi:
        # Find the position of the beginning of a line near the middle of the range
        mid = mm.rfind(b'\n', 0, (lo+hi)//2) + 1
        # Go to that position
        mm.seek(mid)
        # Is it a line that comes before lines we're interested in?
        if mm.readline() < ident:
            # If so, ignore everything up to right after this line
            lo = mm.tell()
        else:
            # Otherwise, ignore everything from right before this line
            hi = mid
    # We found where the first line of interest would be expected; go there
    mm.seek(lo)
    while True:
        line = mm.readline()
        if not line.startswith(ident):
            break
        yield line.decode()

with open(filename) as f:
    r = csv.reader(islice(iter_symbol_lines(f, 'AMZN'), 10))
    for line in r:
        print(line)

No guarantees about this code; I didn't pay much attention to edge cases, and I couldn't test with (any of) your file(s), so consider it a proof of concept. It is plenty fast, however – think tens of milliseconds on an SSD!
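
One detail to keep in mind: the generator yields the symbol's rows from the first one onward, whereas the question asks for the most recent n rows. A possible adaptation (a sketch reusing iter_symbol_lines above and formatLineData from the question) is to keep only the tail in a bounded deque:

import csv
from collections import deque

def last_n_rows(filename, symbol, n=100):
    # Parse every row for the symbol, but keep only the final n of them;
    # earlier rows are discarded as the bounded deque fills up.
    with open(filename) as f:
        reader = csv.reader(iter_symbol_lines(f, symbol))
        return list(deque((formatLineData(row) for row in reader), maxlen=n))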


So I have an alternative solution, which I ran and tested on my own with a sample data set I got on Quandl that appears to have all the same headers and similar data (assuming I haven't misunderstood the end result you're trying to achieve).

I have this command line tool that one of our engineers built for us for parsing massive CSVs - since I deal with absurd amounts of data on a day-to-day basis - it is open-sourced and you can get it here: https://github.com/DataFoxCo/gocsv

I also wrote a short bash script for it in case you don't want to pipeline the commands yourself, though the tool does support pipelining as well.

The command to run the following short script follows a super simple convention:

bash tickers.sh wikiprices.csv 'AMZN' '2016-12-\d+|2016-11-\d+'

#!/bin/bash

dates="$3"
cat "$1" \
  | gocsv filter --columns 'ticker' --regex "$2" \
  | gocsv filter --columns 'date' --regex "$dates" > "$2"'-out.csv'

  • Both arguments, for the ticker and for the dates, are regexes.
  • You can add as many variations as you want into that one regex, separating them by |. So if you wanted AMZN and MSFT, you would simply modify it to this: AMZN|MSFT
  • I did something very similar with the dates, but I only limited my sample run to dates from this month or last month.

End Result

Starting data:

myusername$ gocsv dims wikiprices.csv    
Dimensions:
  Rows: 23946
  Columns: 14

myusername$ bash tickers.sh wikiprices.csv 'AMZN|MSFT' '2016-12-\d+'

myusername$ gocsv dims 'AMZN|MSFT-out.csv'
Dimensions:
  Rows: 24
  Columns: 14

In the sample run above, I limited the output to just those two tickers and then to December only.


Voila - in a matter of seconds you have a second file saved with only the data you care about.

The gocsv program has great documentation, by the way, and a ton of other functions - e.g. running a vlookup at basically any scale (which is what inspired the creator to make the tool).
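
If you'd rather stay in Python, roughly the same filter can be written with the standard csv and re modules. A sketch (the 'ticker' and 'date' column names match the Quandl-style header used above; the function name is just for illustration):

import csv
import re

def filter_csv(in_path, out_path, ticker_regex, date_regex):
    # Stream the file row by row, keeping rows whose ticker and date
    # columns match the given regexes (mirrors the gocsv filter pipeline).
    ticker_re = re.compile(ticker_regex)
    date_re = re.compile(date_regex)
    with open(in_path, newline='') as fin, open(out_path, 'w', newline='') as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if ticker_re.search(row['ticker']) and date_re.search(row['date']):
                writer.writerow(row)

# e.g. filter_csv('wikiprices.csv', 'filtered-out.csv', r'AMZN|MSFT', r'2016-12-\d+')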
