Using Pandas vs. CSV reader/writer to process and save large CSV file

Tags: python, pandas, csv

I'm fairly new to Python and pandas, but I'm trying to get better with them for parsing and processing large data files. I'm currently working on a project that requires me to parse a few dozen large CSV CAN files at a time. The files have 9 columns of interest (1 ID and 8 data fields), have about 1-2 million rows, and are encoded in hex.

A sample bit of data looks like this:

   id   Flags   DLC Data0   Data1   Data2   Data3   Data4   Data5   Data6   Data7
cf11505  4      1   ff                          
cf11505  4      1   ff                          
cf11505  4      1   ff                          
cf11a05  4      1   0                           
cf11505  4      1   ff                          
cf11505  4      1   ff                          
cf11505  4      1   ff                          
cf11005  4      8   ff       ff      ff      ff      ff       ff      ff     ff

I need to decode the hex, and then extract a bunch of different variables from it depending on the CAN ID.
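To make the bit twiddling in the scripts below easier to follow, here is what extracting one value looks like for a single row (a minimal sketch; the byte pairing and the /10 scale factor mirror the var1 calculation further down, but the real scaling depends on the CAN ID):

# decode two hex byte strings from one row and combine them into one value
data0, data1 = 'ff', 'ff'                      # e.g. the 0xcf11005 row in the sample
raw = (int(data1, 16) << 8) | int(data0, 16)   # 16-bit value with Data1 as the high byte
var1 = raw / 10                                # per-signal scale factor (illustrative)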

A colleague of mine wrote a script to parse these files that looks like this (henceforth known as Script #1):

import sys      # imports the sys module
import csv      # needed for csv.reader/csv.writer below
import itertools
import datetime
import time
import glob, os

start = time.time()   # start timing the whole run

for filename in glob.glob(sys.argv[1] + "/*.csv"): 
    print('working on ' + filename +'...')

    #Initialize a bunch of variables

    csvInput = open(filename, 'r') # opens the csv file
    csvOutput = open((os.path.split(filename)[0] + os.path.split(filename)[1]), 'w', newline='')

    writer = csv.writer(csvOutput) #creates the writer object
    writer.writerow([var1, var2, var3, ...])

    try:
        reader = csv.reader(csvInput)
        data=list(reader)

        if (data[3][1] == 'HEX'): dataType = 16
        elif (data[3][1] == 'DEC'): dataType = 10
        else: print('Invalid Data Type')

        if (data[4][1] == 'HEX'): idType = 16
        elif (data[4][1] == 'DEC'): idType = 10
        else: print('Invalid ID Type') 

        start_date = datetime.datetime.strptime(data[6][1],'%Y-%m-%d %H:%M:%S')      

        for row in itertools.islice(data,8,None):
            try: ID = int(row[2],idType)
            except: ID = 0

            if (ID == 0xcf11005):
                for i in range(0,4): var1[i] = float((int(row[2*i+6],dataType)<<8) | int(row[2*i+5],dataType))  # combine high and low data bytes into one value

            #similar operations for a bunch of variables go here

            writer.writerow([var1[0], var2[1],.....])

    finally:
        csvInput.close()
        csvOutput.close()

end = time.time()   # stop timing
print(end - start)
print('done')

It basically uses the csv reader and writer to generate a processed CSV file line by line for each input CSV. For a 2-million-row CAN file, it takes about 40 seconds to run on my work desktop. Knowing that line-by-line iteration is much slower than performing vectorized operations on a pandas DataFrame, I thought I could do better, so I wrote a script that looks like this (Script #2):

from timeit import default_timer as timer
import numpy as np
import pandas as pd
import os
import datetime
from tkinter import filedialog
from tkinter import Tk

Tk().withdraw()
filename = filedialog.askopenfile(title="Select .csv log file", filetypes=(("CSV files", "*.csv"), ("all files", "*.*")))

name = os.path.basename(filename.name)
##################################################
df = pd.read_csv(name, skiprows = 7, usecols = ['id', 'Data0', 'Data1', 'Data2', 'Data3', 'Data4', 'Data5', 'Data6', 'Data7'], 
                 dtype = {'id':str, 'Data0':str, 'Data1':str, 'Data2':str, 'Data3':str, 'Data4':str, 'Data5':str, 'Data6':str, 'Data7':str})

log_cols = ['id', 'Data0', 'Data1','Data2', 'Data3', 'Data4', 'Data5', 'Data6', 'Data7']

for col in log_cols: 
    df[col] = df[col].dropna().astype(str).apply(lambda x: int(x, 16))   

df.loc[:, 'Data0':'Data7'] = df.loc[:, 'Data0':'Data7'].fillna(method = 'ffill') #forward fill empty rows
df.loc[:, 'Data0':'Data7'] = df.loc[:, 'Data0':'Data7'].fillna(value = 0) #replace any remaining nans with 0

df['Data0'] = df['Data0'].astype(np.uint8)
df.loc[:, 'Data0':'Data7'] = df.loc[:, 'Data0':'Data7'].astype(np.uint8)

processed_df = pd.DataFrame(np.nan, index=range(0, len(df)), columns=['var1', 'var2', 'var3', ...])

start_date = datetime.datetime.strptime('7/17/2018 14:12:48','%m/%d/%Y %H:%M:%S')

processed_df ['Time Since Start (s)']  = pd.read_csv(name, skiprows = 7, usecols = ['Time'], dtype = {'Time':np.float32}, engine = 'c')
processed_df['Date'] = pd.to_timedelta(processed_df['Time Since Start (s)'], unit = 's') + start_date
processed_df['id'] = df['id']

processed_df.loc[:, 'var1':'var37'] = processed_df.loc[:, 'var1':'var37'].astype(np.float32)

##################Data Processing###########################
processed_df.loc[processed_df.id == int(0xcf11005), 'var1'] = np.bitwise_or(np.left_shift(df['Data1'], 8), df['Data0'])/10
#a bunch of additional similar vectorized calculations go here to pull useful values

name_string = "Processed_" + name
processed_df.to_csv(name_string) #dump dataframe to CSV

The processing part was definitely faster, although not as much as I had hoped: it took about 13 seconds to process the 2-million-row CSV file. There's probably more I could do to optimize Script #2, but that's a topic for another post.
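For example, one idea I haven't tried yet: since each data byte can only take 256 distinct values, the per-element int(x, 16) conversion could be swapped for a precomputed lookup table plus Series.map. A rough sketch, assuming the bytes are plain hex strings with no '0x' prefix (as in the sample):

hex_lut = {format(v, 'x'): v for v in range(256)}           # 'ff' -> 255, '0' -> 0, ...
hex_lut.update({format(v, '02x'): v for v in range(256)})   # also accept zero-padded bytes like '0f'

for col in ['Data0', 'Data1', 'Data2', 'Data3', 'Data4', 'Data5', 'Data6', 'Data7']:
    df[col] = df[col].str.lower().map(hex_lut)              # NaNs stay NaN, so the ffill still works

The id column has more than two hex digits, so it would still need the int(x, 16) conversion.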

Anyway, my hopes that Script #2 would actually be faster than Script #1 were dashed when I tried to save the DataFrame as a CSV. The .to_csv() method took 40 seconds on its own! I tried playing around with a few parameters of .to_csv(), including chunk size and compression, as well as reducing the memory footprint of the DataFrame, but even with these tweaks it still took 30 seconds to save, and once you factor in the initial processing time, the entire script was slower than the original row-by-row Script #1.
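For reference, the tweaks were along these lines (the parameter values here are just examples, not the exact ones I used):

float_cols = processed_df.select_dtypes(include=['float64']).columns
processed_df[float_cols] = processed_df[float_cols].astype(np.float32)   # smaller memory footprint
processed_df.to_csv(name_string, chunksize=100000)                       # write in chunks
processed_df.to_csv(name_string + '.gz', compression='gzip')             # compressed output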

Is row-by-row iteration of a CSV file really the most computationally efficient way to parse these files?

asked Sep 19 '25 by P-Rock


1 Answer

The dask library might be worth a look. It implements a subset of the pandas DataFrame API, but operates on the data in chunks rather than holding it all in memory, while still letting you work with the DataFrame as if it were in memory. I believe it can even treat multiple files as a single DataFrame, and it supports other things like parallelizing work across multiple cores or machines.

This was faster for me when I was dealing with a 6GB CSV with millions of rows.

https://dask.pydata.org/en/latest/
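A minimal sketch of what that could look like here (the glob pattern and output name are placeholders; dask.dataframe's read_csv and to_csv mirror the pandas calls):

import dask.dataframe as dd

# treat every CSV in the folder as one lazy, chunked DataFrame
ddf = dd.read_csv('path/to/logs/*.csv', skiprows=7, dtype=str)

# ...the same column conversions / bitwise operations as in Script #2 would go here...

ddf.to_csv('Processed_*.csv')   # writes one output file per partition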

answered Sep 21 '25 by pynterest