I have some giant CSV files, around 23 GB each, and I want to do the following with their column headers:
If there is a column named SFID, rename column "Id" to "IgnoreId" and rename column "SFID" to "Id". Otherwise, do nothing.
All the Google search results I see are about how to import the CSV into a dataframe, rename the column, and export it back to a CSV.
To me this feels like a giant waste of time and memory, because we are effectively working only with the very first row of the CSV file (the header row). I don't know whether it is necessary to load the whole CSV as a dataframe and export it to a new CSV (or export it to the same CSV, effectively overwriting it).
Because the CSVs are huge, I have to load them in small chunks (using chunksize) and process them chunk by chunk, which takes time and memory. Again, this feels like a waste, because apart from the header we are not doing anything with the remaining chunks.
Is there a way to load just the header of a CSV file, change it, and save it back into the same CSV file?
I am open to using something other than pandas as well. The only real constraint is that the CSV files are too big to just double-click and open.
Write the header row first, then copy the data rows using shutil.copyfileobj.
shutil.copyfileobj took 38 seconds for a 0.5 GB file, whereas fileinput took 125 seconds for the same file.
Using shutil.copyfileobj
import shutil
import pandas as pd

# filename: path to the CSV file (defined elsewhere)
df = pd.read_csv(filename, nrows=0)  # read only the header row
if 'SFID' in df.columns:
    # rename columns
    df.rename(columns={"Id": "IgnoreId", "SFID": "Id"}, inplace=True)
    # construct new header row
    header_row = ','.join(df.columns) + "\n"
    # modify header in csv file: f1 reads and f2 writes, both on the same file
    with open(filename, "r+") as f1, open(filename, "r+") as f2:
        f1.readline()  # to move the pointer after header row
        f2.write(header_row)
        shutil.copyfileobj(f1, f2)  # copies the data rows
        f2.truncate()  # drop leftover bytes if the new header is shorter than the old one
Using fileinput
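import fileinput
import pandas as pd

# filename: path to the CSV file (defined elsewhere)
df = pd.read_csv(filename, nrows=0)  # read only the header row
if 'SFID' in df.columns:
    # rename columns
    df.rename(columns={"Id": "IgnoreId", "SFID": "Id"}, inplace=True)
    # construct new header row
    header_row = ','.join(df.columns)
    # modify header in csv file; with inplace=True, print() writes to the file
    f = fileinput.input(filename, inplace=True)
    for line in f:
        if fileinput.isfirstline():
            print(header_row)
        else:
            print(line, end='')
    f.close()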
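If pandas is not needed at all, the same idea can be sketched with only the standard library. This is a rough variant, not part of the answer above: it assumes the header fits on one physical line, that the file is UTF-8 without a BOM, and that there is enough free disk space for a second copy of the file. The function name rename_csv_header and the RENAMES mapping are made up for illustration. Instead of rewriting in place, it writes the new header plus a raw byte copy of the data rows to a temporary file and then swaps it in with os.replace.

```python
import csv
import io
import os
import shutil
import tempfile

# Hypothetical mapping, taken from the renames described in the question.
RENAMES = {"Id": "IgnoreId", "SFID": "Id"}


def rename_csv_header(path, encoding="utf-8"):
    """Rewrite only the header of `path`; return True if it was changed."""
    with open(path, "rb") as src:
        raw_header = src.readline()  # only the first line is ever parsed
        # Keep whatever line ending the original header used.
        eol = "\r\n" if raw_header.endswith(b"\r\n") else "\n"
        # Parse with csv so quoted names containing commas survive.
        text = raw_header.decode(encoding).rstrip("\r\n")
        names = next(csv.reader([text]), [])
        if "SFID" not in names:
            return False  # nothing to do, the file is left untouched

        buf = io.StringIO(newline="")
        csv.writer(buf, lineterminator=eol).writerow(
            [RENAMES.get(name, name) for name in names]
        )

        # Temporary file next to the original so os.replace stays on one filesystem.
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "wb") as dst:
                dst.write(buf.getvalue().encode(encoding))
                shutil.copyfileobj(src, dst)  # raw byte copy of the data rows
        except BaseException:
            os.remove(tmp_path)
            raise
    # Swap in the new file once both handles are closed.
    os.replace(tmp_path, path)
    return True
```

Calling rename_csv_header("big.csv") (file name is a placeholder) returns False and leaves the file alone when there is no SFID column. The trade-off compared with the in-place version is the extra disk space; in return, the original file stays intact if the copy is interrupted. Note that the replacement file gets the temporary file's permissions, so restore them afterwards if that matters.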