I have a function set up for Pandas
that runs through a large number of rows in input.csv
and inputs the results into a Series. It then writes the Series to output.csv
.
However, if the process is interrupted (for example by an unexpected event) the program will terminate and all data that would have gone into the csv is lost.
Is there a way to write the data continuously to the csv, regardless of whether the function finishes for all rows?
Prefarably, each time the program starts, a blank output.csv
is created, that is appended to while the function is running.
import pandas as pd
df = pd.read_csv("read.csv")
def crawl(a):
#Create x, y
return pd.Series([x, y])
df[["Column X", "Column Y"]] = df["Column A"].apply(crawl)
df.to_csv("write.csv", index=False)
By using pandas. DataFrame. to_csv() method you can write/save/export a pandas DataFrame to CSV File. By default to_csv() method export DataFrame to a CSV file with comma delimiter and row index as the first column.
apply is not faster in itself but it has advantages when used in combination with DataFrames. This depends on the content of the apply expression. If it can be executed in Cython space, apply is much faster (which is the case here).
Pandas DataFrame to_csv() function converts DataFrame into CSV data. We can pass a file object to write the CSV data into a file. Otherwise, the CSV data is returned in the string format.
This is a possible solution that will append the data to a new file as it reads the csv in chunks. If the process is interrupted the new file will contain all the information up until the interruption.
import pandas as pd
#csv file to be read in
in_csv = '/path/to/read/file.csv'
#csv to write data to
out_csv = 'path/to/write/file.csv'
#get the number of lines of the csv file to be read
number_lines = sum(1 for row in (open(in_csv)))
#size of chunks of data to write to the csv
chunksize = 10
#start looping through data writing it to a new file for each chunk
for i in range(1,number_lines,chunksize):
df = pd.read_csv(in_csv,
header=None,
nrows = chunksize,#number of rows to read at each loop
skiprows = i)#skip rows that have been read
df.to_csv(out_csv,
index=False,
header=False,
mode='a',#append data to csv file
chunksize=chunksize)#size of data to append for each loop
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With