If I have a csv file that's too large to load into memory with pandas (in this case 35gb), I know it's possible to process the file in chunks, with chunksize.
However I want to know if it's possible to change chunksize based on values in a column.
I have an ID column, and then several rows for each ID with information, like this:
ID, Time, x, y
sasd, 10:12, 1, 3
sasd, 10:14, 1, 4
sasd, 10:32, 1, 2
cgfb, 10:02, 1, 6
cgfb, 10:13, 1, 3
aenr, 11:54, 2, 5
tory, 10:27, 1, 3
tory, 10:48, 3, 5
ect...
I don't want to separate IDs into different chunks. for example chunks of size 4 would be processed:
ID, Time, x, y
sasd, 10:12, 1, 3
sasd, 10:14, 1, 4
sasd, 10:32, 1, 2
cgfb, 10:02, 1, 6
cgfb, 10:13, 1, 3 <--this extra line is included in the 4 chunk
ID, Time, x, y
aenr, 11:54, 2, 5
tory, 10:27, 1, 3
tory, 10:48, 3, 5
...
Is it possible?
If not perhaps using the csv library with a for loop along the lines of:
for line in file:
x += 1
if x > 1000000 and curid != line[0]:
break
curid = line[0]
#code to append line to a dataframe
although I know this would only create one chunk, and for loops take a long time to process.
Selecting columns based on their name This is the most basic way to select a single column from a dataframe, just put the string name of the column in brackets. Returns a pandas series. Passing a list in the brackets lets you select multiple columns at the same time.
Sometimes, we use the chunksize parameter while reading large datasets to divide the dataset into chunks of data. We specify the size of these chunks with the chunksize parameter. This saves computational memory and improves the efficiency of the code.
Technically the number of rows read at a time in a file by pandas is referred to as chunksize. Suppose If the chunksize is 100 then pandas will load the first 100 rows. The object returned is not a data frame but a TextFileReader which needs to be iterated to get the data.
If you iterate through the csv file line by line, you can yield
chunks with a generator dependent on any column.
Working example:
import pandas as pd
def iter_chunk_by_id(file):
csv_reader = pd.read_csv(file, iterator=True, chunksize=1, header=None)
first_chunk = csv_reader.get_chunk()
id = first_chunk.iloc[0,0]
chunk = pd.DataFrame(first_chunk)
for l in csv_reader:
if id == l.iloc[0,0]:
id = l.iloc[0,0]
chunk = chunk.append(l)
continue
id = l.iloc[0,0]
yield chunk
chunk = pd.DataFrame(l)
yield chunk
## data.csv ##
# 1, foo, bla
# 1, off, aff
# 2, roo, laa
# 3, asd, fds
# 3, qwe, tre
# 3, tre, yxc
chunk_iter = iter_chunk_by_id("data.csv")
for chunk in chunk_iter:
print(chunk)
print("_____")
Output:
0 1 2
0 1 foo bla
1 1 off aff
_____
0 1 2
2 2 roo laa
3 2 jkl xds
_____
0 1 2
4 3 asd fds
5 3 qwe tre
6 3 tre yxc
_____
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With