For example, I have df1 and df2 from different domains:
df1 = pd.DataFrame({"question":["q1","q2"], "answer":["a1","a2"], "domain":"tech"})
df2 = pd.DataFrame({"question":["q3","q4"], "answer":["a3","a4"], "domain":"history"})
print(df1)
  question answer domain
0       q1     a1   tech
1       q2     a2   tech
print(df2)
  question answer  domain
0       q3     a3 history
1       q4     a4 history
What I want is the shuffled data:
print(shuffled1)
  question answer  domain
0       q3     a3 history
1       q1     a1    tech
print(shuffled2)
  question answer  domain
0       q2     a2    tech
1       q4     a4 history
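For two small frames like this I could simply pool them and reshuffle in memory, roughly like below, but that obviously does not work for my real data:
# Rough in-memory version of what I want: pool both frames, shuffle the rows,
# then split back into two frames of the original sizes.
pooled = pd.concat([df1, df2]).sample(frac=1).reset_index(drop=True)
shuffled1 = pooled.iloc[:len(df1)]
shuffled2 = pooled.iloc[len(df1):].reset_index(drop=True)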
In the real world, I have 60+ CSV files from different domains, all with the same structure. Each file has 50k records, and they cannot all be read into memory at the same time.
What I want to do is feed these files into a BERT model to train it, but the model will perform poorly if it learns from the "history" domain for 10k steps and then from the "tech" domain for another 10k steps. So I want to shuffle the data across the files, so that every domain's data is evenly distributed in each file.
One answer would be to read each file one by one and spread its lines across N new files. Doing so, you obtain "shuffled" files with a similar number of lines and the same proportions as the original files. Of course, it depends a lot on what kind of shuffled files you need.
The reading of the initial files can be done in parallel, but we would then need to coordinate the processes so that they do not write to the same file at the same time. I won't describe that here, because I think it is more than what is needed. See for example: Python multiprocessing safely writing to a file.
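Just to sketch the coordination idea (this is only an illustration, not part of the procedure below; the worker function and output path are hypothetical), a multiprocessing.Lock shared by the workers is the simplest way to serialize the writes:
import multiprocessing as mp
import numpy as np

def shuffle_and_append(path, lock, out_path):
    # Hypothetical worker: read and shuffle one original file, then hold the
    # lock while appending, so two processes never write at the same time.
    with open(path) as f:
        lines = f.readlines()
    np.random.shuffle(lines)
    with lock:                              # serialize the writes
        with open(out_path, 'a') as f:
            f.writelines(lines)

if __name__ == '__main__':
    lock = mp.Lock()
    procs = [mp.Process(target=shuffle_and_append,
                        args=(f'bigfile-{i}', lock, 'bigfile_shuffled-0'))
             for i in range(50)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()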
Besides the number of files you have and/or want, the limiting part below is the shuffling. Given your question, as it is limited to files of 50k lines for machine learning, I think the procedure below is enough. An array of 50k x 10 values takes around 4 MB, so it can be loaded entirely into memory and shuffled with np.random.shuffle. If it were much bigger, you would need another method, see shuffle a large list of items without loading in memory.
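Just to back up that estimate (assuming float64 values, the NumPy default):
import numpy as np

arr = np.random.rand(50_000, 10)   # one file's worth of rows
print(arr.nbytes / 1e6)            # -> 4.0 (MB), small enough to shuffle in RAM
np.random.shuffle(arr)             # shuffles the rows in place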
Thus, the procedure could be:
1. Read each original file
2. Shuffle its lines
3. Split the shuffled lines into N blocks (considering that the number of rows is higher than N)
4. Append each block to a different shuffled file
First things first, I generated 50 files of 100,000 lines, about 25 MB each:
import pandas as pd
import numpy as np

for i in range(50):
    arr = np.random.randint(1000, size=(100000, 10))
    with open(f'bigfile-{i}', 'w') as f:
        np.savetxt(f, arr, delimiter=',')
This is rough code, but it works:
originalFiles = [f'bigfile-{i}' for i in range(50)]  # paths of your original files
nbShuffled = len(originalFiles)  # number of shuffled files (you can choose)

for i, file in enumerate(originalFiles):
    # 1. Read the original file
    with open(file, 'r') as f:
        lines = f.readlines()
    # 2. Shuffle its lines
    np.random.shuffle(lines)
    # 3. Estimate the number of lines per block
    nbLines = len(lines)
    firstBlocks = int(np.floor(nbLines / nbShuffled))
    lastBlock = int(firstBlocks + nbLines % nbShuffled)
    blocks = [firstBlocks] * (nbShuffled - 1) + [lastBlock]
    # 4. Append one block to each shuffled file
    np.random.shuffle(blocks)  # avoid that the (bigger) last block always lands in the last shuffled file
    x = 0
    for b in range(nbShuffled):
        # indexed by b, so each block from this original file goes to a different shuffled file
        with open(f'bigfile_shuffled-{b}', 'a') as f:
            f.writelines(lines[x : x + blocks[b]])
        x += blocks[b]
It took ~13s to run on my computer (Linux 64-bit, 32 GB RAM, 16 CPUs).
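If you want a quick sanity check (just a suggestion, not part of the timing above): each shuffled file should end up with roughly the same number of lines as an original file.
for i in range(nbShuffled):
    with open(f'bigfile_shuffled-{i}') as f:
        print(f'bigfile_shuffled-{i}:', sum(1 for _ in f), 'lines')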