It seems like a question that should have been asked before, but I couldn't find it. So here it goes.
The data:
- a master list ( length ~= 16,000,000 ) of sublists ( each up to 500 items long ) of str
The aim:
- to shuffle each of the sublists within the master list efficiently.
I have attempted the straight for-loop, list comprehension, pandas Series.apply(), pandarallel, and dask DataFrame .apply() and .map_partitions() methods.
A for-loop takes about 15 minutes. pd.Series.apply(), dask.Series.apply(), and dask.Series.map_partitions() all managed to do it in just over 6 minutes.
My question is "can I achieve the shuffling faster"? Either producing a new copy or shuffling in place is acceptable.
Below are my attempts:
import random

def normal_shuffle(series):
    output = series.tolist()
    length = len(output)
    for i in range(length):
        # shuffle each sublist in place; a module-level random.shuffle() would
        # avoid instantiating a fresh random.Random() on every iteration
        random.Random().shuffle(output[i])
    return output

def shuffle_returned(a_list):
    # note: this is an alias, not a copy, so a_list itself gets shuffled in place
    new_list = a_list
    random.shuffle(new_list)
    return new_list

def shuffle_partition(a_partition):
    return a_partition.apply(shuffle_returned)
%time shuffled_for = normal_shuffle(test_series)
%time shuffled_apply = test_series.apply(shuffle_returned)

from pandarallel import pandarallel
pandarallel.initialize(progress_bar=False, nb_workers=8)
%time shuffled_parallel_apply = test_series.parallel_apply(shuffle_returned)

import dask.dataframe as ddf
test_ddf = ddf.from_pandas(test_series, npartitions=16)
test_ddf = test_ddf.reset_index(drop=True)
shuffled_ddf = test_ddf.apply(shuffle_returned, meta="some_str")
%time shuffled_ddf.persist()
shuffled_by_partition_ddf = test_ddf.map_partitions(shuffle_partition, meta="productId")
%time shuffled_by_partition_ddf.persist()
Now I am trying to use dask distributed to see if I can somehow stagger the model training and the data shuffling, so that the training time and the shuffling time overlap and I achieve a better overall time efficiency.
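A minimal sketch of that staggering idea, assuming a local dask.distributed cluster ( test_ddf and shuffle_returned are from the attempts above; everything else is hypothetical ): with a distributed client, .persist() returns immediately, so the next chunk's shuffle can run on the workers in the background while the current chunk trains.

from dask.distributed import Client, wait

client = Client(n_workers=8)  # local cluster; size it to the machine

# non-blocking: the shuffle starts on the workers and control returns at once
shuffled = test_ddf.apply(shuffle_returned, meta="some_str").persist()

# ... train the model on the previously shuffled chunk here ...

wait(shuffled)  # block only at the point the freshly shuffled data is actually needed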
I would very much appreciate any feedback or suggestions on how I can make the shuffling operation more efficient.
UPDATE
Having tried some of the suggestions, the following turned out to be the fastest I could achieve, which is also surprisingly simple!
%time [np.random.shuffle(x) for x in alist]
CPU times: user 23.7 s, sys: 160 ms, total: 23.9 s
Wall time: 23.9 s
Single-threaded numpy is the way to go here, it seems!
@TerryH - you need not .shuffle() the RAM-memory-content of aListOfSTRINGs at all, it ought be just enough to generate a np.random.permutation( len( aListOfListsOfSTRINGs[ ith ] ) ) so as to create ad-hoc, at a cost of but O(1) ~ 260 [us] per list, spent ALAP, a new random order, right-sized for an indirect access to the str-members of the ith aListOfSTRINGs.
( why move RAM-I/O expensive data so as to "read"-in-order somewhere later, when no data need ever be touched, until ALAP "reading" from cache-served block(s) using an indirect-addressing of the components? )
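A minimal sketch of that indirect-access idea, following the answer's naming ( the toy data is hypothetical ): the str data never moves; only a right-sized permutation gets generated and used as the read-order.

import numpy as np

aListOfListsOfSTRINGs = [ [ "a", "b", "c" ], [ "x", "y" ] ]   # toy stand-in data

ith  = 0
perm = np.random.permutation( len( aListOfListsOfSTRINGs[ith] ) )  # ad-hoc random order
for j in perm:                             # indirect addressing: no str ever moved
    s = aListOfListsOfSTRINGs[ith][j]      # "read" ALAP, straight from where it sits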
For the Real-World costs of a wish to go parallel, you may like this post, with an interactive graph-tool.
As @user2357112 supports Monica commented below, the shuffling was aimed to take place inside each aListOfSTRINGs, not on the aListOfListsOfSTRINGs. Mea culpa.
Q : "can I achieve the shuffling faster"?
Yes. A lot. ...150 x
times - well under 2.5 [s]
are achievable with the right-enough tools
Q : "... how I can make it shuffling operation more efficient ?"
The in-place .shuffle()
takes less than ~ 23 [s]
on list( L )
over 16,000,000 items in plain Py2.7 tools
from zmq import Stopwatch; aClk = Stopwatch() #_______________________ a [us] Stopwatch
pass; import random
#_____________L creation ~ 2.7 [s]___________________________________________
aClk.start(); L = [ strID for strID in xrange( int( 16E6 ) ) ]; aClk.stop()
2721084
print L[:5] #___________________________________________________________proof
[0, 1, 2, 3, 4]
#_____________random.shuffle( L )______________________________________+proof
aClk.start(); random.shuffle( L ); aClk.stop(); print "0:5\t", L[:5]
21473261
0:5 [13868243, 13087869, 13207292, 9344202, 1853783]
#_____________random.shuffle( L )______________________________________+proof
aClk.start(); random.shuffle( L ); aClk.stop(); print "0:5\t", L[:5]
22573922
0:5 [837396, 15032889, 10942767, 14571341, 4867854]
#_______________________________________________________________________proof
>>> len( L )
16000000
The in-place .shuffle() takes under ~ 48 [s] on a list( L ) of 16,000,000 items in plain Py3.5 tools:
$ conda activate py3
$ python
...
aClk.start(); L = [ strID for strID in range( int( 16E6 ) ) ]; aClk.stop()
1959052
#_____________random.shuffle( L )______________________________________+proof
aClk.start(); random.shuffle( L ); aClk.stop(); print( "0:5\t", L[:5] )
45104806
0:5 [15744525, 10635923, 14530509, 10535840, 1465987]
#_____________random.shuffle( L )______________________________________+proof
aClk.start(); random.shuffle( L ); aClk.stop(); print( "0:5\t", L[:5] )
47139358
0:5 [884437, 15420153, 9957947, 8118734, 11960914]
import numpy as np
#____________L_as_a32______________16E6________________________~ 74 [ms]
aClk.start(); a32 = np.arange( 16E6, dtype = np.int32 ); aClk.stop()
74054
#_____________np.random.shuffle( a32-bit )______________________________+proof
aClk.start(); np.random.shuffle( a32 ); aClk.stop(); print "0:5\t", a32[:5]
2400786
0:5 [ 2487493 14646705 13717283 5602561 7934593]
aClk.start(); np.random.shuffle( a32 ); aClk.stop(); print "0:5\t", a32[:5]
2368381
0:5 [ 4841042 12882529 12298351 2198866 7054284]
aClk.start(); np.random.shuffle( a32 ); aClk.stop(); print "0:5\t", a32[:5]
2369011
0:5 [14595649 7239135 3339593 9517600 6506681]
#____________L_as_a64______________16E6_______________________________________
aClk.start(); a64 = np.arange( 16E6, dtype = np.int64 ); aClk.stop()
#_____________np.random.shuffle( a64-bit )______________________________+proof
aClk.start(); np.random.shuffle( a64 ); aClk.stop(); print "0:5\t", a64[:5]
2424487
0:5 [ 3234133 9224551 971604 13027484 806393]
aClk.start(); np.random.shuffle( a64 ); aClk.stop(); print "0:5\t", a64[:5]
2386873
0:5 [ 3212124 10644428 8192909 2234984 13103406]
aClk.start(); np.random.shuffle( a64 ); aClk.stop(); print "0:5\t", a64[:5]
2376065
0:5 [ 5624301 7741070 8859092 12287465 11721315]
The recipe:
- keep the str data as-is, stored just once in each aListOfSTRINGs
- append each new aListOfSTRINGs at a constant cost of O(1) to a non-re-shuffled, linearly growing, constant-order storage - the aListOfListsOfSTRINGs
- rather than shuffling that aListOfListsOfSTRINGs, maintain a aListOfORDINALs ( be it a plain-list or a numpy-array ), just appending a len( aListOfListsOfSTRINGs ) whenever a new member-aListOfSTRINGs got in
- aListOfORDINALs.shuffle(), well under ~ 23 [s] in Py2.7 or < 50 [s] in Py3.5
- access the str-s as aListOfListsOfSTRINGs[ aListOfORDINALs[ Nth_L_inLoLoStr ] ][ Nth_str_inLoStr ], at superfast times, at a cost of O(1) to get the actual str-s
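A minimal sketch of this recipe, under the answer's naming ( the toy data and the two index variables are hypothetical ):

import random

aListOfListsOfSTRINGs = []          # constant-order storage, never re-shuffled
aListOfORDINALs       = []          # grows at O(1) alongside the storage

def append_new( aListOfSTRINGs ):   # O(1) bookkeeping per each new member-list
    aListOfORDINALs.append( len( aListOfListsOfSTRINGs ) )
    aListOfListsOfSTRINGs.append( aListOfSTRINGs )

append_new( [ "a", "b", "c" ] )
append_new( [ "x", "y" ] )

random.shuffle( aListOfORDINALs )   # cheap: only the ordinals move, never the str data

Nth_L_inLoLoStr, Nth_str_inLoStr = 0, 1
s = aListOfListsOfSTRINGs[ aListOfORDINALs[ Nth_L_inLoLoStr ] ][ Nth_str_inLoStr ]  # O(1)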