I'm posting hundreds of thousands of JSON records to a server that has a maximum upload limit of 1MB per request. My records vary widely in size, from as little as a few hundred bytes to a few hundred thousand.
def checkSize(payload):
    return len(payload) >= bytesPerMB

toSend = []
for row in rows:
    toSend.append(row)
    postData = json.dumps(toSend)
    tooBig = checkSize(postData)
    if tooBig:
        sendToServer(postData)
        toSend = []
Which then posts to the server. It currently works, but the constant dumping of toSend to a jsonified string seems really heavy, and almost all of it is redundant work, although I can't seem to find another way to do it. Would I be OK with stringifying the individual new records and keeping a tally of what they would add up to together?
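For what it's worth, the tally idea is sound: with compact separators, the length of the full dump is exactly the sum of the per-record lengths plus the brackets and commas. A quick sanity check (the records here are made-up placeholder data):

```python
import json

# Hypothetical records; any JSON-serializable rows behave the same way.
rows = [{"id": i, "name": "user{0}".format(i)} for i in range(5)]

# Serialize each record once and keep a running tally, instead of
# re-dumping the whole list. With compact separators the parts sum
# exactly: "[" + records + (n - 1) commas + "]".
sizes = [len(json.dumps(r, separators=(",", ":"))) for r in rows]
tally = 2 + sum(sizes) + (len(rows) - 1)  # brackets + records + commas

full = len(json.dumps(rows, separators=(",", ":")))
assert tally == full
```

Note this only holds when you pass the same `separators` to both the per-record and the full dump; the default separators add a space after each comma.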
I'm sure there must be a cleaner way of doing this, but I just don't know.
Thanks for any and all help given.
This is the answer I'm now using. I came up with it at the same time as @rsegal below; I'm posting it for clarity and completeness (sendToServer is just a dummy function to show things are working correctly):
import pickle
import json

# Load the previously pickled rows (binary mode is required for pickle).
with open("userProfiles", "rb") as f:
    rows = pickle.load(f)

bytesPerMB = 1024 * 1024
comma = ","
appendSize = len(comma)

def sendToServer(obj):
    # send to server
    pass

def checkSize(numBytes):
    return numBytes >= bytesPerMB

def jsonDump(obj):
    return json.dumps(obj, separators=(comma, ":"))

leftover = []
numRows = len(rows)
rowsSent = 0

while len(rows) > 0:
    # Start the batch with whatever didn't fit last time.
    toSend = leftover[:]
    toSendSize = len(jsonDump(toSend))
    leftover = []
    first = len(toSend) == 0

    while True:
        try:
            row = rows.pop()
        except IndexError:
            break
        # Each record costs its own serialized length plus a joining comma.
        rowSize = len(jsonDump(row)) + (0 if first else appendSize)
        first = False
        if checkSize(toSendSize + rowSize):
            # This row would push the batch over the limit; save it for next round.
            leftover.append(row)
            break
        toSend.append(row)
        toSendSize += rowSize

    rowsSent += len(toSend)
    postData = jsonDump(toSend)
    print("assuming to send '{0}' bytes, actual size '{1}'. rows sent {2}, total {3}".format(
        toSendSize, len(postData), rowsSent, numRows))
    sendToServer(postData)
I would do something like the following:
toSend = []
toSendLength = 0
for row in rows:
    tentativeLength = len(json.dumps(row))
    if tentativeLength > bytesPerMB:
        parsingBehavior  # do something about lolhuge records
    elif toSendLength + tentativeLength > bytesPerMB:  # it would be too large
        sendToServer(json.dumps(toSend))  # don't exceed the limit; send now
        toSend = [row]  # refresh for next round - and we know it fits!
        toSendLength = tentativeLength
    else:  # otherwise it won't be too long, so add it in
        toSend.append(row)
        toSendLength += tentativeLength
sendToServer(json.dumps(toSend))  # send whatever remains below the limit
The issue with your solution is that it's not great from a Big-O perspective. Mine runs in linear time; yours runs in quadratic time, because you re-serialize the entire accumulated list on every iteration of the loop. Rebuilding postData every time isn't very efficient, either.