Python: Limiting the size of a json string for posting to a server

Tags:

python

json

post

I'm posting hundreds of thousands of JSON records to a server that has a MAX data upload limit of 1MB. My records can be of very variable size, from as little as a few hundred bytes, to a few hundred thousand.

def checkSize(payload):
    return len(payload) >= bytesPerMB


toSend = []
for row in rows:
    toSend.append(row)
    postData = json.dumps(toSend)
    tooBig = checkSize(postData)
    if tooBig:
        sendToServer(postData)

Which then posts to the server. It currently works, but the constant dumping of toSend to a jsonified string seems really heavy and almost 100% too much, although I can't seem to find another way to do it. Would I be ok with stringifying the individual new records and keeping a tally of what they would be together?

I'm sure there must be a cleaner way of doing this, but I just don't know.

Thanks for any and all help given.
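For reference, the tally idea from the question can be sketched roughly like this. It assumes compact separators (as the accepted solution below uses), so the running total is exact: 2 bytes for the enclosing `[]` plus 1 byte per separating comma. The function name `batch_by_size` is just illustrative, not from any library:

```python
import json

bytesPerMB = 1024 * 1024
SEPARATORS = (",", ":")  # compact separators, so sizes are predictable

def jsonDump(obj):
    return json.dumps(obj, separators=SEPARATORS)

def batch_by_size(rows, limit=bytesPerMB):
    """Yield lists of rows whose combined compact-JSON size stays <= limit.

    Each row is serialized exactly once; a running tally adds 2 bytes for
    the enclosing '[]' and 1 byte per separating ','.
    """
    batch, size = [], 2  # 2 bytes for '[' and ']'
    for row in rows:
        rowSize = len(jsonDump(row))
        extra = rowSize + (1 if batch else 0)  # +1 for ',' between rows
        if batch and size + extra > limit:
            yield batch
            batch, size = [row], 2 + rowSize
        else:
            batch.append(row)
            size += extra
    if batch:
        yield batch
```

Each batch then goes out as `sendToServer(jsonDump(batch))`. Note a single row that is on its own larger than the limit still comes out as a one-row batch, so that case would need separate handling.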


This is the answer I'm now using. I came up with it at the same time as @rsegal below; I'm posting it for clarity and completeness (sendToServer is just a dummy function to show things are working correctly):

import pickle
import json

with open("userProfiles", "rb") as f:
    rows = pickle.load(f)

bytesPerMB = 1024 * 1024
comma = ","
appendSize = len(comma)

def sendToServer(obj):
    #send to server
    pass

def checkSize(numBytes):
    return numBytes >= bytesPerMB

def jsonDump(obj):
    return json.dumps(obj, separators=(comma, ":"))

leftover = []
numRows = len(rows)
rowsSent = 0

while len(rows) > 0:
    toSend = leftover[:]
    toSendSize = len( jsonDump(toSend) )
    leftover = []
    first = len(toSend) == 0

    while True:
        try:
            row = rows.pop()
        except IndexError:
            break

        rowSize = len( jsonDump(row) ) + (0 if first else appendSize)
        first = False

        if checkSize(toSendSize + rowSize):
            leftover.append(row)
            break

        toSend.append(row)
        toSendSize += rowSize

    rowsSent += len(toSend)
    postData = jsonDump(toSend)
    print("assuming to send '{0}' bytes, actual size '{1}'. rows sent {2}, total {3}".format(toSendSize, len(postData), rowsSent, numRows))
    sendToServer(postData)
asked Nov 04 '22 by seaders

1 Answer

I would do something like the following:

toSend = []
toSendLength = 0
for row in rows:
    tentativeLength = len(json.dumps(row))
    if tentativeLength > bytesPerMB:
        parsingBehavior  # do something about lolhuge rows
    elif toSendLength + tentativeLength > bytesPerMB:  # it would be too large
        sendToServer(json.dumps(toSend))  # don't exceed limit; send now
        toSend = [row]  # refresh for next round - and we know it fits!
        toSendLength = tentativeLength
    else:  # otherwise, it won't be too long, so add it in
        toSend.append(row)
        toSendLength += tentativeLength
sendToServer(json.dumps(toSend))  # send whatever is left at the end

The issue with your solution is that it's not great from a Big-O perspective. Mine runs in linear time; yours runs in quadratic time, because you re-serialize the entire cumulative list on every iteration to check its length. Resetting postData every time isn't very efficient, either.
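To see the difference concretely, here is a toy count of how many bytes each approach ends up serializing for n rows of roughly b bytes each (ignoring separators and brackets; the function names are just for illustration):

```python
def bytes_serialized_quadratic(n, b):
    # json.dumps(toSend) on every iteration re-serializes all rows so far:
    # b*1 + b*2 + ... + b*n = b * n * (n + 1) / 2
    return sum(b * i for i in range(1, n + 1))

def bytes_serialized_linear(n, b):
    # each row is dumped once to measure it, and once in the final payload
    return 2 * b * n
```

With n = 100,000 rows of b = 500 bytes, the quadratic approach serializes on the order of 2.5 trillion bytes in total, versus about 100 million for the linear one.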

answered Nov 12 '22 by rsegal