I have read multiple StackOverflow articles on this and most of the top 10 Google results. Where my issue deviates is that I am using one Python script to create my JSON files, and the next script, run not ten minutes later, can't read that very file.
Short version: I generate leads for my online business, and I am attempting to learn Python in order to have better analytics on those leads. I am scouring two years' worth of leads with the intent of retaining the useful data and dropping anything personal - email addresses, names, etc. - while also saving 30,000+ leads into a few dozen files for easy access.
So my first script opens every single individual lead file - 30,000+ of them - and determines the date each was captured based on a timestamp in the file. Then it saves that lead to the appropriate key in a dict. When all the data has been aggregated into this dict, text files are written using json.dumps.
The dict's structure is:
addData['lead']['July_2013'] = { ... }
where the 'lead' key can be lead, partial, and a few others, and the 'July_2013' key is obviously a date-based key that can be any combination of the full month name and 2013 or 2014, going back to 'February_2013'.
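For illustration, here is a minimal sketch of how one lead ends up in that structure (the names addData and leadRecord are just placeholders for this example; the real write code is posted below in the edit):

import json
from datetime import datetime

addData = {'lead': {}, 'partial': {}}

# a hypothetical lead record carrying a Unix timestamp
leadRecord = {'timeStamp': 1373385600, 'utm_source': 'google'}

ts = leadRecord.pop('timeStamp')
arcDate = datetime.fromtimestamp(ts).strftime("%B_%Y")  # e.g. 'July_2013'

addData['lead'].setdefault(arcDate, {})[ts] = leadRecord
print(json.dumps(addData['lead']))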
The full error is this:
ValueError: Unterminated string starting at: line 1 column 9997847 (char 9997846)
But I've manually looked at the file and my IDE says there are only 76,655 chars in the file. So how did it get to 9997846?
The file that fails is the 8th to be read; the other 7 and all other files that come after it read in via json.loads just fine.
Python says there is an unterminated string, so I looked at the end of the JSON in the file that fails and it appears to be fine. I've seen some mention about newlines needing to be \n in JSON, but this string is all one line. I've seen mention of escaped vs unescaped backslashes, but in a quick look over the whole file I didn't see any backslashes at all. Other files do have backslashes and they read in fine. And these files were all created by json.dumps.
I can't post the file because it still has personal info in it. Manually attempting to validate the JSON of a 76,000-char file isn't really viable.
Thoughts on how to debug this would be appreciated. In the meantime I am going to try to rebuild the files and see if this wasn't just a one-off bug, but that takes a while.
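For what it's worth, here is the rough debugging approach I am considering (just a sketch, not from my working code; aggFile stands for the path of the file that fails): read the raw text, compare how many characters I actually got against the size of the file on disk, and print the text around the offset the error reports.

import os
import json

raw = open(aggFile).read()
print("chars read:    {}".format(len(raw)))
print("bytes on disk: {}".format(os.path.getsize(aggFile)))

try:
    data = json.loads(raw)
except ValueError as e:
    print(e)  # e.g. Unterminated string starting at: line 1 column 9997847 (char 9997846)
    pos = 9997846
    # show the text surrounding the offending character, if the string even reaches that far
    print(repr(raw[max(0, pos - 80):pos + 80]))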
--- Edit --- Per request, I am posting the write code here:
from p2p.basic import files as f
from p2p.adv import strTools as st
from p2p.basic import strTools as s
import os
import json
import copy
from datetime import datetime
import time
global leadDir
global archiveDir
global aggLeads
def aggregate_individual_lead_files():
    """
    """
    # Get the aggLead global and
    global aggLeads

    # Get all the Files with a 'lead' extension & aggregate them
    exts = [
        'lead',
        'partial',
        'inp',
        'err',
        'nobuyer',
        'prospect',
        'sent'
    ]
    for srchExt in exts:
        agg = {}
        leads = f.recursiveGlob(leadDir, '*.cd.' + srchExt)
        print "There are {} {} files to process".format(len(leads), srchExt)

        for lead in leads:
            # Get the Base Filename
            fname = f.basename(lead)
            #uniqID = st.fetchBefore('.', fname)

            #print "File: ", lead

            # Get Lead Data
            leadData = json.loads(f.file_get_contents(lead))

            agg = agg_data(leadData, agg, fname)

        aggLeads[srchExt] = copy.deepcopy(agg)

    print "Aggregate Top Lvl Keys: ", aggLeads.keys()
    print "Aggregate Next Lvl Keys: "
    for key in aggLeads:
        print "{}: ".format(key)

        for arcDate in aggLeads[key].keys():
            print "{}: {}".format(arcDate, len(aggLeads[key][arcDate]))

    # raw_input("Press Enter to continue...")


def agg_data(leadData, agg, fname=None):
    """
    """
    #print "Lead: ", leadData

    # Get the timestamp of the lead
    try:
        ts = leadData['timeStamp']
        leadData.pop('timeStamp')
    except KeyError:
        return agg

    leadDate = datetime.fromtimestamp(ts)
    arcDate = leadDate.strftime("%B_%Y")

    #print "Archive Date: ", arcDate

    try:
        agg[arcDate][ts] = leadData
    except KeyError:
        agg[arcDate] = {}
        agg[arcDate][ts] = leadData
    except TypeError:
        print "Timestamp: ", ts
        print "Lead: ", leadData
        print "Archive Date: ", arcDate
        return agg

    """
    if fname is not None:
        archive_lead(fname, arcDate)
    """
    #print "File: {} added to {}".format(fname, arcDate)
    return agg


def archive_lead(fname, arcDate):
    # Archive Path
    newArcPath = archiveDir + arcDate + '//'
    if not os.path.exists(newArcPath):
        os.makedirs(newArcPath)

    # Move the file to the archive
    os.rename(leadDir + fname, newArcPath + fname)


def reformat_old_agg_data():
    """
    """
    # Get the aggLead global and
    global aggLeads
    aggComplete = {}
    aggPartial = {}

    oldAggFiles = f.recursiveGlob(leadDir, '*.cd.agg')
    print "There are {} old aggregate files to process".format(len(oldAggFiles))

    for agg in oldAggFiles:
        tmp = json.loads(f.file_get_contents(agg))
        for uniqId in tmp:
            leadData = tmp[uniqId]
            if leadData['isPartial'] == True:
                aggPartial = agg_data(leadData, aggPartial)
            else:
                aggComplete = agg_data(leadData, aggComplete)

    arcData = dict(aggLeads['lead'].items() + aggComplete.items())
    aggLeads['lead'] = arcData

    arcData = dict(aggLeads['partial'].items() + aggPartial.items())
    aggLeads['partial'] = arcData


def output_agg_files():
    for ext in aggLeads:
        for arcDate in aggLeads[ext]:
            arcFile = leadDir + arcDate + '.cd.' + ext + '.agg'

            if f.file_exists(arcFile):
                tmp = json.loads(f.file_get_contents(arcFile))
            else:
                tmp = {}

            arcData = dict(tmp.items() + aggLeads[ext][arcDate].items())
            f.file_put_contents(arcFile, json.dumps(arcData))


def main():
    global leadDir
    global archiveDir
    global aggLeads

    leadDir = 'D://Server Data//eagle805//emmetrics//forms//leads//'
    archiveDir = leadDir + 'archive//'
    aggLeads = {}

    # Aggregate all the old individual files
    aggregate_individual_lead_files()

    # Reformat the old aggregate files
    reformat_old_agg_data()

    # Write it all out to an aggregate file
    output_agg_files()


if __name__ == "__main__":
    main()
Here is the read code:
from p2p.basic import files as f
from p2p.adv import strTools as st
from p2p.basic import strTools as s
import os
import json
import copy
from datetime import datetime
import time
global leadDir
global fields
global fieldTimes
global versions
def parse_agg_file(aggFile):
    global leadDir
    global fields
    global fieldTimes

    try:
        tmp = json.loads(f.file_get_contents(aggFile))
    except ValueError:
        print "{} failed the JSON load".format(aggFile)
        return False

    print "Opening: ", aggFile

    for ts in tmp:
        try:
            tmpTs = float(ts)
        except:
            print "Timestamp: ", ts
            continue

        leadData = tmp[ts]
        for field in leadData:
            if field not in fields:
                fields[field] = []
            fields[field].append(float(ts))


def determine_form_versions():
    global fieldTimes
    global versions

    # Determine all the fields and their start and stop times
    times = []
    for field in fields:
        minTs = min(fields[field])
        fieldTimes[field] = [minTs, max(fields[field])]
        times.append(minTs)
        print 'Min ts: {}'.format(minTs)

    times = set(sorted(times))
    print "Times: ", times
    print "Fields: ", fieldTimes

    versions = {}
    for ts in times:
        d = datetime.fromtimestamp(ts)
        ver = d.strftime("%d_%B_%Y")
        print "Version: ", ver
        versions[ver] = []
        for field in fields:
            if ts in fields[field]:
                versions[ver].append(field)


def main():
    global leadDir
    global fields
    global fieldTimes

    leadDir = 'D://Server Data//eagle805//emmetrics//forms//leads//'
    fields = {}
    fieldTimes = {}

    aggFiles = f.glob(leadDir + '*.lead.agg')

    for aggFile in aggFiles:
        parse_agg_file(aggFile)

    determine_form_versions()

    print "Versions: ", versions


if __name__ == "__main__":
    main()
So I figured it out... I'm posting this answer just in case someone else makes the same error.
First, I found a workaround, but I wasn't sure why it worked. From my original code, here is my file_get_contents function:
def file_get_contents(fname):
    if s.stripos(fname, 'http://'):
        import urllib2
        return urllib2.urlopen(fname).read(maxUrlRead)
    else:
        return open(fname).read(maxFileRead)
I used it via:
tmp = json.loads(f.file_get_contents(aggFile))
This failed, over and over and over again. However, as I was attempting to get Python to at least give me the JSON string to put through a JSON validator, I came across mention of json.load vs json.loads. So I tried this instead:
a = open('D://Server Data//eagle805//emmetrics//forms//leads\July_2014.cd.lead.agg')
b = json.load(a)
While I haven't tested this output in my overall code, this code chunk does in fact read in the file, decode the JSON, and even display the data without crashing Spyder. The variable explorer in Spyder shows that b is a dict of size 1465, and that is exactly how many records it should have. The portion of the displayed text from the end of the dict all looks good. So overall I have a reasonably high level of confidence that the data was parsed correctly.
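For anyone else who trips over this, the practical difference is that json.loads works on a string you have already read yourself, while json.load takes a file object and does the reading for you, so it does not depend on however many bytes you decided to pull in beforehand. Roughly (the path is just the one from my setup):

import json

path = 'D://Server Data//eagle805//emmetrics//forms//leads//July_2014.cd.lead.agg'

# json.loads: I do the reading, so a capped read() can hand it a truncated string
truncated = open(path).read(int(1E7))  # stops after 10,000,000 chars
# json.loads(truncated)                # -> ValueError: Unterminated string ...

# json.load: the json module reads the file object itself, in full
with open(path) as fh:
    data = json.load(fh)
print(len(data))  # 1465 records in my case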
When I wrote the file_get_contents function, I saw several recommendations to always provide a max number of bytes to read so as to prevent Python from hanging on a bad return. The value of maxFileRead was 1E7. When I manually forced maxFileRead to be 1E9, everything worked fine. It turns out the file is just under 1.2E7 bytes, so the resulting string from reading the file was not the full string in the file and as a result was invalid JSON.
Normally I would think this is a bug, but clearly when opening and reading a file you need to be able to read just a chunk at a time for memory management. So I got bit by my own shortsightedness with regard to the maxFileRead value. The error message was correct but sent me off on a wild goose chase.
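In hindsight, if I keep a read cap at all, a simple guard in file_get_contents would have surfaced the problem immediately instead of silently handing back a truncated string. A sketch of what I have in mind (not my final code):

import os

def file_get_contents(fname, max_read=int(1E7)):
    # refuse to silently return a truncated (and therefore invalid) JSON string
    size = os.path.getsize(fname)
    if size > max_read:
        raise ValueError("{} is {} bytes, over the {} byte read limit".format(fname, size, max_read))
    with open(fname) as fh:
        return fh.read()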
Hopefully this could save someone else some time.