Reading Very Large One Liner Text File

I have a 30 MB .txt file containing a single line of data (a 30-million-digit number).
Unfortunately, every method I've tried (mmap.read(), readline(), allocating 1 GB of RAM, for loops) takes 45+ minutes to read the file completely. Every method I found on the internet seems to rely on each line being small, so that memory consumption is only as large as the longest line in the file. Here's the code I've been using:

import time
import mmap

f = open('log.txt', 'a')        # assumed: the original snippet writes timings to a log file f opened elsewhere

start = time.clock()
z = open('Number.txt','r+')
m = mmap.mmap(z.fileno(), 0)    # memory-map the whole file
global a
a = int(m.read())               # read every digit and convert the string to an int
z.close()
end = time.clock()
secs = (end - start)
print("Number read in","%s" % (secs),"seconds.", file=f)
print("Number read in","%s" % (secs),"seconds.")
f.flush()
del end,start,secs,z,m

Other than splitting the number across multiple lines, which I'd rather not do, is there a cleaner method that won't require the better part of an hour?

By the way, I don't necessarily have to use text files.

I have: Windows 8.1 64-Bit, 16GB RAM, Python 3.5.1

asked Apr 24 '16 by Master-chip



1 Answer

The file read is quick (<1s):

with open('number.txt') as f:
    data = f.read()
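
For instance, here's one way to check that claim yourself (a rough sketch; it just wraps the same read in a timer):

import time

t = time.perf_counter()
with open('number.txt') as f:
    data = f.read()
print(len(data), 'characters read in', time.perf_counter() - t, 'seconds')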

Converting a 30-million-digit decimal string to an integer is the slow part, because CPython's string-to-int conversion takes time roughly quadratic in the number of digits:

z = int(data)  # still waiting...
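
To see the scale of the difference, here is a minimal, self-contained sketch (mine, with illustrative sizes) comparing the two conversions on a 1-million-digit number; the sys.set_int_max_str_digits call only exists on Python 3.11+, where the default string-to-int limit would otherwise block the decimal parse:

import os
import sys
import time

# Python 3.11+ caps decimal-string-to-int conversion at 4300 digits by default;
# lift the limit so the comparison runs (no-op on older versions).
if hasattr(sys, 'set_int_max_str_digits'):
    sys.set_int_max_str_digits(0)

digits = '7' * 1000000      # a 1-million-digit decimal string
raw = os.urandom(415242)    # a number of similar magnitude as raw bytes (~3.32 bits per decimal digit)

t = time.perf_counter()
int(digits)                 # decimal parse: roughly quadratic in the digit count
print('int(decimal string):', time.perf_counter() - t, 'seconds')

t = time.perf_counter()
int.from_bytes(raw, 'big')  # binary decode: roughly linear
print('int.from_bytes:     ', time.perf_counter() - t, 'seconds')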

If you store the number as raw big- or little-endian binary data, then int.from_bytes(data,'big') is much quicker.
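
A minimal sketch of that workflow, assuming the binary copy lives in a file named Number.bin (any name works): pay the slow decimal parse once, save the integer as raw bytes, and reload it with int.from_bytes on every later run (on Python 3.11+ the one-time int() call also needs the conversion limit lifted, as above):

# One-time conversion: parse the decimal text file (slow) and save the raw bytes.
n = int(open('Number.txt').read())
with open('Number.bin', 'wb') as out:
    out.write(n.to_bytes((n.bit_length() + 7) // 8, 'big'))

# Every later run: reload the number from the binary file, which takes well under a second.
with open('Number.bin', 'rb') as inp:
    a = int.from_bytes(inp.read(), 'big')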

If I did my math right (note: _ means "the previous result" in Python's interactive interpreter):

>>> import math
>>> math.log(10)/math.log(2)  # Number of bits to represent a base 10 digit.
3.3219280948873626
>>> 30000000*_                # Number of bits to represent 30M-digit #.
99657842.84662087
>>> _/8                       # Number of bytes to represent 30M-digit #.
12457230.35582761             # Only ~12MB so file will be smaller :^)
>>> import os
>>> data=os.urandom(12457231) # Generate some random bytes
>>> z=int.from_bytes(data,'big')  # Convert to integer (<1s)
>>> z.bit_length()            # Number of bits in the converted integer.
99657848
>>> math.log10(z)   # number of base-10 digits in number.
30000001.50818886

EDIT: FYI, my math wasn't right, but I fixed it. Thanks for 10 upvotes without noticing :^)

answered Oct 19 '22 by Mark Tolonen