
Fastest way to write huge data in file

I am trying to generate random real numbers, integers, alphanumeric strings, and alphabetical strings, and write them to a file until the file size reaches 10 MB.

The code is as follows.

import string
import random
import time
import sys


class Generator():
    def __init__(self):
        self.generate_alphabetical_strings()
        self.generate_integers()
        self.generate_alphanumeric()
        self.generate_real_numbers()

    def generate_alphabetical_strings(self):
        return ''.join(random.choice(string.ascii_lowercase) for i in range(12))

    def generate_integers(self):
        return ''.join(random.choice(string.digits) for i in range(12))

    def generate_alphanumeric(self):
        return ''.join(random.choice(self.generate_alphabetical_strings() +
                                     self.generate_integers()) for i in range(12))

    def _insert_dot(self, string, index):
        return string[:index].__add__('.').__add__(string[index:])


    def generate_real_numbers(self):
        rand_int_string = ''.join(random.choice(self.generate_integers()) for i in range(12))
        return self._insert_dot(rand_int_string, random.randint(0, 11))


from time import process_time
import os

a = Generator()

t = process_time()
inp = open("test.txt", "w")
lt = 10 * 1000 * 1000
count = 0
while count <= lt:
    inp.write(a.generate_alphanumeric())
    count += 39
inp.close()

elapsed_time = process_time() - t
print(elapsed_time)

It takes around 225.953125 seconds to complete. How can I improve the speed of this program? Please provide some code insights.

ajknzhol asked Dec 09 '14 16:12




3 Answers

You create tens of millions of short-lived objects which you then immediately throw away. In this case, it's probably better to write the strings directly into the file instead of concatenating them with ''.join().

Aaron Digulla answered Oct 17 '22 11:10


Two major reasons for observed "slowness":

  • Your while loop is slow: it runs about 256,000 iterations (10,000,000 / 39), and each iteration does a lot of work.
  • You do not make proper use of I/O buffering. Do not make so many write calls. Currently, you are calling write() about 256,000 times, each time with only 12 characters.
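As a rough check of the loop arithmetic in the question, here is a minimal sketch that simulates the counter without touching the disk. The constant names are illustrative and not from the original code. Assuming the loop is exactly as posted, it also shows that the file ends up at roughly 3 MB rather than 10 MB, because count advances by 39 per iteration while only 12 characters are written.

```python
# Simulate the question's loop arithmetic without writing anything.
LIMIT = 10 * 1000 * 1000   # `lt` in the question
STEP = 39                  # amount added to `count` per iteration
TOKEN_LEN = 12             # characters returned by generate_alphanumeric()

# The loop runs for count = 0, 39, 78, ... while count <= LIMIT.
iterations = LIMIT // STEP + 1
chars_written = iterations * TOKEN_LEN

print(iterations)      # 256411
print(chars_written)   # 3076932
```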

Create your data in a Python data structure first and call write() only once.

This is faster:

import random
import string
import time

t0 = time.time()
with open("bla.txt", "w") as f:
    f.write(''.join(random.choice(string.ascii_lowercase) for i in range(10**7)))
print("duration: %.2f s." % (time.time() - t0))

Output: duration: 7.30 s.

Now the program spends most of its time generating the data, i.e. in random stuff. You can easily see that by replacing random.choice(string.ascii_lowercase) with e.g. "a". Then the measured time drops to below one second on my machine.
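One way to see this split for yourself is to time the two phases separately. The following is a minimal sketch, not a benchmark: the size N is deliberately small so it finishes quickly, the temporary file path is an assumption for illustration, and the absolute numbers will differ from machine to machine.

```python
import os
import random
import string
import tempfile
import time

N = 10**5  # small on purpose; the answer above uses 10**7

# Phase 1: generate the data in memory.
t0 = time.perf_counter()
data = ''.join(random.choice(string.ascii_lowercase) for _ in range(N))
generate_s = time.perf_counter() - t0

# Phase 2: write it to disk with a single write() call.
path = os.path.join(tempfile.mkdtemp(), "bla.txt")
t0 = time.perf_counter()
with open(path, "w") as f:
    f.write(data)
write_s = time.perf_counter() - t0

print("generate: %.4f s, write: %.4f s" % (generate_s, write_s))
```

If the generation phase dominates, as described above, then optimizing the file I/O further will not help much.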

And if you want to get even closer to seeing how fast your machine really is when writing to disk, use Python's fastest (?) way to generate largish data before writing it to disk:

t0 = time.time()
chunk = "a" * 10**7
with open("bla.txt", "w") as f:
    f.write(chunk)
print("duration: %.2f s." % (time.time() - t0))

Output: duration: 0.02 s.
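Applied to the question's 12-character alphanumeric tokens, the same idea might look like the sketch below. It is an illustration under stated assumptions, not the original author's code: write_random_tokens, ALPHABET, and batch_tokens are hypothetical names, random.choices requires Python 3.6 or later, and drawing uniformly from all 36 characters gives a different letter/digit mix than the question's generator.

```python
import os
import random
import string
import tempfile

ALPHABET = string.ascii_lowercase + string.digits  # assumed character set
TOKEN_LEN = 12

def write_random_tokens(path, target_size, batch_tokens=10000):
    """Write random characters to `path` until exactly `target_size` are written."""
    written = 0
    with open(path, "w") as f:
        while written < target_size:
            # Draw a whole batch of characters with one random.choices() call,
            # capped so the file never exceeds the target size.
            k = min(TOKEN_LEN * batch_tokens, target_size - written)
            f.write(''.join(random.choices(ALPHABET, k=k)))
            written += k
    return written

demo_path = os.path.join(tempfile.mkdtemp(), "test.txt")
print(write_random_tokens(demo_path, 250000, batch_tokens=1000))
```

For the 10 MB case you would call write_random_tokens("test.txt", 10 * 1000 * 1000). Batching keeps memory use bounded while still reducing the number of write() calls to a handful.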
Dr. Jan-Philip Gehrcke answered Oct 17 '22 11:10


The while loop under main calls generate_alphanumeric, which chooses twelve characters, each out of a freshly generated random string composed of twelve ASCII letters and twelve digits. That's basically the same as choosing, twelve times, either a random letter or a random digit. That's your main bottleneck. This version will make your code one order of magnitude faster:

def generate_alphanumeric(self):
    res = ''
    for i in range(12):
        if random.randrange(2):
            res += random.choice(string.ascii_lowercase)
        else:
            res += random.choice(string.digits)
    return res

I'm sure it can be improved upon. I suggest you take your profiler for a spin.
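For the profiling step, a minimal sketch using the standard library's cProfile and pstats modules could look like this. The function is rewritten as a plain function for the sake of a self-contained example, and workload and its size are hypothetical names chosen for illustration.

```python
import cProfile
import io
import pstats
import random
import string

def generate_alphanumeric():
    # Stand-in for the method above, written as a plain function.
    res = ''
    for i in range(12):
        if random.randrange(2):
            res += random.choice(string.ascii_lowercase)
        else:
            res += random.choice(string.digits)
    return res

def workload(n=2000):
    # Hypothetical driver: generate n tokens so the profiler has work to measure.
    return [generate_alphanumeric() for _ in range(n)]

profiler = cProfile.Profile()
profiler.enable()
tokens = workload()
profiler.disable()

# Collect the ten most expensive entries by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
report = out.getvalue()
print(report)
```

The report lists call counts and cumulative time per function, which should make it clear whether the time goes into the random module or elsewhere.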

debiatan answered Oct 17 '22 12:10