Reading a file with a specified delimiter for newline

I have a file in which lines are separated by a delimiter, say '.'. I want to read this file line by line, where a line ends at each '.' instead of at a newline.

One way is:

with open('file', 'r') as f:
    for line in f.read().strip().split('.'):
        pass  # ... do some work

But this is not memory efficient if the file is very large. Instead of reading the whole file at once, I want to read it line by line.

open supports a 'newline' parameter, but it only accepts None, '', '\n', '\r', and '\r\n', as mentioned in the documentation.
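As a quick check of that restriction, here is a minimal sketch that passes '.' as newline to io.TextIOWrapper, the text layer that open() builds on in text mode, and catches the resulting error. The in-memory BytesIO stream and its contents are only a stand-in for a real file.

```python
import io

# In-memory stand-in for a real file; the content is illustrative only.
raw = io.BytesIO(b"first.second.third")

try:
    # open() in text mode applies the same validation to `newline`.
    io.TextIOWrapper(raw, newline=".")
    accepted = True
except ValueError:
    # Only None, '', '\n', '\r' and '\r\n' are allowed.
    accepted = False

print(accepted)  # False
```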

Is there any way to read a file line by line efficiently, but with lines split on a pre-specified delimiter?

Abhishek Gupta asked Apr 28 '13 05:04



2 Answers

You could use a generator:

def myreadlines(f, newline):
  buf = ""
  while True:
    # Emit every complete line currently in the buffer.
    while newline in buf:
      pos = buf.index(newline)
      yield buf[:pos]
      buf = buf[pos + len(newline):]
    # Refill the buffer; at end of file, emit whatever is left.
    chunk = f.read(4096)
    if not chunk:
      yield buf
      break
    buf += chunk

with open('file') as f:
  for line in myreadlines(f, "."):
    print(line)
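To try the idea without a file on disk, here is a minimal sketch of a variant of the generator above. The name read_records and the chunk_size parameter are hypothetical additions for illustration. Unlike the original, it skips an empty trailing record when the input ends with the delimiter.

```python
import io


def read_records(f, delimiter, chunk_size=4096):
    """Yield delimiter-separated records from a text stream, chunk by chunk."""
    buf = ""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        # Everything before the last delimiter is complete;
        # the final piece may continue in the next chunk.
        *records, buf = buf.split(delimiter)
        yield from records
    if buf:
        yield buf  # trailing text with no closing delimiter


# A tiny chunk size forces records to span several reads.
sample = io.StringIO("alpha.beta.gamma.")
records = list(read_records(sample, ".", chunk_size=4))
print(records)  # ['alpha', 'beta', 'gamma']
```

Because the unfinished piece is carried over in buf, a multi-character delimiter that straddles two chunks is still found once the next chunk arrives.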
NPE answered Oct 07 '22 21:10


Here is a more efficient answer, using FileIO and bytearray, that I used for parsing a PDF file:

import io
import re


# the end-of-line chars, separated by a `|` (logical OR)
EOL_REGEX = b'\r\n|\r|\n'  

# the end-of-file char
EOF = b'%%EOF'



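def readlines(fio):
    buf = bytearray(4096)
    tail = b''  # unfinished line carried over from the previous chunk
    while True:
        n = fio.readinto(buf)  # number of bytes actually read; 0 at end of file
        data = tail + bytes(buf[:n])
        pos = data.find(EOF)
        if pos != -1 or not n:
            # Stop at the EOF marker, or at the real end of the file.
            yield from re.split(EOL_REGEX, data if pos == -1 else data[:pos])
            return
        # Hold back a trailing '\r': the matching '\n' may open the next chunk.
        cut = len(data) - data.endswith(b'\r')
        *lines, tail = re.split(EOL_REGEX, data[:cut])
        tail += data[cut:]
        yield from lines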


with io.FileIO("test.pdf") as fio:
    for line in readlines(fio):
        ...

The above example also handles a custom EOF. If you don't want that, use this:

import io
import os
import re


# the end-of-line chars, separated by a `|` (logical OR)
EOL_REGEX = b'\r\n|\r|\n'  


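def readlines(fio, size):
    buf = bytearray(4096)
    tail = b''  # unfinished line carried over from the previous chunk
    while fio.tell() < size:
        n = fio.readinto(buf)  # number of bytes actually read
        if not n:
            break  # the file is shorter than `size`
        data = tail + bytes(buf[:n])
        # Hold back a trailing '\r': the matching '\n' may open the next chunk.
        cut = len(data) - data.endswith(b'\r')
        *lines, tail = re.split(EOL_REGEX, data[:cut])
        tail += data[cut:]
        yield from lines
    yield from re.split(EOL_REGEX, tail)  # whatever follows the last EOL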

size = os.path.getsize("test.pdf")
with io.FileIO("test.pdf") as fio:
    for line in readlines(fio, size):
         ...
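The examples above split on end-of-line sequences. For the custom delimiter the question asks about, the same readinto approach could be adapted roughly as follows. This is a sketch under assumptions: read_delimited, bufsize, and the ';;' delimiter are hypothetical names and values, and io.BytesIO stands in for io.FileIO so the example runs without a file.

```python
import io


def read_delimited(raw, delimiter, bufsize=4096):
    """Yield delimiter-separated byte records from a binary stream."""
    buf = bytearray(bufsize)  # reused for every read
    tail = b""                # unfinished record from the previous chunk
    while True:
        n = raw.readinto(buf)  # number of bytes actually read
        if not n:
            break
        # Only the first n bytes of buf are fresh data.
        data = tail + bytes(buf[:n])
        *records, tail = data.split(delimiter)
        yield from records
    if tail:
        yield tail  # trailing bytes with no closing delimiter


# A small buffer makes the two-byte delimiter straddle chunk boundaries.
stream = io.BytesIO(b"one;;two;;three")
parts = list(read_delimited(stream, b";;", bufsize=4))
print(parts)  # [b'one', b'two', b'three']
```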
Dev Aggarwal answered Oct 07 '22 20:10