How to quickly get the last line of a huge csv file (48M lines)? [duplicate]

I have a CSV file that grows until it reaches approximately 48M lines.

Before adding new lines to it, I need to read the last line.

I tried the code below, but it got too slow and I need a faster alternative:

def return_last_line(filepath):    
    with open(filepath,'r') as file:        
        for x in file:
            pass
        return x        
return_last_line('lala.csv')
Luiz Fernando asked Mar 06 '21 15:03



6 Answers

Here is my take, in Python. I wrote a function that lets you choose how many trailing lines to return, because the last line may be empty.

def get_last_line(path, how_many_last_lines=1):

    # open in binary mode: text-mode files only support seeking to
    # offsets returned by tell(), so arbitrary byte offsets need 'rb'
    with open(path, 'rb') as file:

        # find the position of the end of the file
        end_of_file = file.seek(0, 2)

        # step back one byte at a time from the end of the file
        last_line = b''
        for num in range(1, end_of_file + 1):
            file.seek(end_of_file - num)

            # read everything from the current position to the end
            last_line = file.read()

            # count the '\n' bytes in the tail:
            # 1 means the tail holds the last line, 2 the last two lines
            if last_line.count(b'\n') == how_many_last_lines:
                return last_line.decode()

        # fewer newlines than requested: the tail is the whole file
        return last_line.decode()
get_last_line('lala.csv', 2)

My lala.csv has 48 million lines, as in your example. It took me 0 seconds to get the last line.

Sergio Marinho answered Oct 18 '22 05:10


Here is code for finding the last line of a file using mmap. It should work on Unixen and derivatives and on Windows alike, i.e. pretty much everywhere it matters (I've tested this on Linux only; please tell me if it works on Windows too ;). Since it uses memory-mapped I/O, it can be expected to perform well.

It expects that you can map the entire file into the address space of a process. That should be fine for a 50M file everywhere, but for a 5G file you'd need a 64-bit processor or some extra slicing.

import mmap


def iterate_lines_backwards(filename):
    with open(filename, "rb") as f:
        # memory-map the file, size 0 means whole file
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = len(mm)

            while start > 0:
                start, prev = mm.rfind(b"\n", 0, start), start
                slice = mm[start + 1:prev + 1]
                # if the last character in the file was a '\n',
                # technically the empty string after that is not a line.
                if slice:
                    yield slice.decode()


def get_last_nonempty_line(filename):
    for line in iterate_lines_backwards(filename):
        if stripped := line.rstrip("\r\n"):
            return stripped


print(get_last_nonempty_line("datafile.csv"))

As a bonus there is a generator iterate_lines_backwards that would efficiently iterate over the lines of a file in reverse for any number of lines:

print("Iterating the lines of datafile.csv backwards")
for l in iterate_lines_backwards("datafile.csv"):
    print(l, end="")
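
The "extra slicing" mentioned above could look roughly like the sketch below, which maps only the end of the file instead of the whole thing. The function name last_line_from_tail and the window parameter are hypothetical, and the sketch assumes the last non-empty line fits inside the mapped window. The mmap offset has to be a multiple of mmap.ALLOCATIONGRANULARITY, so the start of the window is rounded down to that boundary.

```python
import mmap
import os


def last_line_from_tail(filename, window=1 << 20):
    """Return the last non-empty line, mapping only the end of the file.

    Hypothetical helper: assumes the last non-empty line fits in `window` bytes.
    Returns None for an empty file or when the window holds only newlines.
    """
    size = os.path.getsize(filename)
    if size == 0:
        # mmap refuses to map an empty file, so bail out early
        return None
    # the offset passed to mmap must be a multiple of ALLOCATIONGRANULARITY
    granularity = mmap.ALLOCATIONGRANULARITY
    offset = max(0, size - window) // granularity * granularity
    with open(filename, "rb") as f:
        with mmap.mmap(f.fileno(), size - offset,
                       access=mmap.ACCESS_READ, offset=offset) as mm:
            # copy the mapped tail and drop trailing line terminators
            tail = mm[:].rstrip(b"\r\n")
            # everything after the last remaining newline is the last line
            return tail[tail.rfind(b"\n") + 1:].decode() or None
```

Only the mapped window is touched, so this variant does not need the whole file to fit into the address space.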

This is generally a rather tricky thing to do. A very efficient way of getting a chunk that includes the last lines is the following:

import os


def get_last_lines(path, offset=500):
    """ An efficient way to get the last lines of a file.

    IMPORTANT:
    1. Choose offset to be greater than
    max_line_length * number of lines that you want to recover.
    2. This will raise an OSError if the file is shorter than
    the offset.
    """
    with path.open("rb") as f:
        f.seek(-offset, os.SEEK_END)
        while f.read(1) != b"\n":
            f.seek(-2, os.SEEK_CUR)
        return f.readlines()

You need to know the maximum line length, though, and make sure that the file is at least offset bytes long!

To use it, do the following:

from pathlib import Path


n_last_lines = 10
last_bit_of_file = get_last_lines(Path("/path/to/my/file"))
real_last_n_lines = last_bit_of_file[-n_last_lines:]

Finally, you need to decode the bytes to strings:

real_last_n_lines_non_binary = [x.decode() for x in real_last_n_lines]

Probably all of this could be wrapped in one more convenient function.
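
One possible shape for such a wrapper is sketched below. It is not the code above repackaged: instead of a fixed offset it reads fixed-size chunks backwards until enough newlines are buffered, so the maximum line length does not need to be known. The name read_last_lines and the chunk_size and encoding parameters are assumptions for illustration, and the sketch treats every newline as a row boundary, so quoted CSV fields containing newlines are not handled.

```python
import os


def read_last_lines(path, n=1, chunk_size=4096, encoding="utf-8"):
    """Return the last n lines of a file as decoded strings (hypothetical helper)."""
    if n < 1:
        return []
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        data = b""
        # read chunks backwards until the buffer holds more than n newlines
        # (so the first of the n lines is complete) or the file start is reached
        while position > 0 and data.count(b"\n") <= n:
            step = min(chunk_size, position)
            position -= step
            f.seek(position)
            data = f.read(step) + data
    # only the last n lines are decoded, and those are complete
    return [line.decode(encoding) for line in data.splitlines()[-n:]]
```

Calling read_last_lines("/path/to/my/file", n=10) would then return the last ten lines as a list of strings, without line terminators.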

kuropan answered Oct 18 '22 05:10


You could additionally store the last line in a separate file, which you update whenever you add new lines to the main file.
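
A minimal sketch of that idea follows, assuming all writes to the CSV go through one helper. The names append_rows and read_last_line and the ".last" sidecar suffix are hypothetical. Note that the sidecar can fall out of sync if the program dies between the two writes, so this is only a starting point.

```python
import os


def append_rows(csv_path, rows, last_line_path=None):
    """Append rows (strings without a trailing newline) and record the last one.

    Hypothetical helper; the ".last" sidecar suffix is an assumption.
    """
    if not rows:
        return
    last_line_path = last_line_path or os.fspath(csv_path) + ".last"
    with open(csv_path, "a", encoding="utf-8", newline="") as f:
        f.writelines(row + "\n" for row in rows)
    # write the sidecar to a temporary file, then rename it into place;
    # on POSIX the rename is atomic, so readers never see a partial line
    tmp_path = last_line_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8", newline="") as f:
        f.write(rows[-1])
    os.replace(tmp_path, last_line_path)


def read_last_line(csv_path, last_line_path=None):
    """Return the recorded last line, or None if nothing was recorded yet."""
    last_line_path = last_line_path or os.fspath(csv_path) + ".last"
    try:
        with open(last_line_path, encoding="utf-8", newline="") as f:
            return f.read()
    except FileNotFoundError:
        return None
```

Reading the last line then costs one small file read, regardless of how large the main CSV has grown.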

Manuel answered Oct 18 '22 05:10


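If you are running your code in a Unix-based environment, you can run the tail shell command from Python to read the last line:

import subprocess

# capture_output=True collects tail's stdout instead of just printing it
result = subprocess.run(['tail', '-n', '1', '/path/to/lala.csv'],
                        capture_output=True, text=True, check=True)
last_line = result.stdout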
Shiva answered Oct 18 '22 04:10


This works well for me:
https://pypi.org/project/file-read-backwards/

from file_read_backwards import FileReadBackwards

with FileReadBackwards("/tmp/file", encoding="utf-8") as frb:

    # iterate over the lines, starting from the last one and moving up
    for l in frb:
        if l:
            print(l)
            break
Carbon_Unit answered Oct 18 '22 03:10