 

Streaming data for pandas df

I'm attempting to simulate the use of pandas to access a constantly changing file.

I have one script that reads a CSV file, appends a line to it, then sleeps for a random interval to simulate bulk input.

import pandas as pd
from time import sleep
import random

df2 = pd.DataFrame(data=[['test', 'trial']], index=None)

while True:
    df = pd.read_csv('data.csv', header=None)
    df = pd.concat([df, df2], ignore_index=True)  # append returns a new frame; reassign it
    # header=False, or the next header=None read would ingest "0,1" as a data row
    df.to_csv('data.csv', index=False, header=False)
    sleep(random.uniform(0.025, 0.3))

The second script checks for changes in the data by printing the shape of the dataframe:

import pandas as pd

while True:
    df = pd.read_csv('data.csv', header=None, names=['Name','DATE'])
    print(df.shape)

The problem is that while I usually get the correct shape of the dataframe, it sometimes outputs (0x2).

i.e.:

...
(10x2)
(10x2)
...
(10x2)
(0x2)
(11x2)
(11x2)
...

This happens only occasionally, not between every change in shape (every addition to the dataframe).

Knowing this happens when the first script has the file open to add data and the second script is unable to access it, hence (0x2), will this cause any data loss?

I cannot directly access the stream, only the output file. Are there any other possible solutions?

Edit

The purpose of this is to load only the new data (I already have code that does that) and do analysis on the fly. The analysis will include output per second, graphing (similar to a streaming plot), and a few other numerical calculations.

The biggest issue is that I have access to the csv file only, and I need to be able to analyze the data as it comes without loss or delay.
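For the "load the new data only" part: one approach (a sketch of my own, assuming the writer only ever appends to the file; the `read_new_rows` helper name is mine, not from the question) is to remember the byte offset after each read and parse only the bytes appended since then, so each poll pays for the new rows rather than re-reading the whole file:

```python
import pandas as pd
from io import StringIO

def read_new_rows(path, offset, names):
    """Read only the rows appended to a CSV since the last call.

    Returns (DataFrame of new rows, new byte offset to pass next time).
    """
    with open(path, 'r') as f:
        f.seek(offset)           # skip everything already processed
        chunk = f.read()         # only the newly appended text
        new_offset = f.tell()
    if not chunk.strip():
        # nothing new since last poll
        return pd.DataFrame(columns=names), new_offset
    df = pd.read_csv(StringIO(chunk), header=None, names=names)
    return df, new_offset
```

Called in a polling loop (ideally under the same lock as the writer), the first call with `offset=0` returns everything, and later calls return only the delta, which is what per-second statistics or a streaming plot would consume.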

asked Sep 15 '15 by Leb

1 Answer

One of the scripts is reading the file while the other is trying to write to it, and the two cannot safely access the file at the same time. As Padraic Cunningham says in the comments, you can implement a lock file to solve this problem.

There is a Python package that does just that, called lockfile; see its documentation.

Here is your first script with the lockfile package implemented:

import pandas as pd
from time import sleep
import random
from lockfile import FileLock

df2 = pd.DataFrame(data=[['test', 'trial']], index=None)
lock = FileLock('data.lock')

while True:
    with lock:
        df = pd.read_csv('data.csv', header=None)
        df = pd.concat([df, df2], ignore_index=True)  # append returns a new frame; reassign it
        # header=False, or the next header=None read would ingest "0,1" as a data row
        df.to_csv('data.csv', index=False, header=False)
    sleep(random.uniform(0.025, 0.3))

Here is your second script with the lockfile package implemented:

import pandas as pd
from time import sleep
from lockfile import FileLock

lock = FileLock('data.lock')

while True:
    with lock:
        df = pd.read_csv('data.csv', header=None, names=['Name','DATE'])
    print(df.shape)
    sleep(0.100)

I added a wait of 100ms so that I could slow down the output to the console.

These scripts create a file called "data.lock" before accessing "data.csv" and delete "data.lock" after they are done with it. In either script, if the "data.lock" file already exists, the script waits until it no longer does.
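That create-then-delete protocol can be sketched with the standard library alone (a simplified illustration of the idea, not a replacement for the lockfile package, which also handles stale locks and timeouts; the `acquire`/`release` names are mine):

```python
import os
import time

def acquire(lock_path, poll=0.01):
    """Spin until the lock file can be created atomically."""
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists,
            # so creating it doubles as an atomic test-and-set.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL)
            os.close(fd)
            return
        except FileExistsError:
            time.sleep(poll)  # another process holds the lock

def release(lock_path):
    """Delete the lock file so a waiting process can proceed."""
    os.remove(lock_path)
```

The atomicity of `O_CREAT | O_EXCL` is what makes this safe: two processes cannot both succeed in creating the same file, so only one at a time gets past `acquire`.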

answered Oct 04 '22 by Joshua Goldberg