Using Python to analyze a large set of sensor data

I'm a researcher new to Python, and I have to analyze a large dataset of raw sensor data stored in Excel format.

There is one Excel file per study participant, and each file is larger than 100 MB. Each file contains 5 sheets, one for each of 5 physiological parameters. Each sheet contains more than 1 million rows in two columns (time, physiological parameter).

After 1 million rows of sensor data, the data automatically continues in the next pair of columns (C and D) of the same sheet.

Every time I try to load a data file in Python, it takes forever. I have several questions:

1) How do I tell Python to read data from a specific Excel sheet? Is it normal that this takes so long?

This is what I tried:

import pandas as pd

df = pd.read_excel("filepath", sheet_name="Sheetname")
print(df.head(5))

2) Is it feasible to do data munging on a data file this large in Python with pandas? I tried this to speed up the process:

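import xlrd

# on_demand=True loads each sheet only when it is requested
work_book = xlrd.open_workbook('filepath', on_demand=True)
# Free the memory held by the workbook once it is no longer needed
work_book.release_resources()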

3) Later on: I want to compare the physiological parameters of different study participants. As this is a time-series analysis between study participants, how could I get started doing this in Python?

I've learned the basics of Python in a few days, and I love it so far. I realize I have a long way to go.

Update: I think I just finished the time-series analysis (actually just the trend analysis, using the Dickey-Fuller test and rolling-mean visualisation techniques)! :D Thank you all so much for your help!!! Datetime handling in pandas was the hardest part for me to get my head around, and my datetime column is still recognized as 'object'. Is this normal? Shouldn't it be datetime64?

asked Oct 29 '18 20:10

Sam Floral

2 Answers

IIUC, it doesn't sound like the Excel files will keep changing, so you don't need to read them in again every time you work with the data. I would recommend reading each Excel sheet once, as you have done, and storing the result as a serialized pandas DataFrame using to_pickle():

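import pandas as pd

participants = ['P1', 'P2', 'P3']
physios = ['Ph1', 'Ph2', 'Ph3', 'Ph4', 'Ph5']

for p in participants:
    for ph in physios:
        # Read one sheet from this participant's workbook (slow, done once)
        df = pd.read_excel(p + '.xlsx', sheet_name=ph)
        # Save it as its own pickle file, e.g. P1_Ph1.pkl
        df.to_pickle(p + '_' + ph + '.pkl')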

You can now read these pickled dataframes much more efficiently since you don't have to incur all of the Excel overhead. A good discussion is available here.
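
To make the round trip concrete, here is a minimal sketch under a few assumptions. The question mentions that readings past the first million rows continue in columns C and D, so the sketch includes a hypothetical helper, stack_wrapped_columns, that stacks such column pairs into one long two-column frame before pickling. The column names, the tiny stand-in data, and the temporary file path are all made up for illustration; pd.read_pickle is the standard counterpart to to_pickle().

```python
import os
import tempfile

import pandas as pd


def stack_wrapped_columns(wide: pd.DataFrame) -> pd.DataFrame:
    """Stack consecutive (time, value) column pairs into one long frame.

    Hypothetical helper: assumes columns 0-1 hold the first block of
    readings, columns 2-3 the overflow block, and so on.
    """
    pairs = []
    for start in range(0, wide.shape[1] - 1, 2):
        pair = wide.iloc[:, start:start + 2].copy()
        pair.columns = ["time", "value"]
        # The overflow pair is usually shorter, so drop its empty padding rows.
        pairs.append(pair.dropna(how="all"))
    return pd.concat(pairs, ignore_index=True)


# Tiny stand-in for one sheet: three readings in A:B, two more in C:D.
wide = pd.DataFrame({
    "time": [0.0, 0.1, 0.2],
    "value": [60.0, 61.0, 62.0],
    "time.1": [0.3, 0.4, None],
    "value.1": [63.0, 64.0, None],
})
long_df = stack_wrapped_columns(wide)

# Pickle once, then reload from the pickle in later sessions.
pkl_path = os.path.join(tempfile.mkdtemp(), "P1_Ph1.pkl")
long_df.to_pickle(pkl_path)
restored = pd.read_pickle(pkl_path)
print(restored.shape)
```

Only load pickle files that you created yourself, since unpickling data from an untrusted source can execute arbitrary code.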

answered Oct 23 '22 08:10

rahlf23

The dataset you are describing sounds like the sort of problem targeted by the dask project. It lets you run most of the standard pandas commands in parallel and out-of-core, i.e. on data that is larger than memory.

The only problem is that dask doesn't have an Excel reader, from what I can tell. Since your question suggests the data don't fit in memory, you might want to manually convert the data to CSV in Excel; then you can simply:

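# After pip install dask
import dask.dataframe as dd
df = dd.read_csv("./relpath/to/csvs/*.csv")
# Do data munging here (operations stay lazy until compute() is called)
result = df.compute()  # runs the work and returns a regular pandas DataFrame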

If that doesn't work, it may be better to load the data into Spark or a database and do the transforms there.
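
As a rough illustration of the database route, the sketch below loads a CSV into SQLite in small chunks and lets the database compute a summary. It is only one possible approach, not a recommendation of a specific setup: the table name readings, the column names, the chunk size, and the in-memory stand-in CSV are all assumptions for the example.

```python
import io
import sqlite3

import pandas as pd

# Stand-in for a CSV exported from one Excel sheet (hypothetical columns).
csv_text = "time,value\n" + "\n".join(
    f"{i * 0.5},{60 + i % 5}" for i in range(10)
)

con = sqlite3.connect(":memory:")  # pass a file path to keep the data on disk

# Read the CSV in small chunks so only one chunk is in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunk["participant"] = "P1"
    chunk.to_sql("readings", con, if_exists="append", index=False)

# Let the database do the aggregation and return only the small result.
summary = pd.read_sql_query(
    "SELECT participant, COUNT(*) AS n, AVG(value) AS mean_value "
    "FROM readings GROUP BY participant",
    con,
)
print(summary)
con.close()
```

With real files you would pass the CSV path to pd.read_csv and pick a much larger chunksize.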

Re: your question about time-series, start by reading the docs on this subject here.
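
For a first step in code, here is a minimal sketch assuming each participant's data has a time column stored as strings and a value column (both names, the sample timestamps, and the helper to_series are hypothetical). It converts the strings with pd.to_datetime, which is also what turns an 'object' column into datetime64, and then resamples both participants onto a shared 1-minute grid so they can be compared side by side.

```python
import pandas as pd


def to_series(df: pd.DataFrame, name: str) -> pd.Series:
    """Return the 'value' column indexed by a real datetime64 index."""
    out = df.copy()
    # Strings stay dtype 'object' until they are converted explicitly.
    out["time"] = pd.to_datetime(out["time"])
    return out.set_index("time")["value"].rename(name)


# Hypothetical readings for two participants (timestamps stored as strings).
p1 = pd.DataFrame({
    "time": ["2018-10-29 20:00:00", "2018-10-29 20:00:30",
             "2018-10-29 20:01:00", "2018-10-29 20:01:30"],
    "value": [60.0, 62.0, 64.0, 66.0],
})
p2 = pd.DataFrame({
    "time": ["2018-10-29 20:00:10", "2018-10-29 20:00:40",
             "2018-10-29 20:01:10", "2018-10-29 20:01:40"],
    "value": [70.0, 72.0, 71.0, 73.0],
})

# Resample both onto a shared 1-minute grid so the rows line up.
combined = pd.concat(
    [to_series(p1, "P1").resample("1min").mean(),
     to_series(p2, "P2").resample("1min").mean()],
    axis=1,
)
print(combined.dtypes)
print(combined)
```

From here, calls such as combined.rolling(2).mean() or combined.corr() work column-wise across participants.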

answered Oct 23 '22 07:10

Charles Landau

Tags: python, pandas, excel, sensors
