I'm a researcher new to Python, and I have to analyze a large dataset that contains raw sensor data in Excel format.
Each Excel file is >100 MB per study participant. The file contains 5 sheets for the measurement of 5 different physiological parameters. Each sheet contains more than 1 million rows and two columns (time, physiological parameter).
After 1 million rows of sensordata, the data automatically continues in the following columns (C and D) in the Excel file.
Every time I try to load the data file in Python, it takes forever. I was wondering several things:
1) How do I tell Python to read data from a specific Excel sheet? Is it normal that this takes so long?
This is what I tried:
import pandas as pd

df = pd.read_excel("filepath", sheet_name="Sheetname")
print(df.head(5))
2) Is it feasible to do data munging for this large data file in Python with pandas? I tried this to speed up the process:
import xlrd

# on_demand=True defers loading worksheets until they are requested
work_book = xlrd.open_workbook('filepath', on_demand=True)
work_book.release_resources()
3) Later on: I want to compare the physiological parameters of different study participants. As this is a time-series analysis between study participants, how could I get started doing this in Python?
I've learned the basics of Python in a few days, and I love it so far. I realize I have a long way to go.
Update: I think I just finished the time-series analysis (actually just the trend analysis, using the Dickey-Fuller test and rolling-mean visualisation techniques)! :D Thank you all so much for your help!!! Datetime handling in pandas was the hardest part for me to get my head around, and my datetime column is still recognized as 'object'. Is this normal? Shouldn't it be datetime64?
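On the dtype question from the update: a parsed timestamp column normally shows up as datetime64[ns]; an object dtype usually means the values are still plain strings. A minimal sketch of an explicit conversion, assuming the timestamp column is called 'time' (the sample values here are placeholders):

import pandas as pd

# stand-in for one participant's sheet with columns ['time', 'value']
df = pd.DataFrame({'time': ['2020-01-01 00:00:00', '2020-01-01 00:00:01'],
                   'value': [0.1, 0.2]})

# parse the string timestamps; errors='coerce' turns unparseable values into NaT
df['time'] = pd.to_datetime(df['time'], errors='coerce')

print(df.dtypes)  # 'time' should now be datetime64[ns]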
IIUC, it doesn't sound like you will need to continually re-read the data from changing Excel sheets. I would recommend reading in the Excel sheets as you have done and storing them as serialized pandas dataframes using to_pickle():
import pandas as pd

participants = ['P1', 'P2', 'P3']
physios = ['Ph1', 'Ph2', 'Ph3', 'Ph4', 'Ph5']

for p in participants:
    for ph in physios:
        df = pd.read_excel(p + '.xlsx', sheet_name=ph)
        df.to_pickle(p + '_' + ph + '.pkl')
You can now read these pickled dataframes much more efficiently since you don't have to incur all of the Excel overhead. A good discussion is available here.
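For later sessions, the pickled frames can be read back and combined without touching Excel again; a small sketch reusing the naming scheme above (participant and parameter codes are placeholders):

import pandas as pd

participants = ['P1', 'P2', 'P3']
physios = ['Ph1', 'Ph2', 'Ph3', 'Ph4', 'Ph5']

# load each pickled sheet; this skips the Excel parsing overhead entirely
frames = {}
for p in participants:
    for ph in physios:
        frames[(p, ph)] = pd.read_pickle(p + '_' + ph + '.pkl')

# e.g. one physiological parameter across all participants,
# stacked with the participant code as an extra index level
ph1_all = pd.concat({p: frames[(p, 'Ph1')] for p in participants})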
The dataset you are describing sounds like it's the sort of problem targeted by the dask project. It lets you use most of the standard pandas commands in parallel, out-of-memory.
The only problem is, dask doesn't have an Excel reader as far as I can tell. Since your question suggests the data don't fit in memory, you might want to manually convert the data to CSV in Excel, and then you can simply:
# After pip install dask
import dask.dataframe as dd
df = dd.read_csv("./relpath/to/csvs/*.csv")
# Do data munging here
df.compute()
If that doesn't work, it may be better to load the data into Spark or a database and do the transforms there.
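If the database route is taken, a lightweight option is SQLite through pandas; a minimal sketch, assuming the pickled files from the earlier answer (file and table names are placeholders):

import sqlite3
import pandas as pd

con = sqlite3.connect('sensors.db')

# write one sheet per table; if_exists='replace' recreates the table on re-runs
df = pd.read_pickle('P1_Ph1.pkl')
df.to_sql('P1_Ph1', con, if_exists='replace', index=False)

# transforms can then be expressed in SQL instead of holding everything in memory
sample = pd.read_sql('SELECT * FROM P1_Ph1 LIMIT 1000', con)
con.close()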
Re: your question about time-series, start by reading the docs on this subject here.
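As a concrete starting point for comparing participants, a minimal sketch that puts two series on a shared datetime index, resamples them onto a common grid, and computes rolling means (column names, file names, and window sizes are placeholders):

import pandas as pd

def load_series(pkl_path):
    # assumes two columns, ['time', 'value'], as in the sheets described above
    df = pd.read_pickle(pkl_path)
    df['time'] = pd.to_datetime(df['time'], errors='coerce')
    return df.set_index('time')['value'].sort_index()

s1 = load_series('P1_Ph1.pkl')
s2 = load_series('P2_Ph1.pkl')

# resample onto a common 1-second grid so the participants line up in time
aligned = pd.concat({'P1': s1.resample('1s').mean(),
                     'P2': s2.resample('1s').mean()}, axis=1)

# smooth with a 60-sample rolling mean and look at a simple correlation
rolling = aligned.rolling(window=60).mean()
print(aligned['P1'].corr(aligned['P2']))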