I was analyzing some geographical data and trying to forecast the next occurrence of an event with respect to time and its geographical position. The data was in the following format (with sample data):
Timestamp   Latitude       Longitude     Event
13307266    102.86400972   70.64039541   "Event A"
13311695    102.8082912    70.47394645   "Event A"
13314940    102.82240522   70.6308513    "Event A"
13318949    102.83402128   70.64103035   "Event A"
13334397    102.84726242   70.66790352   "Event A"
The first step was classifying the positions into 100 zones, which reduces dimensionality and complexity.
Timestamp   Zone
13307266    47
13311695    65
13314940    51
13318949    46
13334397    26
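The question does not say how the 100 zones were derived. As a minimal sketch, assuming a regular 10 x 10 grid over the bounding box of the observed points (the function name `assign_zones` and the grid layout are hypothetical, so the zone numbers will not match the table above), the mapping could look like this:

```python
import numpy as np

def assign_zones(lat, lon, n_side=10):
    """Map each point to a cell index in 0..n_side**2 - 1 on a regular grid.

    Assumption: the grid spans the bounding box of the observed points.
    """
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    lat_edges = np.linspace(lat.min(), lat.max(), n_side + 1)
    lon_edges = np.linspace(lon.min(), lon.max(), n_side + 1)
    # Use interior edges only, so indices fall in 0..n_side - 1
    # and the maximum value lands in the last cell.
    row = np.digitize(lat, lat_edges[1:-1])
    col = np.digitize(lon, lon_edges[1:-1])
    return row * n_side + col

# Sample coordinates from the question.
lat = [102.86400972, 102.8082912, 102.82240522, 102.83402128, 102.84726242]
lon = [70.64039541, 70.47394645, 70.6308513, 70.64103035, 70.66790352]
zones = assign_zones(lat, lon)
```

A fixed grid is only one choice; clustering the coordinates would be another way to get zones that follow the data density.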
The next step was time series analysis, and I got stuck here for 2 months. I read a lot of literature and figured these were my options:
* ARIMA (auto-regressive method)
* Machine learning
I wanted to use machine learning to forecast in Python but couldn't really figure out how. Specifically, are there any Python libraries or open-source code for this use case that I can build upon?
EDIT 1: To clarify, the data is loosely dependent on past data but, over a period of time, is uniformly distributed. The best way to visualize the data is to imagine N agents controlled by an algorithm that allots them the task of picking up resources from grid cells. Resources are a function of the socioeconomic structure of society and are also strongly dependent on geography. It is in the interest of the "algorithm" to be able to predict demand by zone and by time.
P.S.: For auto-regressive models like ARIMA, Python already has a library: http://pypi.python.org/pypi/statsmodels
Some examples of ML forecasting models used in business applications are artificial neural networks, long short-term memory (LSTM) neural networks, and random forests.
At its most basic, machine learning uses programmed algorithms that receive and analyse input data to predict output values within an acceptable range. As new data is fed to these algorithms, they learn and optimise their operations to improve performance, developing 'intelligence' over time.
Holt's linear trend method is an extension of simple exponential smoothing that allows forecasting of data with a trend. Its forecast function is a function of both the level and the trend of the series.
Without example data or existing code I can't offer you anything concrete.
However, often it's helpful to re-phrase your problem in the nomenclature of the field you want to explore. In ML terms:
So I'd say you have a supervised classification problem. As an aside you may want to do some sort of time regularisation first; I'm guessing there are going to be patterns of the events depending on what time of the day, day of the month, or month of the year it is, and you may want to represent this as an additional feature.
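To make the time-of-day idea concrete, here is a minimal sketch of cyclical time features. It assumes the timestamps are in seconds, which the question does not state, and the function name `cyclical_time_features` is hypothetical. Encoding each phase as a sine/cosine pair keeps the end of one day close to the start of the next:

```python
import numpy as np

SECONDS_PER_DAY = 86_400
SECONDS_PER_WEEK = 7 * SECONDS_PER_DAY

def cyclical_time_features(timestamps):
    """Return sin/cos encodings of time of day and time of week.

    Assumption: timestamps are measured in seconds.
    """
    t = np.asarray(timestamps, dtype=float)
    day_phase = 2 * np.pi * (t % SECONDS_PER_DAY) / SECONDS_PER_DAY
    week_phase = 2 * np.pi * (t % SECONDS_PER_WEEK) / SECONDS_PER_WEEK
    return np.column_stack([
        np.sin(day_phase), np.cos(day_phase),
        np.sin(week_phase), np.cos(week_phase),
    ])

# Sample timestamps from the question.
timestamps = [13307266, 13311695, 13314940, 13318949, 13334397]
features = cyclical_time_features(timestamps)  # shape (5, 4)
```

These columns would be the features, and the zone of each event would be the label.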
Taking a look at one of the popular Python ML libraries available, scikit-learn, here:
http://scikit-learn.org/stable/supervised_learning.html
and consulting a recent posting on a cheatsheet for scikit-learn by one of the contributors:
http://peekaboo-vision.blogspot.de/2013/01/machine-learning-cheat-sheet-for-scikit.html
A good first bet would be to try Support Vector Machines (SVM), and if that fails, maybe give k Nearest Neighbours (kNN) a shot as well. Note that an ensemble classifier is usually superior to a single instance of a given SVM/kNN.
How, exactly, to apply SVM/kNN with time as a feature may require more research, since AFAIK (and others will probably correct me) SVM/kNN require bounded inputs with a mean of zero (or normalised to have a mean of zero). Just doing some random Googling you may be able to find certain SVM kernels, for example a Fourier kernel, that can transform a time-series feature for you:
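Whatever the exact requirement turns out to be, both methods are sensitive to feature scale, so standardising the inputs is a common precaution. A minimal sketch with scikit-learn, using synthetic stand-in data (the feature matrix and labels below are made up purely for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-ins: 4 unscaled features and 5 zone labels (assumptions).
X = rng.normal(loc=1000.0, scale=250.0, size=(200, 4))
y = rng.integers(0, 5, size=200)

# StandardScaler rescales each feature to zero mean and unit variance,
# using statistics from the training data only.
svm_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
knn_clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

svm_clf.fit(X[:150], y[:150])
knn_clf.fit(X[:150], y[:150])

svm_pred = svm_clf.predict(X[150:])
knn_pred = knn_clf.predict(X[150:])
```

Putting the scaler inside the pipeline means it is refitted on each training split during cross-validation, which avoids leaking test statistics into training.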
SVM Kernels for Time Series Analysis
http://www.stefan-rueping.de/publications/rueping-2001-a.pdf
scikit-learn handily allows you to specify a custom kernel for an SVM. See:
http://scikit-learn.org/stable/auto_examples/svm/plot_custom_kernel.html#example-svm-plot-custom-kernel-py
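As a rough illustration of the custom-kernel hook, the sketch below passes a callable to `SVC`. The kernel is a simple periodic (exp-sine-squared) kernel on a single time feature, not the Fourier kernel from the paper above; the one-day period, the `periodic_kernel` name, and the synthetic data are all assumptions:

```python
import numpy as np
from sklearn.svm import SVC

PERIOD = 86_400.0  # assumed period: one day, in seconds

def periodic_kernel(X, Y, period=PERIOD, length_scale=1.0):
    """Exp-sine-squared kernel on the first column of X and Y.

    Returns the Gram matrix of shape (len(X), len(Y)), as SVC expects
    from a callable kernel.
    """
    diff = X[:, None, 0] - Y[None, :, 0]
    return np.exp(-2.0 * np.sin(np.pi * diff / period) ** 2 / length_scale ** 2)

rng = np.random.default_rng(0)
# Synthetic timestamps over ten days; label is 1 in the first half of each day.
t = rng.uniform(0, 10 * PERIOD, size=(400, 1))
y = ((t[:, 0] % PERIOD) < PERIOD / 2).astype(int)

clf = SVC(kernel=periodic_kernel)
clf.fit(t[:300], y[:300])
accuracy = clf.score(t[300:], y[300:])
```

Because the kernel only sees the phase within the period, two events at the same time on different days count as similar, which is the behaviour a raw timestamp feature would not give you.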
With your knowledge of ML nomenclature, and example data in hand, you may want to consider posting the question to Cross Validated, the statistics Stack Exchange.
EDIT 1: Thinking about this problem more, you really need to understand whether your features and corresponding labels are independent and identically distributed (IID) or not. For example, what if you were modelling how forest fires spread over time? It's clear that the likelihood of a given zone catching fire is contingent on whether its neighbours are on fire. AFAIK SVM and kNN assume the data is IID. At this point I'm starting to get out of my depth, but I think you should at least give several ML methods a shot and see what happens! Remember to cross-validate! (scikit-learn provides tools for this.)
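For time-ordered data, a shuffled k-fold split can train on events that happen after the ones being tested. One option, sketched here on synthetic data, is scikit-learn's `TimeSeriesSplit`, which always tests on samples later than the training samples. The features, labels, and choice of kNN below are assumptions for illustration only:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

SECONDS_PER_DAY = 86_400

rng = np.random.default_rng(0)
# Synthetic, time-sorted events over 30 days (assumption: seconds).
t = np.sort(rng.uniform(0, 30 * SECONDS_PER_DAY, size=300))
phase = 2 * np.pi * (t % SECONDS_PER_DAY) / SECONDS_PER_DAY
X = np.column_stack([np.sin(phase), np.cos(phase)])
# Stand-in label: which third of the day the event falls in (0, 1 or 2).
y = (3 * (t % SECONDS_PER_DAY) / SECONDS_PER_DAY).astype(int)

clf = KNeighborsClassifier(n_neighbors=5)
cv = TimeSeriesSplit(n_splits=5)  # each fold tests on later samples only
scores = cross_val_score(clf, X, y, cv=cv)  # one accuracy value per fold
```

If the fold scores vary a lot, that is a hint that the relationship between time and zone is not stable over the period covered by the data.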