I'm trying to build a 3x3 transition matrix with this data:
days=['rain', 'rain', 'rain', 'clouds', 'rain', 'sun', 'clouds', 'clouds',
'rain', 'sun', 'rain', 'rain', 'clouds', 'clouds', 'sun', 'sun',
'clouds', 'clouds', 'rain', 'clouds', 'sun', 'rain', 'rain', 'sun',
'sun', 'clouds', 'clouds', 'rain', 'rain', 'sun', 'sun', 'rain',
'rain', 'sun', 'clouds', 'clouds', 'sun', 'sun', 'clouds', 'rain',
'rain', 'rain', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 'sun',
'clouds', 'clouds', 'sun', 'clouds', 'rain', 'sun', 'sun', 'sun',
'clouds', 'sun', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds',
'rain', 'clouds', 'clouds', 'sun', 'sun', 'sun', 'sun', 'sun', 'sun',
'clouds', 'clouds', 'clouds', 'clouds', 'clouds', 'sun', 'rain',
'rain', 'rain', 'clouds', 'sun', 'clouds', 'clouds', 'clouds', 'rain',
'clouds', 'rain', 'sun', 'sun', 'clouds', 'sun', 'sun', 'sun', 'sun',
'sun', 'sun', 'rain']
Currently, I'm doing it with some temporary dictionaries and a list that calculate the probability of each weather type separately. It's not a pretty solution. Can someone please guide me to a more reasonable solution to this problem?
self.transitionMatrix = np.zeros((3, 3))
# the columns are today
sun_total_count = 0
temp_dict = {'sun': 0, 'clouds': 0, 'rain': 0}
total_runs = 0
for (x, y), c in Counter(zip(data, data[1:])).items():
    # if column 0 is sun
    # (compare strings with ==, not "is", which tests object identity)
    if x == 'sun':
        # find the sum of all the numbers in this column
        sun_total_count += c
        total_runs += 1
        if y == 'sun':
            temp_dict['sun'] = c
        if y == 'clouds':
            temp_dict['clouds'] = c
        if y == 'rain':
            temp_dict['rain'] = c
        if total_runs == 3:
            self.transitionMatrix[0][0] = temp_dict['sun']/sun_total_count
            self.transitionMatrix[1][0] = temp_dict['clouds']/sun_total_count
            self.transitionMatrix[2][0] = temp_dict['rain']/sun_total_count
return self.transitionMatrix
For every type of weather, I need to calculate the probability of each type of weather on the next day.
If you don't mind using pandas, there's a one-liner for extracting the transition probabilities:
pd.crosstab(pd.Series(days[1:],name='Tomorrow'),
            pd.Series(days[:-1],name='Today'),normalize=1)
Output:
Today      clouds      rain       sun
Tomorrow
clouds    0.40625  0.230769  0.309524
rain      0.28125  0.423077  0.142857
sun       0.31250  0.346154  0.547619
Here the (forward) probability that tomorrow will be sunny given that today it rained is found at the column 'rain', row 'sun'. If you would like to have backward probabilities (what might have been the weather yesterday given the weather today), switch the first two parameters.
If you would like to have the probabilities stored in rows rather than columns, set normalize=0, but note that doing so directly on this example gives backward probabilities stored as rows. To obtain the same result as above but transposed, you could either a) transpose the result, or b) switch the order of the first two parameters and set normalize to 0.
If you just want to keep the results as a numpy 2-d array (and not as a pandas dataframe), append .values after the last parenthesis.
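As a rough, self-contained check of the crosstab approach, here is a minimal sketch. The short `days` list is made up for illustration (it is not the question's data), `normalize='columns'` is the spelled-out form of `normalize=1`, and `to_numpy()` is used as the newer alternative to `.values`.

```python
import pandas as pd

# Short illustrative sequence (an assumption, not the question's data).
days = ['rain', 'rain', 'sun', 'clouds', 'sun',
        'sun', 'rain', 'clouds', 'clouds', 'sun']

# Rows are tomorrow's weather, columns are today's weather.
# normalize='columns' divides every column by its own total.
table = pd.crosstab(
    pd.Series(days[1:], name='Tomorrow'),
    pd.Series(days[:-1], name='Today'),
    normalize='columns',
)

# Each column is a conditional distribution P(tomorrow | today),
# so every column should sum to 1.
column_sums = table.sum(axis=0)

# Plain 2-D NumPy array, with rows and columns in the table's label order.
matrix = table.to_numpy()

print(table)
print(column_sums)
```

Checking that every column sums to 1 is a quick way to confirm that the normalization axis matches the direction (forward or backward) you intended.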
I like a combination of pandas and itertools for this. The code block is a bit longer than the above, but don't conflate verbosity with speed. (The window function should be very fast; the pandas portion will admittedly be slower.)
First, make a "window" function. Here's one from the itertools cookbook. This gets you to a list of tuples of transitions (state1 to state2).
from itertools import islice

def window(seq, n=2):
    """Sliding window width n from seq. From old itertools recipes."""
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result
# list(window(days))
# [('rain', 'rain'),
# ('rain', 'rain'),
# ('rain', 'clouds'),
# ('clouds', 'rain'),
# ('rain', 'sun'),
# ...
Then use a pandas groupby + value_counts operation to count the transitions from each state1 to each state2 and turn the counts into relative frequencies:
import pandas as pd
pairs = pd.DataFrame(window(days), columns=['state1', 'state2'])
counts = pairs.groupby('state1')['state2'].value_counts()
probs = (counts / counts.sum()).unstack()
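Your result looks like this. Note that dividing by counts.sum() normalizes over all pairs, so these are joint probabilities (the whole table sums to 1) rather than per-row transition probabilities: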
print(probs)
state2  clouds  rain   sun
state1
clouds    0.13  0.09  0.10
rain      0.06  0.11  0.09
sun       0.13  0.06  0.23
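The table above sums to 1 across all of its cells, so it holds joint pair frequencies. If what you want is a transition matrix in which each row sums to 1 (the probability of state2 given state1), one possible variant is to divide each row of the count table by its own total. The sketch below assumes a short made-up `days` list and builds the pairs with `zip` so that it runs on its own; `cond` is a name introduced here for illustration.

```python
import pandas as pd

# Short illustrative sequence (an assumption, not the question's data).
days = ['rain', 'rain', 'sun', 'clouds', 'sun',
        'sun', 'rain', 'clouds', 'clouds', 'sun']

# Consecutive (state1, state2) pairs; equivalent to window(days) for n=2.
pairs = pd.DataFrame(list(zip(days[:-1], days[1:])),
                     columns=['state1', 'state2'])

# Count table: one row per state1, one column per state2.
# fill_value=0 covers transitions that never occur in the data.
counts = (pairs.groupby('state1')['state2']
               .value_counts()
               .unstack(fill_value=0))

# Divide every row by its own total to get P(state2 | state1).
cond = counts.div(counts.sum(axis=1), axis=0)

print(cond)
```

With this normalization, the entry at row 'rain', column 'sun' is the estimated probability of sun tomorrow given rain today.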
Here is a "pure" numpy solution. It creates 3x3 tables where the zeroth dimension (row number) corresponds to today and the last dimension (column number) corresponds to tomorrow.
The conversion from words to indices is done by truncating after the first letter and then using a lookup table.
For counting, numpy.add.at is used.
This was written with efficiency in mind. It does a million words in less than a second.
import numpy as np
report = [
'rain', 'rain', 'rain', 'clouds', 'rain', 'sun', 'clouds', 'clouds',
'rain', 'sun', 'rain', 'rain', 'clouds', 'clouds', 'sun', 'sun',
'clouds', 'clouds', 'rain', 'clouds', 'sun', 'rain', 'rain', 'sun',
'sun', 'clouds', 'clouds', 'rain', 'rain', 'sun', 'sun', 'rain',
'rain', 'sun', 'clouds', 'clouds', 'sun', 'sun', 'clouds', 'rain',
'rain', 'rain', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 'sun',
'clouds', 'clouds', 'sun', 'clouds', 'rain', 'sun', 'sun', 'sun',
'clouds', 'sun', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds',
'rain', 'clouds', 'clouds', 'sun', 'sun', 'sun', 'sun', 'sun', 'sun',
'clouds', 'clouds', 'clouds', 'clouds', 'clouds', 'sun', 'rain',
'rain', 'rain', 'clouds', 'sun', 'clouds', 'clouds', 'clouds', 'rain',
'clouds', 'rain', 'sun', 'sun', 'clouds', 'sun', 'sun', 'sun', 'sun',
'sun', 'sun', 'rain']
# create np array, keep only first letter (by forcing dtype)
# obviously, this only works because rain, sun, clouds start with different
# letters
# cast to int type so we can use for indexing
ri = np.array(report, dtype='|S1').view(np.uint8)
# create lookup
c, r, s = 99, 114, 115 # you can verify this using chr and ord
lookup = np.empty((s+1,), dtype=int)
lookup[[c, r, s]] = np.arange(3)
# translate c, r, s to 0, 1, 2
rc = lookup[ri]
# get counts (of pairs (today, tomorrow))
cnts = np.zeros((3, 3), dtype=int)
np.add.at(cnts, (rc[:-1], rc[1:]), 1)
# or as probs
probs = cnts / cnts.sum()
# or as conditional probs (if today is sun, how probable is rain tomorrow, etc.)
cond = cnts / cnts.sum(axis=-1, keepdims=True)
print(cnts)
print(probs)
print(cond)
# [[13  9 10]
#  [ 6 11  9]
#  [13  6 23]]
# [[ 0.13  0.09  0.1 ]
#  [ 0.06  0.11  0.09]
#  [ 0.13  0.06  0.23]]
# [[ 0.40625     0.28125     0.3125    ]
#  [ 0.23076923  0.42307692  0.34615385]
#  [ 0.30952381  0.14285714  0.54761905]]
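The first-letter trick above only works when every state starts with a different character. If that does not hold for your labels, a possible alternative (a sketch, not the answer's original method) is to let `np.unique` with `return_inverse=True` build the label-to-index mapping, and keep the `np.add.at` counting step as it is. The short `report` list below is made up for illustration.

```python
import numpy as np

# Short illustrative sequence (an assumption, not the question's data).
report = ['rain', 'rain', 'sun', 'clouds', 'sun',
          'sun', 'rain', 'clouds', 'clouds', 'sun']

# states: sorted unique labels; idx: position of each entry within states.
states, idx = np.unique(report, return_inverse=True)
n = len(states)

# Count (today, tomorrow) pairs: row = today, column = tomorrow.
cnts = np.zeros((n, n), dtype=int)
np.add.at(cnts, (idx[:-1], idx[1:]), 1)

# Conditional probabilities P(tomorrow | today); rows with no observed
# transitions are left as zeros instead of dividing by zero.
row_totals = cnts.sum(axis=1, keepdims=True)
cond = np.divide(cnts, row_totals,
                 out=np.zeros((n, n)), where=row_totals > 0)

print(states)
print(cnts)
print(cond)
```

This is likely somewhat slower than the byte-view lookup because `np.unique` sorts the input, but it makes no assumption about how the labels are spelled.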