Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sequence prediction of characters?

I am new to machine learning, so please go easy in case the problem is trivial.

I have been given a sequence of observed characters say, ABABBABBB..... (n characters). My goal is to predict the next characters by some "learning" mechanisms. My constraints are that the obeserved characters (training data?) is not too much i.e I have say a sequence of length 6000 to learn the underlying patterm

I am quite confused about what strategy to take to solve this problem,my initial bets are: 1) some kind of ngram model? 2) Neural networks (the likes of LSTM) etc.? 3) HMMs

Can you please give directions of the right approaches for solving this problem?

like image 761
suzee Avatar asked Mar 12 '17 13:03

suzee


People also ask

What is sequence prediction?

Sequence prediction is a popular machine learning task, which consists of predicting the next symbol(s) based on the previously observed sequence of symbols. These symbols could be a number, an alphabet, a word, an event, or an object like a webpage or product. For example: A sequence of words or characters in a text.

How do you predict numbers in a sequence?

First, find the common difference for the sequence. Subtract the first term from the second term. Subtract the second term from the third term. To find the next value, add to the last given number.

What is sequence classification?

Sequence classification is the task of predicting a class label given a sequence of observations. In many applications such as healthcare monitoring or intrusion detection, early classification is crucial to prompt intervention.

What is sequence learning in ML?

What is The Sequential Learning? Machine learning models that input or output data sequences are known as sequence models. Text streams, audio clips, video clips, time-series data, and other types of sequential data are examples of sequential data.


2 Answers

If you're dealing with a fairly trivial pattern, where letters are based only off of the previous one, then you're spot on that a Hidden Markov Model (HMM) would solve it - in fact something as simple as a Markov Chain would work.

If you want to have a little fun, then here's a custom solution based on the HMM that you can fiddle around with.


Go over the sample data, and create a linked list of every element, in the order they were inserted. Now make another list for each different character, and put the index of every list element where it belongs. Here's a (very poorly drawn) visual representation of the linked list, and the buckets below it:

Linked list above arrays

Now when you are presented a sequence, and asked to predict the next character, all you have to do is look at the most recent X characters, and see how sub-sequences that were similar to it acted.

To use my example above, then look at the most recent (the last) 3 characters to get BAC. You want to see if the sequence BAC has ever happened before, and what came after it when it did happen. If you check the bucket for the first letter of BAC (the B), you can see that the letter B showed up once before. Thankfully it follows the sequence - and an A came after it, so that will be the prediction.


You might want to not only check sequences of the past X, but also every number below X, giving each one less weight if the sequence matches, to create a better heuristic.

The difficult part is deciding how far behind to look - if you look too far, it'll take too long, and you might not get any matches. If you look too short, then you might miss a pattern, and have to guess.

Good luck - hopefully this is nice and easy to implement, and works for you.

like image 160
Addison Avatar answered Oct 25 '22 20:10

Addison


Your problem looks to be a time-series analysis. For that you should also take the use of statistics and Exploratory Data Analysis (EDA) besides machine-learning algorithms into account.

  1. I would start by assigning numbers to chars (A->1 , B->2 , and so on). It is usually not recommended to turn nominal variables (values without order) into ordinal ones, (2 is bigger than 1 but is "C" bigger than "A" or "Red" bigger than "Green"?!) but in this case, this will change your problem into an absolute time-series analysis one.

  2. Then I would utilize some routine EDA approaches, like 4-plot or autocorrelation analysis. this would tell you a lot about the statistical behavior of data, like "is the mean of data shifting?" or "how much random the dataset might be?" Afterwards you probably would have a better decision on which machine learning algorithm to use

  3. Depending on what you might find in the EDA analysis, you can go ahead with implementing ML algorithms. If you have a high correlated data (perceived from autocorrelation plot) then you would probably choose a sliding window approach in your feature selection i.e. assuming that each value depends on previous k values ( x_k = f(x_(k-1),x_(k-2),...,x_(k-m)) ). the value of "m" could be selected by analyzing autocorrelation plot. If you have a moving average, it would be a good idea to learn the average curve first and then go on with learning the offset of each instance from its average. If you perceived a degree of randomness in either the average curve or instance offsets, then you might want to choose a stochastic approach through your prediction problem.

Generally, EDA philosophy is "analysis should come before model selection" and I think it's true. If you know more about what you're dealing with, then you definitely will have a better call for model selection

like image 29
Alireza Avatar answered Oct 25 '22 20:10

Alireza