The problem: I have a series of chat messages -- between two users -- with time stamps. I could present, say, an entire day's worth of chat messages at once. During the entire day, however, there were multiple, discrete conversations/sessions...and it would be more useful to the user to see these divided up as opposed to all of the days as one continuous stream.
Is there an algorithm or heuristic that can 'deduce' implicit session/conversation starts/breaks from time stamps? Besides an arbitrary 'if the gap is more than x minutes, it's a separate session'. And if that is the only case, how is this interval determined? In any case, I'd like to avoid this.
For example, there are...fifty messages that get sent between 2:00 and 3:00, and then a break, and then twenty messages sent between 4:00 and 5:00. There would be a break inserted between there...but how would the break be determined?
I'm sure that there is already literature on this subject, but I just don't know what to search for.
I was playing around with things like edge detection algorithms and gradient-based approaches for a while.
(see comments for more clarification)
EDIT (Better idea):
You can view each message as being of two types:
You can model these two types of messages as independent Poisson processes, where the time difference between adjacent messages is an exponential distribution.
You can then empirically determine the exponential parameters for these two types of messages by hand (wouldn't be too hard to do given some initial data). Now you have a model for these two events.
Finally when a new message comes along, you can calculate the probability of the message being of type 1 or type 2. If type 2, then you have a new conversation.
Clarification:
The probability of the message being a new conversation, given that the delay is some time T
.
P(new conversation | delay=T) = P(new conversation AND delay=T)/P(delay=T)
Using Bayes' Rule:
= P(delay=T | new conversation)*P(new conversation)/P(delay=T)
The same calculation goes for P(old conversation | delay=T)
.
P(delay=T | new conversation)
comes from the model. P(new conversation)
is easily calculable from the data used to generate your model. P(delay=T)
you don't need to calculate at all since all you want to do is compare the two probabilities.
The difference in timestamps between adjacent messages depends on the type of conversation and the people participating. Thus you'll want an algorithm that takes into account local characteristics, as opposed to a global threshold parameter.
My proposition would be as follows:
Of course, I came up with these numbers on the spot. They would have to be tuned to fit your purpose.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With