Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm/Heuristic for grouping chat message histories by 'conversation'/implicit sessions from time stamps?

The problem: I have a series of chat messages -- between two users -- with time stamps. I could present, say, an entire day's worth of chat messages at once. During the entire day, however, there were multiple, discrete conversations/sessions...and it would be more useful to the user to see these divided up as opposed to all of the days as one continuous stream.

Is there an algorithm or heuristic that can 'deduce' implicit session/conversation starts/breaks from time stamps? Besides an arbitrary 'if the gap is more than x minutes, it's a separate session'. And if that is the only case, how is this interval determined? In any case, I'd like to avoid this.

For example, there are...fifty messages that get sent between 2:00 and 3:00, and then a break, and then twenty messages sent between 4:00 and 5:00. There would be a break inserted between there...but how would the break be determined?

I'm sure that there is already literature on this subject, but I just don't know what to search for.

I was playing around with things like edge detection algorithms and gradient-based approaches for a while.

(see comments for more clarification)

like image 812
Justin L. Avatar asked Jul 24 '12 20:07

Justin L.


1 Answers

EDIT (Better idea):

You can view each message as being of two types:

  1. A continuation of a previous conversation
  2. A brand new conversation

You can model these two types of messages as independent Poisson processes, where the time difference between adjacent messages is an exponential distribution.

You can then empirically determine the exponential parameters for these two types of messages by hand (wouldn't be too hard to do given some initial data). Now you have a model for these two events.

Finally when a new message comes along, you can calculate the probability of the message being of type 1 or type 2. If type 2, then you have a new conversation.

Clarification:

The probability of the message being a new conversation, given that the delay is some time T.

P(new conversation | delay=T) = P(new conversation AND delay=T)/P(delay=T)

Using Bayes' Rule:

= P(delay=T | new conversation)*P(new conversation)/P(delay=T)

The same calculation goes for P(old conversation | delay=T).

P(delay=T | new conversation) comes from the model. P(new conversation) is easily calculable from the data used to generate your model. P(delay=T) you don't need to calculate at all since all you want to do is compare the two probabilities.


The difference in timestamps between adjacent messages depends on the type of conversation and the people participating. Thus you'll want an algorithm that takes into account local characteristics, as opposed to a global threshold parameter.

My proposition would be as follows:

  1. Get the time difference between the last 10 adjacent messages.
  2. Compute the mean (or median)
  3. If the delay until the next message is more than 30 times the the mean, it's a new conversation.

Of course, I came up with these numbers on the spot. They would have to be tuned to fit your purpose.

like image 118
tskuzzy Avatar answered Sep 25 '22 02:09

tskuzzy