How to group unknown text messages using an Algorithm?

Question

Below is the sample data set that I need to group. If you look closely, the lines are mostly identical and differ only in a small detail, such as a person id or an ID.

Unexpected error:java.lang.RuntimeException:Data not found for person 1X99999123 . Clear set not defined . Dump
Unexpected error:java.lang.RuntimeException:Data not found for person 2X99999123 . Clear set not defined . Dump
Unexpected error:java.lang.RuntimeException:Data not found for person 31X9393912 . Clear set not defined . Dump
Unexpected error:java.lang.RuntimeException:Data not found for person 36X9393912 . Clear set not defined . Dump
Exception in thread "main" javax.crypto.BadPaddingException: ID 1 Given final block not properly padded
Exception in thread "main" javax.crypto.BadPaddingException: ID 2 Given final block not properly padded
Unexpected error:java.lang.RuntimeException:Data not found for person 5 . Clear set not defined . Dump
Unexpected error:java.lang.RuntimeException:Data not found for person 6 . Clear set not defined . Dump
Exception in thread "main" java.lang.NullPointerException at TripleDESTest.encrypt(TripleDESTest.java:18)

I want to group them so that the final result looks like this:

Unexpected error:java.lang.RuntimeException:Data not found - 6
Exception in thread "main" javax.crypto.BadPaddingException - 2
Exception in thread "main" java.lang.NullPointerException at - 1

Is there an existing API or algorithm that handles such cases?

Thanks in Advance. Cheers Shakti

amit · Accepted Answer

The question is tagged machine-learning, so I am going to suggest a classification approach.

You can tokenize each string and use every token in the training set as a possible boolean feature: an instance has the feature if it contains that token.

Now, using this data, you can build (for instance) a C4.5 decision tree. Make sure the tree is pruned once it is built, and that the minimum number of examples per leaf is greater than 1.
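A minimal sketch of the feature step described above, assuming plain whitespace tokenization. The class and method names (`TokenFeatures`, `vocabulary`, `featureVector`) are made up for illustration and do not come from any library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: turns log lines into boolean "contains token" features.
final class TokenFeatures {

    // Assumption: tokens are separated by whitespace only.
    static List<String> tokenize(String message) {
        return Arrays.asList(message.trim().split("\\s+"));
    }

    // Every distinct token in the training set, sorted for a stable column order.
    static List<String> vocabulary(List<String> messages) {
        Set<String> vocab = new TreeSet<>();
        for (String message : messages) {
            vocab.addAll(tokenize(message));
        }
        return new ArrayList<>(vocab);
    }

    // One boolean per vocabulary entry: true if the message contains that token.
    static boolean[] featureVector(String message, List<String> vocab) {
        Set<String> present = new HashSet<>(tokenize(message));
        boolean[] row = new boolean[vocab.size()];
        for (int i = 0; i < vocab.size(); i++) {
            row[i] = present.contains(vocab.get(i));
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> messages = Arrays.asList(
            "Exception in thread \"main\" javax.crypto.BadPaddingException: ID 1 Given final block not properly padded",
            "Exception in thread \"main\" javax.crypto.BadPaddingException: ID 2 Given final block not properly padded");
        List<String> vocab = vocabulary(messages);
        System.out.println(vocab);
        for (String message : messages) {
            System.out.println(Arrays.toString(featureVector(message, vocab)));
        }
    }
}
```

The two sample lines share every token except the ID value, so their rows differ in only two columns. That near-identity is what a tree learner would pick up on.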

Once the tree is built, the "clustering" is done by the tree itself! Each leaf contains the examples which are considered similar to each other.

You can now extract the clusters by traversing the classification tree and collecting the samples stored in each leaf into their own cluster.
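The traversal could look roughly like the sketch below. The `Node` class is a hypothetical stand-in for whatever tree structure your library exposes, and the tree in `main` is built by hand for illustration; it is not the output of C4.5.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: collect one cluster per leaf of a boolean-feature tree.
final class LeafClusters {

    // An inner node tests whether a token is present; a leaf stores its samples.
    static final class Node {
        final String token;          // token tested at an inner node, null for a leaf
        final Node ifPresent;        // subtree for messages containing the token
        final Node ifAbsent;         // subtree for messages without the token
        final List<String> samples;  // non-null only for leaves

        Node(String token, Node ifPresent, Node ifAbsent) {
            this.token = token;
            this.ifPresent = ifPresent;
            this.ifAbsent = ifAbsent;
            this.samples = null;
        }

        Node(List<String> samples) {
            this.token = null;
            this.ifPresent = null;
            this.ifAbsent = null;
            this.samples = samples;
        }

        boolean isLeaf() {
            return samples != null;
        }
    }

    // Depth-first traversal: every leaf contributes exactly one cluster.
    static List<List<String>> clusters(Node root) {
        List<List<String>> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(Node node, List<List<String>> out) {
        if (node == null) {
            return;
        }
        if (node.isLeaf()) {
            out.add(node.samples);
            return;
        }
        collect(node.ifPresent, out);
        collect(node.ifAbsent, out);
    }

    public static void main(String[] args) {
        // Hand-built example tree with a single split on the token "Unexpected".
        Node withToken = new Node(Arrays.asList(
            "Unexpected error:java.lang.RuntimeException:Data not found for person 5 . Clear set not defined . Dump",
            "Unexpected error:java.lang.RuntimeException:Data not found for person 6 . Clear set not defined . Dump"));
        Node withoutToken = new Node(Arrays.asList(
            "Exception in thread \"main\" javax.crypto.BadPaddingException: ID 1 Given final block not properly padded",
            "Exception in thread \"main\" javax.crypto.BadPaddingException: ID 2 Given final block not properly padded",
            "Exception in thread \"main\" java.lang.NullPointerException at TripleDESTest.encrypt(TripleDESTest.java:18)"));
        Node root = new Node("Unexpected", withToken, withoutToken);

        for (List<String> cluster : clusters(root)) {
            System.out.println(cluster.size() + " message(s), e.g. " + cluster.get(0));
        }
    }
}
```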

Notes:

This algorithm will fail on the sample data you provided, because it does not cope well when one message is unique (the NPE in your example): that message will probably end up in the same leaf as the BadPaddingException messages.
No need to reinvent the wheel: you can use Weka, an open source machine learning library in Java, or other existing libraries for the algorithms.
Instead of binary features, the tokens can also be numerical features. For example, you can use the position of the token in the string: is it the 1st or the 10th token?
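The positional variant from the last note might be sketched like this. Encoding an absent token as 0 and using 1-based positions are assumptions made for the example, as is the class name `PositionFeatures`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: numeric features holding each token's position in the message.
final class PositionFeatures {

    // For each vocabulary token: 1-based index of its first occurrence, or 0 if absent.
    static int[] positionVector(String message, List<String> vocab) {
        List<String> tokens = Arrays.asList(message.trim().split("\\s+"));
        int[] row = new int[vocab.size()];
        for (int i = 0; i < vocab.size(); i++) {
            // indexOf returns -1 for a missing token, so adding 1 maps "absent" to 0.
            row[i] = tokens.indexOf(vocab.get(i)) + 1;
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> vocab = Arrays.asList("ID", "person", "Dump");
        String message =
            "Unexpected error:java.lang.RuntimeException:Data not found for person 5 . Clear set not defined . Dump";
        System.out.println(Arrays.toString(positionVector(message, vocab)));
    }
}
```

With positions as features, a tree can split on questions such as "is this token among the first few?", which a plain present/absent flag cannot express.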

Tags:

java

algorithm

machine-learning
