
How to group unknown text messages using an Algorithm?

Below is a sample data set that I need to group. If you look closely, most of the lines are nearly identical and differ only in a person id or an ID value.

Unexpected error:java.lang.RuntimeException:Data not found for person 1X99999123 . Clear set not defined . Dump
Unexpected error:java.lang.RuntimeException:Data not found for person 2X99999123 . Clear set not defined . Dump
Unexpected error:java.lang.RuntimeException:Data not found for person 31X9393912 . Clear set not defined . Dump
Unexpected error:java.lang.RuntimeException:Data not found for person 36X9393912 . Clear set not defined . Dump
Exception in thread "main" javax.crypto.BadPaddingException: ID 1 Given final block not properly padded
Exception in thread "main" javax.crypto.BadPaddingException: ID 2 Given final block not properly padded
Unexpected error:java.lang.RuntimeException:Data not found for person 5 . Clear set not defined . Dump
Unexpected error:java.lang.RuntimeException:Data not found for person 6 . Clear set not defined . Dump
Exception in thread "main" java.lang.NullPointerException at TripleDESTest.encrypt(TripleDESTest.java:18)

I want to group them so that the final result looks like this:

Unexpected error:java.lang.RuntimeException:Data not found - 6
Exception in thread "main" javax.crypto.BadPaddingException - 2
Exception in thread "main" java.lang.NullPointerException at - 1

Is there an existing API or algorithm that handles such cases?

Thanks in advance. Cheers, Shakti

asked Nov 18 '25 04:11 by Shakti



1 Answer

The question is tagged as machine learning, so I am going to suggest a classification approach.

You can tokenize each string and use every token that appears in the training set as a possible boolean feature: an instance has a feature if it contains that token.
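
As a rough illustration of the boolean-feature idea, here is a minimal sketch in Python. Splitting on whitespace is an assumption (any tokenizer would do), and the `boolean_features` helper is a hypothetical name, not part of any library.

```python
def boolean_features(messages):
    """Turn each message into a row of booleans, one column per token."""
    # Assumption: tokens are whitespace-separated words.
    token_sets = [set(message.split()) for message in messages]
    # The vocabulary is every token seen in the training set.
    vocabulary = sorted(set().union(*token_sets))
    # A feature is True when the message contains that token.
    matrix = [[token in tokens for token in vocabulary] for tokens in token_sets]
    return vocabulary, matrix


messages = [
    'Exception in thread "main" javax.crypto.BadPaddingException: ID 1 Given final block not properly padded',
    'Exception in thread "main" javax.crypto.BadPaddingException: ID 2 Given final block not properly padded',
]
vocabulary, matrix = boolean_features(messages)

# Columns where the two rows disagree: only the ID values differ.
differing = [tok for tok, a, b in zip(vocabulary, matrix[0], matrix[1]) if a != b]
print(differing)
```

The two rows agree on every column except the ones for the ID tokens, which is what makes near-duplicate messages look alike to a learner.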

Now, using this data, you can build (for instance) a C4.5 decision tree. Make sure the tree is pruned once it is built, and that the minimum number of examples per leaf is greater than 1.

Once the tree is built, the "clustering" is done by the tree itself! Each leaf contains the examples which are considered similar to each other.

You can now extract the clusters by traversing the classification tree and collecting the samples stored in each leaf into their own cluster.
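
To make the tree-and-leaves idea concrete, here is a minimal sketch in Python. It is not C4.5: it is a simplified information-gain (ID3-style) splitter with no pruning, and it only enforces the minimum-examples-per-leaf rule. The class labels, the `build_leaves` helper and the shortened sample are assumptions made for illustration; a decision tree needs class labels, which the raw log lines do not carry.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def build_leaves(rows, labels, indices, min_leaf=2):
    """Split recursively on the best token; return the leaves as lists of indices."""
    current = [labels[i] for i in indices]
    if len(set(current)) == 1:
        return [indices]  # pure node: stop splitting
    best_gain, best_split = 0.0, None
    vocabulary = sorted(set().union(*(rows[i] for i in indices)))
    for token in vocabulary:
        yes = [i for i in indices if token in rows[i]]
        no = [i for i in indices if token not in rows[i]]
        if len(yes) < min_leaf or len(no) < min_leaf:
            continue  # would create a leaf that is too small
        remainder = (len(yes) * entropy([labels[i] for i in yes])
                     + len(no) * entropy([labels[i] for i in no])) / len(indices)
        gain = entropy(current) - remainder
        if gain > best_gain:
            best_gain, best_split = gain, (yes, no)
    if best_split is None:
        return [indices]  # no admissible split: this node is a leaf
    yes, no = best_split
    return build_leaves(rows, labels, yes, min_leaf) + build_leaves(rows, labels, no, min_leaf)


messages = [
    "Unexpected error:java.lang.RuntimeException:Data not found for person 1X99999123 . Clear set not defined . Dump",
    "Unexpected error:java.lang.RuntimeException:Data not found for person 2X99999123 . Clear set not defined . Dump",
    "Unexpected error:java.lang.RuntimeException:Data not found for person 5 . Clear set not defined . Dump",
    'Exception in thread "main" javax.crypto.BadPaddingException: ID 1 Given final block not properly padded',
    'Exception in thread "main" javax.crypto.BadPaddingException: ID 2 Given final block not properly padded',
    'Exception in thread "main" java.lang.NullPointerException at TripleDESTest.encrypt(TripleDESTest.java:18)',
]
# Assumed, hand-assigned training labels (not present in the original data).
labels = ["Runtime", "Runtime", "Runtime", "BadPadding", "BadPadding", "NPE"]

rows = [set(message.split()) for message in messages]  # boolean token features
leaves = build_leaves(rows, labels, list(range(len(messages))))
for leaf in leaves:
    print(len(leaf), [labels[i] for i in leaf])
```

With `min_leaf=2`, the single NullPointerException message cannot get a leaf of its own, so it stays in the same leaf as the two BadPaddingException messages, while the three RuntimeException messages form the other leaf.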

Notes:

  • This algorithm will fail for the sample data you provided because it does not handle a unique message well (the NPE in your example): that message will probably end up in the same leaf as BadPaddingException.
  • No need to reinvent the wheel: you can use Weka, an open source machine learning library in Java, or other existing libraries for the algorithms.
  • Instead of using the tokens as binary features, you can use numerical features, such as the position of the token in the string: is it the 1st or the 10th token?
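
The positional variant from the last note could look roughly like the sketch below. Using the index of a token's first occurrence, and -1 for a token that is absent, are assumptions chosen for illustration; `positional_features` is a hypothetical helper name.

```python
def positional_features(messages):
    """One numeric column per token: index of its first occurrence, or -1."""
    token_lists = [message.split() for message in messages]
    vocabulary = sorted({tok for tokens in token_lists for tok in tokens})
    rows = []
    for tokens in token_lists:
        first_position = {}
        for position, tok in enumerate(tokens):
            # Keep only the first position if a token repeats.
            first_position.setdefault(tok, position)
        # Assumption: -1 marks a token that does not occur in this message.
        rows.append([first_position.get(tok, -1) for tok in vocabulary])
    return vocabulary, rows


messages = [
    "Unexpected error:java.lang.RuntimeException:Data not found for person 5 . Clear set not defined . Dump",
    'Exception in thread "main" javax.crypto.BadPaddingException: ID 1 Given final block not properly padded',
]
vocabulary, rows = positional_features(messages)

col = vocabulary.index("not")
print(rows[0][col], rows[1][col])
```

Here the token "not" is shared by both messages but sits at a different position in each, which a purely boolean feature would not show.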
answered Nov 20 '25 18:11 by amit



