Following is the sample data set that I need to group together, if you look closely they are mostly similar text lines but with very minute difference of having either a person id or ID .
Unexpected error:java.lang.RuntimeException:Data not found for person 1X99999123 . Clear set not defined . Dump
Unexpected error:java.lang.RuntimeException:Data not found for person 2X99999123 . Clear set not defined . Dump
Unexpected error:java.lang.RuntimeException:Data not found for person 31X9393912 . Clear set not defined . Dump
Unexpected error:java.lang.RuntimeException:Data not found for person 36X9393912 . Clear set not defined . Dump
Exception in thread "main" javax.crypto.BadPaddingException: ID 1 Given final block not properly padded
Exception in thread "main" javax.crypto.BadPaddingException: ID 2 Given final block not properly padded
Unexpected error:java.lang.RuntimeException:Data not found for person 5 . Clear set not defined . Dump
Unexpected error:java.lang.RuntimeException:Data not found for person 6 . Clear set not defined . Dump
Exception in thread "main" java.lang.NullPointerException at TripleDESTest.encrypt(TripleDESTest.java:18)
I want to group them so that final result is like
Unexpected error:java.lang.RuntimeException:Data not found - 6
Exception in thread "main" javax.crypto.BadPaddingException - 2
Exception in thread "main" java.lang.NullPointerException at - 1
Is there an existing API or algorithm available to handle such cases ?
Thanks in Advance. Cheers Shakti
The question is tagged as machine learning, so I am going to suggest classification approach.
You can tokenize each string, and use all tokens from training set as possible boolean features - an instance has the feature, if it contains this token.
Now, using this data, you can build (for instance) a C4.5 - a decision tree from the data. Make sure the tree use trimming once it is build, and minimum number of examples per leaf >1.
Once the tree is built, the "clustering" is done by the tree itself! Each leaf contains the examples which are considered similar to each other.
You can now extract this data by traversing the classification tree and extracting the samples stored in each leaf into its relevant cluster.
Notes:
BadPaddingException.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With