Does anyone knows a simple perceptual hash algorithm for text ? I took a look in the pHash function ph_texthash
but I want a more simple algorithm.
Preferably in Python. Thank you !
A blog post about perceptual hash functions (in the imaging context):
and some related python code (dealing with images, not text, but may be adaptable):
As I understand this short presentation about Perceptual Hashing of Textual Content, there are numerous approaches (in different dimensions such as the level of the text, linguistic or statistical approach, the model chosen to represent the text, ...), and the right one will depend on your domain and the problems you try to solve.
Also you might look into Locality-sensitive hashing, which
is a method of performing probabilistic dimension reduction of high-dimensional data. The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With