
Perceptual hash function for text [closed]

Does anyone know a simple perceptual hash algorithm for text? I took a look at the pHash function ph_texthash, but I want a simpler algorithm, preferably in Python. Thank you!

Tarantula asked Nov 04 '22 19:11



1 Answer

A blog post about perceptual hash functions (in the imaging context):

  • http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html

and some related Python code (it deals with images, not text, but may be adaptable):

  • http://sprunge.us/WcVJ?py (53 LOC)

As I understand from this short presentation about Perceptual Hashing of Textual Content, there are numerous approaches, which differ along several dimensions (the level of the text being analyzed, a linguistic versus a statistical approach, the model chosen to represent the text, ...). The right one will depend on your domain and the problem you are trying to solve.

You might also look into locality-sensitive hashing, which

is a method of performing probabilistic dimension reduction of high-dimensional data. The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items)
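To make the idea concrete, here is a minimal sketch of SimHash, one well-known locality-sensitive scheme for text. None of the sources above prescribes it; it is just one simple option. The function names (`tokens`, `simhash`, `hamming`), the use of word 3-grams, the 64-bit fingerprint size, and MD5 as the per-token hash are all assumptions chosen for illustration. Similar texts should tend to produce fingerprints that differ in only a few bits.

```python
import hashlib
import re

BITS = 64  # fingerprint width (an arbitrary choice for this sketch)


def tokens(text, n=3):
    """Split text into lowercase word n-grams (shingles)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        # Too short for a full n-gram: use the whole text as one token.
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def simhash(text):
    """Return a 64-bit SimHash fingerprint of `text` as an integer."""
    counts = [0] * BITS
    for tok in tokens(text):
        # Hash each token to 64 bits (first 8 bytes of its MD5 digest).
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when the positive votes win.
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Usage: compute `simhash(a)` and `simhash(b)` and compare them with `hamming`; a small distance suggests near-duplicate text. What counts as "small" is something you would have to tune on your own data.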

miku answered Nov 09 '22 14:11
