I have a large data set of restaurant dishes (for example, "Pulled Pork", "Beef Brisket"...)
I am trying to "normalize" (wrong word) the dishes. I want "Pulled Pork" and "Pulled Pork Sandwich" and "Jumbo Pork Slider" all to map to a single dish, "Pulled Pork".
So far I have gotten started with NLTK using Python and had some fun playing around with frequency distributions and such.
Does anyone have a high-level strategy to approach this problem? Perhaps some keywords I could google?
Thanks
You might want to look into TF-IDF and cosine similarity.
There are challenging cases, however. Let's say you have the following three dishes:
Which two of them are you going to combine?
Using TF-IDF, you can find the most representative words. For example, the word "sandwich" may appear in many dishes (tuna sandwich, egg sandwich, cheese sandwich, etc.), so it is not very representative. Merging tuna sandwich and cheese sandwich may not be a good idea.
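To make the "sandwich is not very representative" point concrete, here is a minimal sketch that computes only the IDF part of TF-IDF with the standard library. The function name `idf_weights`, the smoothing formula, and the sample dishes are assumptions chosen for illustration, not something from a particular library.

```python
import math
from collections import Counter

def idf_weights(dishes):
    """Return a smoothed inverse document frequency for every word."""
    n = len(dishes)
    # Count in how many dish names each word appears (at most once per dish).
    doc_freq = Counter(word for dish in dishes for word in set(dish.lower().split()))
    # Smoothed IDF (an assumed variant): frequent words get lower weights.
    return {word: math.log((1 + n) / (1 + df)) + 1 for word, df in doc_freq.items()}

# Made-up sample data for illustration.
dishes = ["tuna sandwich", "egg sandwich", "cheese sandwich", "pulled pork"]
weights = idf_weights(dishes)

# Print words from least to most representative.
for word, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{word:10s} {weight:.3f}")
```

With this sample, "sandwich" appears in three of the four dishes and ends up with the lowest weight, while words such as "tuna" or "pork" score higher.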
Once you have the TF-IDF vectors, you can compute the cosine similarity between them and, perhaps with a static threshold, decide whether to merge two dishes or not.
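A rough sketch of that merge decision, using only the standard library so the arithmetic stays visible. The helper names `tfidf_vectors` and `cosine`, the smoothed IDF formula, the sample dishes, and the threshold of 0.5 are all assumptions; in practice the threshold would need tuning on real data.

```python
import math
from collections import Counter

def tfidf_vectors(dishes):
    """Build one sparse TF-IDF vector (word -> weight) per dish name."""
    n = len(dishes)
    tokenized = [dish.lower().split() for dish in dishes]
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed IDF (an assumed variant).
    idf = {word: math.log((1 + n) / (1 + df)) + 1 for word, df in doc_freq.items()}
    return [{w: tf * idf[w] for w, tf in Counter(tokens).items()} for tokens in tokenized]

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

THRESHOLD = 0.5  # assumed static threshold, not a recommended value

# Made-up sample data for illustration.
dishes = ["pulled pork", "pulled pork sandwich", "tuna sandwich", "cheese sandwich"]
vectors = tfidf_vectors(dishes)

for i in range(len(dishes)):
    for j in range(i + 1, len(dishes)):
        score = cosine(vectors[i], vectors[j])
        verdict = "merge" if score >= THRESHOLD else "keep separate"
        print(f"{dishes[i]!r} vs {dishes[j]!r}: {score:.2f} -> {verdict}")
```

On this sample, "pulled pork" and "pulled pork sandwich" score well above the threshold, while "tuna sandwich" and "cheese sandwich" share only the low-weight word "sandwich" and stay separate.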
Another issue arises: when you merge dishes, what are you going to name the result? (Pulled egg or egg sandwich?)
@alvas suggests using clustering once you have the similarity/dissimilarity values. I think that would be a good idea. You can first create your n×n distance/similarity matrix using cosine similarity on the TF-IDF vectors. Once you have the matrix, you can group the dishes with a clustering algorithm.
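As one possible illustration of the clustering step, the sketch below groups dishes from an n×n similarity matrix by linking every pair whose similarity reaches a threshold and taking the connected groups (a simple single-linkage style grouping, swapped in here as an example; the answer does not name a specific algorithm). The function name `cluster_by_threshold`, the threshold, and the hand-written matrix values are assumptions for illustration only.

```python
def cluster_by_threshold(similarity, threshold):
    """Group item indices whose pairwise similarity chains reach the threshold."""
    n = len(similarity)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] >= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Made-up sample data; the matrix holds illustrative, hand-written values.
dishes = ["pulled pork", "pulled pork sandwich", "tuna sandwich", "cheese sandwich"]
similarity = [
    [1.00, 0.87, 0.00, 0.00],
    [0.87, 1.00, 0.27, 0.27],
    [0.00, 0.27, 1.00, 0.29],
    [0.00, 0.27, 0.29, 1.00],
]

clusters = cluster_by_threshold(similarity, threshold=0.5)
for group in clusters:
    print([dishes[i] for i in group])
```

A known weakness of this kind of linking is chaining: A can end up with C because both resemble B. A library clustering routine with a different linkage may behave better on real menus.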
Sounds like you are effectively trying to do coreference resolution on named entities, where the entities are distinct dishes. You can check out projects like cort and nltk-drt.
However, from your example it's a little unclear why a pulled pork sandwich should be considered the same dish as pulled pork, so you may need a way to come up with your own training set (e.g. culled from Google) that tags entities as distinct within your desired tolerance.