
Normalizing a list of restaurant dishes

I have a large data set of restaurant dishes (for example, "Pulled Pork", "Beef Brisket"...)

I am trying to "normalize" (wrong word) the dishes. I want "Pulled Pork" and "Pulled Pork Sandwich" and "Jumbo Pork Slider" all to map to a single dish, "Pulled Pork".

So far I have gotten started with NLTK using Python and had some fun playing around with frequency distributions and such.

Does anyone have a high-level strategy to approach this problem? Perhaps some keywords I could google?

Thanks

George B asked Aug 26 '15 19:08


2 Answers

You might want to look into TF-IDF and cosine similarity.

There are challenging cases, however. Let's say you have the following three dishes:

  • Pulled pork
  • Pulled egg
  • Egg sandwich

Which two are you going to combine?

  • Pulled pork and pulled egg
  • Pulled egg and egg sandwich

Using TF-IDF, you can find the most representative words. For example, the word "sandwich" may appear in many dishes (tuna sandwich, egg sandwich, cheese sandwich, etc.), so it is not very representative. Merging tuna sandwich and cheese sandwich just because they share that word may not be a good idea.
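To make the "sandwich is not very representative" point concrete, here is a minimal sketch of the inverse document frequency part of TF-IDF in plain Python. The dish list and the function name `inverse_document_frequency` are made up for illustration, and the formula used is the basic `log(N / df)` variant; libraries often apply extra smoothing.

```python
import math
from collections import Counter

# Hypothetical sample data, for illustration only.
dishes = [
    "tuna sandwich",
    "egg sandwich",
    "cheese sandwich",
    "pulled pork",
    "pulled egg",
]

def inverse_document_frequency(documents):
    """Return {word: idf} using the basic formula idf = log(N / df)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each word at most once per dish.
        doc_freq.update(set(doc.lower().split()))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

idf = inverse_document_frequency(dishes)

# Print words from least to most representative.
for word, weight in sorted(idf.items(), key=lambda item: item[1]):
    print(f"{word:10s} {weight:.3f}")
```

On this toy list, "sandwich" gets the lowest weight because it appears in three of the five dishes, while words like "tuna" or "pork" that appear only once get the highest.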

Once you have the TF-IDF vectors, you can compute the cosine similarity between each pair of dishes and, perhaps with a static threshold, decide whether to merge them.
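A rough sketch of that step, again in plain Python so it runs without extra packages. The helper names (`tfidf_vectors`, `cosine_similarity`, `similar_pairs`), the sample dishes, and the threshold of 0.5 are all assumptions for illustration; the `+ 1.0` added to the IDF is one common smoothing choice, not the only one, and the right threshold depends on your data.

```python
import math
from collections import Counter
from itertools import combinations

def tfidf_vectors(documents):
    """Return one sparse {word: weight} TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    return [
        # tf * (idf + 1); the +1 keeps very common words from vanishing entirely.
        {word: count * (math.log(n_docs / doc_freq[word]) + 1.0)
         for word, count in Counter(tokens).items()}
        for tokens in tokenized
    ]

def cosine_similarity(a, b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def similar_pairs(documents, threshold):
    """Return (doc_a, doc_b, score) for every pair scoring at or above threshold."""
    vectors = tfidf_vectors(documents)
    pairs = []
    for i, j in combinations(range(len(documents)), 2):
        score = cosine_similarity(vectors[i], vectors[j])
        if score >= threshold:
            pairs.append((documents[i], documents[j], score))
    return pairs

# Hypothetical sample data and threshold, for illustration only.
dishes = [
    "pulled pork",
    "pulled pork sandwich",
    "tuna sandwich",
    "egg sandwich",
    "cheese sandwich",
]

for first, second, score in similar_pairs(dishes, threshold=0.5):
    print(f"merge candidate: {first!r} ~ {second!r} ({score:.2f})")
```

With this toy input, only "pulled pork" and "pulled pork sandwich" clear the threshold; the various sandwiches stay apart because "sandwich" carries little weight.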

Another issue also arises: when two dishes match, what are you going to name the merged dish? (Pulled egg or egg sandwich?)

Update:

@alvas suggests using clustering once you have the similarity/dissimilarity values. I think that would be a good idea. You can first create your n×n distance/similarity matrix using cosine similarity on the TF-IDF vectors. Once you have the distance matrix, you can group the dishes with a clustering algorithm.
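Here is one way that could look, as a sketch rather than a recommendation of a specific algorithm. It builds the n×n cosine distance matrix and then groups dishes with a very simple single-linkage rule (any two dishes closer than a cutoff end up in the same cluster). The function names, the sample dishes, and the cutoff of 0.5 are assumptions for illustration; a library clustering routine could be swapped in for `cluster_by_threshold`.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one sparse {word: weight} TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    return [
        {word: count * (math.log(n_docs / doc_freq[word]) + 1.0)
         for word, count in Counter(tokens).items()}
        for tokens in tokenized
    ]

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def distance_matrix(documents):
    """Build the full n x n cosine distance matrix."""
    vectors = tfidf_vectors(documents)
    return [[cosine_distance(a, b) for b in vectors] for a in vectors]

def cluster_by_threshold(distances, max_distance):
    """Single-linkage grouping: link any pair within max_distance (union-find)."""
    n = len(distances)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if distances[i][j] <= max_distance:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Hypothetical sample data and cutoff, for illustration only.
dishes = [
    "pulled pork",
    "pulled pork sandwich",
    "tuna sandwich",
    "tuna melt sandwich",
    "egg sandwich",
]

distances = distance_matrix(dishes)
clusters = cluster_by_threshold(distances, max_distance=0.5)
for cluster in clusters:
    print([dishes[i] for i in cluster])
```

On this toy input the two pulled pork dishes end up together, the two tuna dishes end up together, and "egg sandwich" stays on its own. Single linkage can chain unrelated dishes together on real data, so it is worth comparing against other clustering methods.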

Sait answered Nov 17 '22 05:11


Sounds like you are effectively trying to do coreference resolution on named entities, where the entities are distinct dishes. You can check out projects like cort and nltk-drt.

However, from your example it's a little unclear why a pulled pork sandwich should be considered the same dish as pulled pork, so you may need a way to come up with your own training set (e.g. culled from google) that tags entities as distinct within your desired tolerance.

lemonhead answered Nov 17 '22 03:11