Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Designing a twenty questions algorithm

I am interested in writing a twenty questions algorithm similar to what akinator and, to a lesser extent, 20q.net uses. The latter seems to focus more on objects, explicitly telling you not to think of persons or places. One could say that akinator is more general, allowing you to think of literally anything, including abstractions such as "my brother".

The problem with this is that I don't know what algorithm these sites use, but from what I read they seem to be using a probabilistic approach in which questions are given a certain fitness based on how many times they have lead to correct guesses. This SO question presents several techniques, but rather vaguely, and I would be interested in more details.

So, what could be an accurate and efficient algorithm for playing twenty questions?

I am interested in details regarding:

  1. What question to ask next.
  2. How to make the best guess at the end of the 20 questions.
  3. How to insert a new object and a new question into the database.
  4. How to query (1, 2) and update (3) the database efficiently.

I realize this may not be easy and I'm not asking for code or a 2000 words presentation. Just a few sentences about each operation and the underlying data structures should be enough to get me started.

like image 766
IVlad Avatar asked Feb 06 '11 20:02

IVlad


2 Answers

Update, 10+ years later

I'm now hosting a (WIP, but functional) implementation here: https://twentyq.evobyte.org/ with the code here: https://github.com/evobyte-apps/open-20-questions. It's based on the same rough idea listed below.


Well, over three years later, I did it (although I didn't work full time on it). I hosted a crude implementation at http://twentyquestions.azurewebsites.net/ if anyone is interested (please don't teach it too much wrong stuff yet!).

It wasn't that hard, but I would say it's the non-intuitive kind of not hard that you don't immediately think of. My methods include some trivial fitness-based ranking, ideas from reinforcement learning and a round-robin method of scheduling new questions to be asked. All of this is implemented on a normalized relational database.

My basic ideas follow. If anyone is interested, I will share code as well, just contact me. I plan on making it open source eventually, but once I have done a bit more testing and reworking. So, my ideas:

  • an Entities table that holds the characters and objects played;
  • a Questions table that holds the questions, which are also submitted by users;
  • an EntityQuestions table holds entity-question relations. This holds the number of times each answer was given for each question in relation to each entity (well, those for which the question was asked for anyway). It also has a Fitness field, used for ranking questions from "more general" down to "more specific";
  • a GameEntities table is used for ranking the entities according to the answers given so far for each on-going game. An answer of A to a question Q pushes up all the entities for which the majority answer to question Q is A;
  • The first question asked is picked from those with the highest sum of fitnesses across the EntityQuestions table;
  • Each next question is picked from those with the highest fitness associated with the currently top entries in the GameEntities table. Questions for which the expected answer is Yes are favored even before the fitness, because these have more chances of consolidating the current top ranked entity;
  • If the system is quite sure of the answer even before all 20 questions have been asked, it will start asking questions not associated with its answer, so as to learn more about that entity. This is done in a round-robin fashion from the global questions pool right now. Discussion: is round-robin fine, or should it be fully random?
  • Premature answers are also given under certain conditions and probabilities;
  • Guesses are given based on the rankings in GameEntities. This allows the system to account for lies as well, because it never eliminates any possibility, just decreases its likeliness of being the answer;
  • After each game, the fitness and answers statistics are updated accordingly: fitness values for entity-question associations decrease if the game was lost, and increase otherwise.

I can provide more details if anyone is interested. I am also open to collaborating on improving the algorithms and implementation.

like image 191
IVlad Avatar answered Sep 21 '22 18:09

IVlad


This is a very interesting question. Unfortunately I don't have a full answer, let me just write down the ideas I could come up with in 10 minutes:

  • If you are able to halve the set of available answers on each question, you can distinguish between 2^20 ~ 1 million "objects". Your set is probably going to be larger, so it's right to assume that sometimes you have to make a guess.
  • You want to maximize utility. Some objects are chosen more often than others. If you want to make good guesses you have to take into consideration the weight of each object (= the probability of that object being picked) when creating the tree.
  • If you trust a little bit of your users you can gain knowledge based on their answers. This also means that you cannot use a static tree to ask questions because then you'll get the answers for the same questions.. and you'll learn nothing new if you encounter with the same object.
  • If a simple question is not able to divide the set to two halves, you could combine them to get better results: eg: "is the object green or blue?". "green or has a round shape?"
like image 33
Karoly Horvath Avatar answered Sep 24 '22 18:09

Karoly Horvath