I'm interested in training a question-answering system on top of user-generated search queries but so far it looks like such data is not made available. Are there some research centers or industry labs that have compiled corpora of search-engine queries?
There are a couple of datasets like this:
Yahoo Webscope: http://webscope.sandbox.yahoo.com/catalog.php?datatype=l
Yandex datasets: https://www.kaggle.com/c/yandex-personalized-web-search-challenge/data This one is part of a Kaggle competition; you can sign up and download the data.
There are also the AOL Query Logs and the MSN Query Logs, which were released as part of shared tasks over the past 10 years. I'm not sure whether they are still public, but it is worth looking around for them.
The Webscope and Kaggle datasets come with some specific usage restrictions. I would suggest the TREC datasets instead, such as this dataset from 2009.