
Where can I find a corpus of search engine queries?

I'm interested in training a question-answering system on top of user-generated search queries but so far it looks like such data is not made available. Are there some research centers or industry labs that have compiled corpora of search-engine queries?

mirazour asked Jun 02 '15 01:06



2 Answers

There are a couple of datasets like this:

Yahoo Webscope: http://webscope.sandbox.yahoo.com/catalog.php?datatype=l

Yandex datasets: https://www.kaggle.com/c/yandex-personalized-web-search-challenge/data (part of a Kaggle competition; you can sign up and download the data).

There are also the AOL Query Logs and MSN Query Logs, which were released as part of shared tasks over the past 10 years. I'm not sure whether they are still publicly available, but it's worth checking.
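Once you have one of these logs, a first step is usually to count how often each query occurs. Here is a minimal sketch, assuming the log is a tab-separated file with a header row and a column named `Query`. The column names and the sample rows below are made up for illustration, so check the documentation of whichever dataset you download.

```python
import csv
import io
from collections import Counter


def count_queries(lines, query_column="Query"):
    """Count how often each distinct query appears in a tab-separated log.

    `lines` is any iterable of text lines whose first line is a header.
    `query_column` is an assumed column name; adjust it to the dataset.
    """
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        # Normalise case and whitespace so trivially different queries merge.
        query = (row.get(query_column) or "").strip().lower()
        if query:
            counts[query] += 1
    return counts


# Tiny made-up sample standing in for a real log file.
sample = io.StringIO(
    "AnonID\tQuery\tQueryTime\n"
    "1\thow to install git\t2006-03-01 10:00:00\n"
    "2\tHow to install git\t2006-03-01 10:05:00\n"
    "2\tmysql join syntax\t2006-03-01 10:06:00\n"
)
counts = count_queries(sample)
print(counts.most_common(2))
```

For a real file you would pass an open file handle (for example `open(path, encoding="utf-8", newline="")`) instead of the `StringIO` sample.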

Aditya answered Oct 24 '22 07:10



The Webscope and Kaggle data sets have some specific restrictions. I would suggest the TREC data sets, such as this dataset from 2009.

Doug T. answered Oct 24 '22 06:10
