Discovering tasks from search engine query logs

Authors:
Claudio Lucchese, ISTI-CNR, Pisa, Italy
Salvatore Orlando, Università Ca' Foscari Venezia, Italy
Raffaele Perego, ISTI-CNR, Pisa, Italy
Fabrizio Silvestri, ISTI-CNR, Pisa, Italy
Gabriele Tolomei, Università Ca' Foscari Venezia, Italy

ACM Transactions on Information Systems, Volume 31, Issue 3, Article No. 14, pp. 1–43. https://doi.org/10.1145/2493175.2493179

Published: 05 August 2013

Abstract

Although Web search engines still answer user queries with lists of ten blue links to webpages, people increasingly issue queries to accomplish their daily tasks (e.g., finding a recipe, booking a flight, or reading online news). In this work, we propose a two-step methodology for discovering the tasks that users try to perform through search engines. First, we identify user tasks from individual user sessions stored in search engine query logs. In our view, a user task is a set of possibly noncontiguous queries (within a user search session) that refer to the same need. Second, we discover collective tasks by aggregating similar user tasks, possibly performed by distinct users. To discover user tasks, we propose query similarity functions based on unsupervised and supervised learning approaches. We present a set of query clustering methods that exploit these functions to detect user tasks. All the proposed solutions were evaluated on a manually built ground truth, and two of them performed better than state-of-the-art approaches. To detect collective tasks, we propose four methods that cluster previously discovered user tasks, which in turn are represented by the bag-of-words extracted from their composing queries. These solutions were also evaluated on another manually built ground truth.
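
As a rough illustration of the two-step idea described in the abstract, the sketch below groups the queries of one session into user tasks and then represents each task as a bag-of-words that could be compared across users. The abstract does not specify the paper's similarity functions or clustering methods, so everything concrete here is an assumption made for the example: the character-trigram Jaccard similarity, the 0.2 threshold, the connected-components grouping, the cosine comparison, and all function names are hypothetical stand-ins.

```python
import math
from collections import Counter
from itertools import combinations


def trigrams(query: str) -> set:
    """Character trigrams of a lowercased query (assumed lexical feature)."""
    text = query.lower().strip()
    if len(text) < 3:
        return {text}
    return {text[i:i + 3] for i in range(len(text) - 2)}


def similarity(q1: str, q2: str) -> float:
    """Jaccard similarity over trigrams; a stand-in, not the paper's function."""
    a, b = trigrams(q1), trigrams(q2)
    return len(a & b) / len(a | b)


def user_tasks(session: list, threshold: float = 0.2) -> list:
    """Step 1 (sketch): link queries whose similarity reaches the threshold and
    return the connected components, so a task may span noncontiguous queries."""
    parent = list(range(len(session)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(session)), 2):
        if similarity(session[i], session[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, query in enumerate(session):
        groups.setdefault(find(i), []).append(query)
    return list(groups.values())


def bag_of_words(task: list) -> Counter:
    """Represent a user task by the words of its composing queries."""
    return Counter(word for query in task for word in query.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Step 2 (sketch): compare two task bags; similar ones could be aggregated."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


session = ["cheap flights rome", "pasta recipe", "flights to rome", "carbonara pasta recipe"]
tasks = user_tasks(session)
bags = [bag_of_words(t) for t in tasks]
```

In this toy session the flight queries and the recipe queries are interleaved, which is the noncontiguous case the abstract mentions. Comparing the resulting bags with `cosine` across many users is one simple way to picture the aggregation into collective tasks.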

Index Terms

Information systems

Published in

ACM Transactions on Information Systems Volume 31, Issue 3
July 2013
202 pages
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/2493175

Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 5 August 2013
- Accepted: 1 March 2013
- Revised: 1 June 2012
- Received: 1 May 2011
Author Tags
Query log analysis
collective task discovery
collective tasks
query clustering
user search intent
user search session boundaries
user task discovery
user tasks
Qualifiers
- research-article
- Research
- Refereed

Article Metrics
- Total Citations: 45
- Total Downloads: 793
- Downloads (Last 12 months): 15
- Downloads (Last 6 weeks): 2