research-article

Discovering tasks from search engine query logs

Published: 05 August 2013

Abstract

Although Web search engines still answer user queries with lists of ten blue links to webpages, people increasingly issue queries to accomplish daily tasks (e.g., finding a recipe, booking a flight, or reading online news). In this work, we propose a two-step methodology for discovering the tasks that users try to perform through search engines. First, we identify user tasks from individual user sessions stored in search engine query logs. In our view, a user task is a set of possibly noncontiguous queries, within a single user search session, that refer to the same need. Second, we discover collective tasks by aggregating similar user tasks, possibly performed by distinct users. To discover user tasks, we propose query similarity functions based on unsupervised and supervised learning approaches, and we present a set of query clustering methods that exploit these functions to detect user tasks. All the proposed solutions were evaluated on a manually built ground truth, and two of them performed better than state-of-the-art approaches. To detect collective tasks, we propose four methods that cluster previously discovered user tasks, each represented by the bag-of-words extracted from its constituent queries. These solutions were evaluated on a second manually built ground truth.
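The two-step idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the character-trigram Jaccard similarity, the cosine similarity over word counts, the connected-components grouping, both threshold values, and all function names (`components`, `user_tasks`, `collective_tasks`) are assumptions chosen only to make the example small and runnable.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Callable


def components(n: int, similar: Callable[[int, int], bool]) -> list[list[int]]:
    """Group indices 0..n-1 into connected components of the 'similar' relation."""
    parent = list(range(n))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if similar(i, j):
            parent[find(j)] = find(i)

    groups: dict[int, list[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def trigrams(query: str) -> set[str]:
    """Character 3-grams of a lowercased, whitespace-normalized query."""
    text = " ".join(query.lower().split())
    if len(text) < 3:
        return {text} if text else set()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def user_tasks(session: list[str], threshold: float = 0.3) -> list[list[str]]:
    """Step 1 (assumed variant): cluster the queries of one session into tasks.

    Queries need not be contiguous; any two whose trigram Jaccard similarity
    reaches the (hypothetical) threshold end up in the same task.
    """
    grams = [trigrams(q) for q in session]
    groups = components(
        len(session), lambda i, j: jaccard(grams[i], grams[j]) >= threshold
    )
    return [[session[i] for i in group] for group in groups]


def collective_tasks(tasks: list[list[str]], threshold: float = 0.5) -> list[list[int]]:
    """Step 2 (assumed variant): cluster user tasks by bag-of-words similarity.

    Each task is represented by the word counts of its queries; the result
    lists, per collective task, the indices of the user tasks it contains.
    """
    bags = [Counter(w for q in task for w in q.lower().split()) for task in tasks]
    return components(
        len(tasks), lambda i, j: cosine(bags[i], bags[j]) >= threshold
    )
```

For example, calling `user_tasks` on a session that interleaves flight and recipe queries separates them into two tasks, and `collective_tasks` can then merge a flight task from a different user's session with the first one. A supervised similarity function, as mentioned in the abstract, could be passed in place of the thresholded measures used here.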



          • Published in

            ACM Transactions on Information Systems  Volume 31, Issue 3
            July 2013
            202 pages
            ISSN:1046-8188
            EISSN:1558-2868
            DOI:10.1145/2493175

            Copyright © 2013 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 5 August 2013
            • Accepted: 1 March 2013
            • Revised: 1 June 2012
            • Received: 1 May 2011


            Qualifiers

            • research-article
            • Research
            • Refereed
