Abstract
Although Web search engines still answer user queries with lists of ten blue links to webpages, people increasingly issue queries to accomplish everyday tasks (e.g., finding a recipe, booking a flight, or reading online news). In this work, we propose a two-step methodology for discovering the tasks that users try to perform through search engines. First, we identify user tasks from individual user sessions stored in search engine query logs. In our view, a user task is a set of possibly noncontiguous queries, within a single user search session, that refer to the same need. Second, we discover collective tasks by aggregating similar user tasks, possibly performed by distinct users. To discover user tasks, we propose query similarity functions based on unsupervised and supervised learning approaches, and we present a set of query clustering methods that exploit these functions to detect user tasks. All the proposed solutions were evaluated on a manually built ground truth, and two of them performed better than state-of-the-art approaches. To detect collective tasks, we propose four methods that cluster previously discovered user tasks, each represented by the bag-of-words extracted from its constituent queries. These solutions were also evaluated on a second manually built ground truth.
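The two-step pipeline described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: Jaccard term overlap stands in for the proposed query similarity functions, connected components of a thresholded similarity graph stand in for the query clustering methods, a greedy single-pass grouping over cosine similarity stands in for the four collective-task methods, and the names `user_tasks`, `collective_tasks`, and both threshold values are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def jaccard(q1: str, q2: str) -> float:
    """Jaccard overlap of the term sets of two queries (stand-in similarity)."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def user_tasks(session: list[str], threshold: float = 0.3) -> list[list[str]]:
    """Step 1: split one user session into tasks.

    Queries are nodes, and an edge links two queries whose similarity
    reaches the (arbitrary) threshold. Each connected component is one
    task, so the queries of a task need not be contiguous in the session.
    """
    parent = list(range(len(session)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(session)), 2):
        if jaccard(session[i], session[j]) >= threshold:
            parent[find(j)] = find(i)

    groups: dict[int, list[str]] = {}
    for i, query in enumerate(session):
        groups.setdefault(find(i), []).append(query)
    return list(groups.values())


def bag_of_words(task: list[str]) -> Counter:
    """Represent a task by the term counts of its constituent queries."""
    return Counter(term for query in task for term in query.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def collective_tasks(
    tasks: list[list[str]], threshold: float = 0.5
) -> list[list[list[str]]]:
    """Step 2: group similar user tasks, possibly from distinct users.

    Greedy single pass: a task joins the first cluster whose accumulated
    bag-of-words is similar enough, otherwise it starts a new cluster.
    """
    clusters: list[tuple[Counter, list[list[str]]]] = []
    for task in tasks:
        bag = bag_of_words(task)
        for cluster_bag, members in clusters:
            if cosine(bag, cluster_bag) >= threshold:
                cluster_bag.update(bag)  # accumulate term counts
                members.append(task)
                break
        else:
            clusters.append((bag, [task]))
    return [members for _, members in clusters]
```

For example, calling `user_tasks` on the session `["cheap flights rome", "lasagna recipe", "flights rome paris", "easy lasagna recipe"]` separates the interleaved flight and recipe queries into two tasks, and `collective_tasks` can then merge those tasks with similar ones found in other users' sessions.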
Index Terms
- Discovering tasks from search engine query logs