Improving Clustering Quality by Automatic Text Summarization

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9460)

Abstract

Automatic text summarization is the process of reducing the size of a text document to create a summary that retains the most important points of the original. It can thus be applied by removing part of the content or by decreasing its importance. The contribution of this paper is twofold. First, we show that text summarization can improve the performance of classical text clustering algorithms, in particular by reducing the noise from long documents that can negatively affect clustering results. Second, we show that clustering quality can be used to quantitatively evaluate different summarization methods. In this regard, we propose a new graph-based summarization technique for keyphrase extraction, and we use the Classic4 and BBC NEWS datasets to evaluate the improvement in clustering quality obtained with text summarization.

Notes

  1. \(f=l=1\), where f and l are the number of sentences in FS and LS, respectively.

  2. http://www.dataminingresearch.com/index.php/2010/09/classic3-classic4-datasets/.

  3. http://duc.nist.gov/data.html.

  4. http://tartarus.org/martin/PorterStemmer/.

  5. http://www.berouge.com/.

  6. For the convergence of HITS, we stop iterating when for any vertex i in the graph the difference between the scores computed at two successive iterations falls below a given threshold: \(\frac{|x_{i}^{k+1}-x_{i}^{k}|}{x_{i}^{k}}<10^{-3}\) [10].

  7. https://rapidminer.com/products/studio/.
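
The HITS stopping rule in the notes can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `hits`, the L2 normalisation of the scores, the `max_iter` cap, the absolute-change fallback for vertices whose previous score is zero, and the toy graph are all assumptions made for illustration. The rule is read here as requiring the relative change to be below \(10^{-3}\) at every vertex. How the paper builds its graph is not described in this excerpt.

```python
import math


def hits(edges, nodes, tol=1e-3, max_iter=1000):
    """Run HITS on a directed graph; return (authority, hub) score dicts."""
    out_links = {n: [] for n in nodes}
    in_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
        in_links[dst].append(src)

    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}

    def normalize(scores):
        # L2 normalisation (an assumption; the notes do not state the norm).
        norm = math.sqrt(sum(v * v for v in scores.values()))
        if norm == 0:
            return dict(scores)
        return {n: v / norm for n, v in scores.items()}

    def converged(old, new):
        # |x_i^{k+1} - x_i^k| / x_i^k < tol for every vertex i.
        # If the previous score is zero, fall back to the absolute change
        # (an assumption, needed to avoid dividing by zero).
        for n in old:
            diff = abs(new[n] - old[n])
            if old[n] != 0:
                if diff / old[n] >= tol:
                    return False
            elif diff >= tol:
                return False
        return True

    for _ in range(max_iter):
        # Authority of n: sum of hub scores of vertices linking to n.
        new_auth = normalize({n: sum(hub[m] for m in in_links[n]) for n in nodes})
        # Hub of n: sum of (updated) authority scores of vertices n links to.
        new_hub = normalize({n: sum(new_auth[m] for m in out_links[n]) for n in nodes})
        done = converged(auth, new_auth) and converged(hub, new_hub)
        auth, hub = new_auth, new_hub
        if done:
            break
    return auth, hub


# Hypothetical toy graph: three vertices all point to "c".
edges = [("a", "c"), ("b", "c"), ("d", "c")]
auth, hub = hits(edges, ["a", "b", "c", "d"])
```

On this toy graph, "c" ends up as the only vertex with a non-zero authority score, while "a", "b" and "d" share equal hub scores. The iteration cap only guards against graphs on which the relative-change test is never met.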

References

  1. Aggarwal, C.C., Yu, P.S.: Finding generalized projected clusters in high dimensional spaces, vol. 29. ACM (2000)

  2. Blei, D.M.: Probabilistic topic models. Commun. ACM 55(4), 77–84 (2012)

  3. Dash, M., Koot, P.W.: Feature selection for clustering. In: Encyclopedia of Database Systems, pp. 1119–1125. Springer, New York (2009)

  4. Greene, D., Cunningham, P.: Practical solutions to the problem of diagonal dominance in kernel document clustering. In: Proceedings of the 23rd international conference on Machine learning, pp. 377–384. ACM (2006)

  5. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. J. ACM (JACM) 46(5), 604–632 (1999)

  6. Lin, C.-Y.: ROUGE: a package for automatic evaluation of summaries. In: Text summarization branches out, Proceedings of the ACL-04 workshop, vol. 8 (2004)

  7. Litvak, M., Last, M.: Graph-based keyword extraction for single-document summarization. In: Proceedings of the workshop on Multi-source Multilingual Information Extraction and Summarization, pp. 17–24. Association for Computational Linguistics (2008)

  8. Liu, T., Liu, S., Chen, Z., Ma, W.-Y.: An evaluation on feature selection for text clustering. ICML 3, 488–495 (2003)

  9. Mihalcea, R.: Graph-based ranking algorithms for sentence extraction, applied to text summarization. In: Proceedings of the ACL 2004 on Interactive poster and demonstration sessions, p. 20. Association for Computational Linguistics (2004)

  10. Mihalcea, R., Tarau, P.: TextRank: Bringing order into texts. Association for Computational Linguistics (2004)

  11. Papineni, K., Roukos, S., Ward, T., Zhu, W.-J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Association for Computational Linguistics (2002)

  12. Silva, J.A., Faria, E.R., Barros, R.C., Hruschka, E.R., de Carvalho, A.C., Gama, J.: Data stream clustering: a survey. ACM Comput. Surv. (CSUR) 46(1), 13 (2013)

  13. Wyse, N., Dubes, R., Jain, A.K.: A critical evaluation of intrinsic dimensionality algorithms. In: Pattern recognition in Practice, pp. 415–425 (1980)

Author information

Corresponding author

Correspondence to Mohsen Pourvali.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Pourvali, M., Orlando, S., Gharagozloo, M. (2015). Improving Clustering Quality by Automatic Text Summarization. In: Zuccon, G., Geva, S., Joho, H., Scholer, F., Sun, A., Zhang, P. (eds) Information Retrieval Technology. AIRS 2015. Lecture Notes in Computer Science, vol 9460. Springer, Cham. https://doi.org/10.1007/978-3-319-28940-3_23

  • DOI: https://doi.org/10.1007/978-3-319-28940-3_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-28939-7

  • Online ISBN: 978-3-319-28940-3

  • eBook Packages: Computer Science, Computer Science (R0)
