Evaluating Top-K Approximate Patterns via Text Clustering

Lucchese, Claudio; Orlando, Salvatore; Perego, Raffaele

doi:10.1007/978-3-319-43946-4_8

Conference paper
First Online: 06 August 2016

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9829)

Abstract

This work investigates how approximate binary patterns can be evaluated objectively, using as a proxy measure the quality achieved by a text clustering algorithm whose document features are derived from those patterns. Specifically, we exploit approximate patterns within the well-known FIHC (Frequent Itemset-based Hierarchical Clustering) algorithm, which was originally designed to employ exact frequent itemsets to achieve a concise and informative representation of text data. We analyze several state-of-the-art algorithms for approximate pattern mining and measure their ability to extract patterns that characterize the document topics well, in terms of the quality of the clustering obtained by FIHC. Extensive and reproducible experiments on publicly available text corpora show that approximate itemsets provide a better representation than exact ones.
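
To make the evaluation idea concrete, the following is a minimal sketch in Python and not the FIHC algorithm itself. It assumes each document is a set of terms and each pattern is a set of terms, assigns every document to the pattern it overlaps most, and scores the resulting clusters against known topic labels with purity. The function names `assign_to_patterns` and `purity`, the toy documents, and the choice of purity as the quality measure are all illustrative assumptions; they are not taken from the paper.

```python
from collections import Counter


def assign_to_patterns(docs, patterns):
    """Assign each document (a set of terms) to the index of the pattern
    sharing the most terms with it.

    Simplified stand-in for pattern-based cluster assignment (assumption);
    ties are resolved in favour of the earliest pattern.
    """
    labels = []
    for doc in docs:
        overlaps = [len(doc & pattern) for pattern in patterns]
        labels.append(max(range(len(patterns)), key=overlaps.__getitem__))
    return labels


def purity(cluster_labels, true_labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    by_cluster = {}
    for cluster, truth in zip(cluster_labels, true_labels):
        by_cluster.setdefault(cluster, Counter())[truth] += 1
    majority_total = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority_total / len(true_labels)


# Toy corpus (hypothetical): two topics, two documents each.
docs = [
    {"goal", "match", "team"},
    {"team", "coach", "match"},
    {"stock", "market", "bank"},
    {"bank", "loan", "market"},
]
true_labels = ["sport", "sport", "finance", "finance"]

# Hypothetical patterns, as a pattern miner might return them.
patterns = [{"team", "match"}, {"market", "bank"}]

clusters = assign_to_patterns(docs, patterns)
print(clusters, purity(clusters, true_labels))
```

Under this proxy, a pattern set that separates the topics well yields a higher score than one that mixes them, which is the sense in which clustering quality can rank pattern mining algorithms.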

Notes

1. \(set(\cdot)\) takes an indicator vector and returns the corresponding subset.
2. http://www.cs.umb.edu/~smimarog/textmining/datasets/index.html
3. http://www.dataminingresearch.com/index.php/2010/09/classic3-classic4-datasets/
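
As a small illustration of the \(set(\cdot)\) operator in note 1, the sketch below maps a binary indicator vector to the subset it selects. The helper name `indicator_to_set` and the example vocabulary are hypothetical, and the universe of elements is assumed to be given as an ordered list.

```python
def indicator_to_set(indicator, universe):
    """Return the subset of `universe` selected by a 0/1 indicator vector.

    Position i of `indicator` is assumed to refer to universe[i].
    """
    if len(indicator) != len(universe):
        raise ValueError("indicator and universe must have the same length")
    return {item for item, bit in zip(universe, indicator) if bit}


# Hypothetical vocabulary and indicator vector.
terms = ["bank", "goal", "market", "team"]
selected = indicator_to_set([1, 0, 1, 0], terms)
print(selected)
```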

Acknowledgments

This work was partially supported by the EC H2020 Program INFRAIA-1-2014-2015 SoBigData: Social Mining & Big Data Ecosystem (654024).

Author information

Authors and Affiliations

ISTI-CNR, Pisa, Italy
Claudio Lucchese, Salvatore Orlando & Raffaele Perego
DAIS - Università Ca’ Foscari Venezia, Venice, Italy
Salvatore Orlando

Corresponding author

Correspondence to Claudio Lucchese.

Editor information

Editors and Affiliations

University of Science and Technology, Rolla, Missouri, USA
Sanjay Madria
Osaka University, Osaka, Japan
Takahiro Hara

About this paper

Cite this paper

Lucchese, C., Orlando, S., Perego, R. (2016). Evaluating Top-K Approximate Patterns via Text Clustering. In: Madria, S., Hara, T. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2016. Lecture Notes in Computer Science, vol 9829. Springer, Cham. https://doi.org/10.1007/978-3-319-43946-4_8

DOI: https://doi.org/10.1007/978-3-319-43946-4_8
Published: 06 August 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43945-7
Online ISBN: 978-3-319-43946-4
eBook Packages: Computer Science, Computer Science (R0)