Abstract
We consider the problem of indexing a text \(T\) (of length \(n\)) with a light data structure that supports efficient search of patterns \(P\) (of length \(m\)) allowing errors under the Hamming distance. We propose a hash-based strategy that employs two classes of hash functions—dubbed Hamming-aware and de Bruijn—to drastically reduce search space and memory footprint of the index, respectively.
We use our succinct hash data structure to solve the \(k\)-mismatch search problem in \(2n\log \sigma + o(n\log \sigma )\) bits of space with a randomized algorithm having smoothed complexity \(\mathcal {O}((2\sigma )^k(\log n)^k(\log m+\xi ) + (occ+1)\cdot m)\), where \(\sigma \) is the alphabet size, \(occ\) is the number of occurrences, and \(\xi \) is a term depending on \(m\), \(n\), and on the amplitude \(\epsilon \) of the noise perturbing text and pattern. Significantly, we obtain that for any \(\epsilon >0\), for \(m\) large enough, \(\xi \in \mathcal {O}(\log m)\): our results improve upon previous linear-space solutions of the \(k\)-mismatch problem.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Langmead, B., Trapnell, C., Pop, M., Salzberg, S.L., et al.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10(3), R25 (2009)
Li, R., Yu, C., Li, Y., Lam, T.W., Yiu, S.M., Kristiansen, K., Wang, J.: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25(15), 1966–1967 (2009)
Vezzi, F., Del Fabbro, C., Tomescu, A.I., Policriti, A.: rNA: a fast and accurate short reads numerical aligner. Bioinformatics 28(1), 123–124 (2012)
Policriti, A., Tomescu, A.I., Vezzi, F.: A randomized numerical aligner (rna). J. Comput. Syst. Sci. 78(6), 1868–1882 (2012)
Cole, R., Gottlieb, L.A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Proceedings of the Thirty-sixth Annual ACM Symposium on Theory of Computing, pp. 91–100. ACM (2004)
Chan, H.-L., Lam, T.-W., Sung, W.-K., Tam, S.-L., Wong, S.-S.: A linear size index for approximate pattern matching. In: Lewenstein, M., Valiente, G. (eds.) CPM 2006. LNCS, vol. 4009, pp. 49–59. Springer, Heidelberg (2006)
Chávez, E., Navarro, G.: A metric index for approximate string matching. In: Rajsbaum, S. (ed.) LATIN 2002. LNCS, vol. 2286, p. 181. Springer, Heidelberg (2002)
Maaß, M.G., Nowak, J.: Text indexing with errors. Journal of Discrete Algorithms 5(4), 662–681 (2007)
Navarro, G., Baeza-Yates, R.: A hybrid indexing method for approximate string matching. Journal of Discrete Algorithms 1(1), 205–239 (2000)
Li, R., Li, Y., Kristiansen, K., Wang, J.: SOAP: short oligonucleotide alignment program. Bioinformatics 24(5), 713–714 (2008)
Weiner, P.: Linear pattern matching algorithms. In: IEEE Conference Record of 14th Annual Symposium on Switching and Automata Theory, SWAT 2008, pp. 1–11. IEEE (1973)
Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing 22(5), 935–948 (1993)
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys (CSUR) 39(1), 2 (2007)
Jacobson, G.: Space-efficient static trees and graphs. In: 30th Annual Symposium on Foundations of Computer Science, pp. 549–554. IEEE (1989)
Spielman, D.A., Teng, S.H.: Smoothed analysis: an attempt to explain the behavior of algorithms in practice. Communications of the ACM 52(10), 76–84 (2009)
Spielman, D., Teng, S.H.: Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. In: Proceedings of the Thirty-third Annual ACM Symposium on Theory of Computing, pp. 296–305. ACM (2001)
Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science, pp. 390–398. IEEE (2000)
Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 841–850. Society for Industrial and Applied Mathematics (2003)
Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: An alphabet-friendly FM-index. In: Apostolico, A., Melucci, M. (eds.) SPIRE 2004. LNCS, vol. 3246, pp. 150–160. Springer, Heidelberg (2004)
Andoni, A., Krauthgamer, R.: The smoothed complexity of edit distance. In: Aceto, L., Damgård, I., Goldberg, L.A., Halldórsson, M.M., Ingólfsdóttir, A., Walukiewicz, I. (eds.) ICALP 2008, Part I. LNCS, vol. 5125, pp. 357–369. Springer, Heidelberg (2008)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Policriti, A., Prezza, N. (2014). Hashing and Indexing: Succinct DataStructures and Smoothed Analysis. In: Ahn, HK., Shin, CS. (eds) Algorithms and Computation. ISAAC 2014. Lecture Notes in Computer Science(), vol 8889. Springer, Cham. https://doi.org/10.1007/978-3-319-13075-0_13
Download citation
DOI: https://doi.org/10.1007/978-3-319-13075-0_13
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13074-3
Online ISBN: 978-3-319-13075-0
eBook Packages: Computer ScienceComputer Science (R0)