Extraction of high quality k-words for alignment-free sequence comparison Academic Article uri icon


  • The weighted Euclidean distance (D(2)) is one of the earliest dissimilarity measures used for alignment free comparison of biological sequences. This distance measure and its variants have been used in numerous applications due to its fast computation, and many variants of it have been subsequently introduced. The D(2) distance measure is based on the count of k-words in the two sequences that are compared. Traditionally, all k-words are compared when computing the distance. In this paper we show that similar accuracy in sequence comparison can be achieved by using a selected subset of k-words. We introduce a term variance based quality measure for identifying the important k-words. We demonstrate the application of the proposed technique in phylogeny reconstruction and show that up to 99% of the k-words can be filtered out for certain datasets, resulting in faster sequence comparison. The paper also presents an exploratory analysis based evaluation of optimal k-word values and discusses the impact of using subsets of k-words in such optimal instances.

publication date

  • 2014