Browse Articles

Suffix Array and Suffix Tree in Data Structures

Amit Kumar Ghosh  13 min read
16 Jun 2023
Intermediate
394 Views

Introduction

Suffix Array and Suffix Tree are two of the most powerful searching algorithms available today. When it comes to searching for a specific string or pattern in a large amount of text, these two algorithms have proven to be incredibly fast and accurate. With their combined use, complex searches can be executed at lightning speed without having to sacrifice accuracy for speed. By understanding how they work and when each should be used, you’ll gain greater control over your text search capabilities resulting in increased efficiency when parsing through vast amounts of data. Let's take an overview look into both Suffix Arrays and Trees to learn why they're so popular!

What is Suffiix array

A Suffix Array is a powerful data structure used in computer science that has broad applications, including string matching and bioinformatics. It is an ordered array of all the suffixes of a given string, sorted lexicographically. This structure provides fast and efficient memory usage by storing the suffixes of the original string in a compact form. One of the main features that make the Suffix Array so useful is its ability to quickly find the location of a substring within the original string. By analyzing the properties of the Suffix Array, we can gain valuable insights into the structure and patterns of strings. With its versatility and efficiency, it's no wonder that this data structure has become a staple in many algorithms and applications.

Naive method to build Suffix Array

The Suffix Array is a data structure used in string algorithms to efficiently store and retrieve information about the suffixes of a given string. Although there are more advanced algorithms to construct the Suffix Array with better time complexity, here is a simple and naive method to build a Suffix Array:


 # Structure to store information of a suffix
 class Suffix:
     def __init__(self, index, suff):
         self.index = index
         self.suff = suff
 
 # A comparison function used by sort() to compare two suffixes
 def cmp(a, b):
     return (a.suff < b.suff) - (a.suff > b.suff)
 
 # This is the main function that takes a string 'txt' of size n as an
 # argument, builds and returns the suffix array for the given string
 def buildSuffixArray(txt):
     n = len(txt)
     suffixes = [Suffix(0, '')] * n
 
     # Store suffixes and their indexes in an array of structures.
     # The structure is needed to sort the suffixes alphabetically
     # and maintain their old indexes while sorting
     for i in range(n):
         suffixes[i] = Suffix(i, txt[i:])
 
     # Sort the suffixes using the comparison function defined above
     suffixes.sort(key=lambda x: x.suff)
 
     # Store indexes of all sorted suffixes in the suffix array
     suffixArr = [0] * n
     for i in range(n):
         suffixArr[i] = suffixes[i].index
 
     # Return the suffix array
     return suffixArr
 
 # A utility function to print an array of given size
 def printArr(arr):
     n = len(arr)
     for i in range(n):
         print(arr[i], end=' ')
     print()
 
 # Driver program to test above functions
 txt = "banana"
 suffixArr = buildSuffixArray(txt)
 print("Following is suffix array for", txt)
 printArr(suffixArr)
 

 import java.util.Arrays;

 # Structure to store information of a suffix
 class Suffix:
     def __init__(self, index, suff):
         self.index = index
         self.suff = suff
 
 # A comparison function used by sort() to compare two suffixes
 def cmp(a, b):
     return (a.suff < b.suff) - (a.suff > b.suff)
 
 # This is the main function that takes a string 'txt' of size n as an
 # argument, builds and returns the suffix array for the given string
 def buildSuffixArray(txt):
     n = len(txt)
     suffixes = [Suffix(0, '')] * n
 
     # Store suffixes and their indexes in an array of structures.
     # The structure is needed to sort the suffixes alphabetically
     # and maintain their old indexes while sorting
     for i in range(n):
         suffixes[i] = Suffix(i, txt[i:])
 
     # Sort the suffixes using the comparison function defined above
     suffixes.sort(key=lambda x: x.suff)
 
     # Store indexes of all sorted suffixes in the suffix array
     suffixArr = [0] * n
     for i in range(n):
         suffixArr[i] = suffixes[i].index
 
     # Return the suffix array
     return suffixArr
 
 # A utility function to print an array of given size
 def printArr(arr):
     n = len(arr)
     for i in range(n):
         print(arr[i], end=' ')
     print()
 
 # Driver program to test above functions
 txt = "banana"
 suffixArr = buildSuffixArray(txt)
 print("Following is suffix array for", txt)
 printArr(suffixArr)

 // Naive algorithm for building suffix array of a given text
 #include <iostream>
 #include <cstring>
 #include <algorithm>
 using namespace std;
 
 // Structure to store information of a suffix
 struct suffix
 {
  int index;
  char *suff;
 };
 
 // A comparison function used by sort() to compare two suffixes
 int cmp(struct suffix a, struct suffix b)
 {
  return strcmp(a.suff, b.suff) < 0? 1 : 0;
 }
 
 // This is the main function that takes a string 'txt' of size n as an
 // argument, builds and return the suffix array for the given string
 int *buildSuffixArray(char *txt, int n)
 {
  // A structure to store suffixes and their indexes
  struct suffix suffixes[n];
 
  // Store suffixes and their indexes in an array of structures.
  // The structure is needed to sort the suffixes alphabetically
  // and maintain their old indexes while sorting
  for (int i = 0; i < n; i++)
  {
   suffixes[i].index = i;
   suffixes[i].suff = (txt+i);
  }
 
  // Sort the suffixes using the comparison function
  // defined above.
  sort(suffixes, suffixes+n, cmp);
 
  // Store indexes of all sorted suffixes in the suffix array
  int *suffixArr = new int[n];
  for (int i = 0; i < n; i++)
   suffixArr[i] = suffixes[i].index;
 
  // Return the suffix array
  return suffixArr;
 }
 
 // A utility function to print an array of given size
 void printArr(int arr[], int n)
 {
  for(int i = 0; i < n; i++)
   cout << arr[i] << " ";
  cout << endl;
 }
 
 // Driver program to test above functions
 int main()
 {
  char txt[] = "banana";
  int n = strlen(txt);
  int *suffixArr = buildSuffixArray(txt, n);
  cout << "Following is suffix array for " << txt << endl;
  printArr(suffixArr, n);
  return 0;
 }

Output

 Following is suffix array for banana
5 3 1 0 4 2

The output represents the suffix array of the string "banana". Each number in the output array corresponds to the starting index of a suffix in the original string. The suffixes are sorted in lexicographical order.

In this case, the suffix array is [5, 3, 1, 0, 4, 2], which means the sorted suffixes are:

 "a"
"ana"
"anana"
"banana"
"na"
"nana"

Note that the suffix array is 0-based, so the indices start from 0.

Application of suffix array

  1. Suffix arrays are data structures that are primarily used in string processing and pattern-matching algorithms. They provide efficient storage and retrieval of information about the suffixes of a given string. Here are some applications of suffix arrays:
  2. Pattern Searching: Suffix arrays are commonly used in pattern searching algorithms, such as the longest common prefix (LCP) algorithm and the enhanced version of the Knuth-Morris-Pratt (KMP) algorithm. These algorithms use suffix arrays to quickly locate occurrences of a pattern within a text string.
  3. Text Compression: Suffix arrays can be employed in text compression algorithms, such as the Burrows-Wheeler Transform (BWT). The BWT rearranges the characters of a text string to enhance the compression ratio, and suffix arrays play a crucial role in constructing the BWT.
  4. Genome Assembly: In bioinformatics, suffix arrays are extensively used in genome assembly algorithms. These algorithms reconstruct the complete DNA sequence of an organism by aligning and merging smaller DNA fragments. Suffix arrays enable efficient matching and comparison of DNA sequences.
  5. Data Mining and Bioinformatics: Suffix arrays are employed in various data mining applications, including substring matching, similarity searches, and sequence alignment. They are particularly useful in bioinformatics for tasks like DNA sequence analysis, gene identification, and protein motif discovery.
  6. Longest Common Substring: Suffix arrays can be utilized to find the longest common substring among multiple strings. By using a combination of LCP arrays and binary search techniques on the suffix array, the longest common substring can be determined efficiently.
  7. Text Indexing: Suffix arrays are useful for creating text indexes, which enable rapid searches and retrieval of specific information within a large text corpus. They form the backbone of index data structures like the enhanced suffix array (ESA) and FM-index used in full-text search engines.
  8. Data Compression: Suffix arrays can be employed in data compression algorithms like the Lempel-Ziv-Welch (LZW) algorithm. Suffix arrays help in efficiently encoding and decoding sequences of symbols, resulting in reduced storage requirements and improved compression ratios

What is a suffix tree

The world of computer science is brimming with fascinating concepts and theories, one of them being the suffix tree. In simple terms, a suffix tree is a data structure that allows for quick and efficient pattern matching in a large body of text. It's essentially a tree-like structure that represents all the suffixes of a given string. While this may sound complex, the suffix tree's power lies in its ability to speed up operations like text indexing and searching in a way that wasn't possible before. It has numerous applications, from DNA sequencing to text-based search engines, making it an indispensable tool for various industries.

The functionality of Suffix Tree

  1. Construction: The first step in utilizing a suffix tree is constructing it for a given input string. The construction process builds the tree by iteratively adding suffixes to the tree, ensuring that common prefixes are shared among different suffixes. Various algorithms, such as Ukkonen's algorithm, are used to construct the suffix tree efficiently in linear time.
  2. Pattern matching: Suffix trees are particularly useful for pattern matching and substring search operations. Given a pattern string, you can search for its occurrences in the original input string by traversing the suffix tree. The search process involves following edges corresponding to the characters in the pattern, and if the pattern is found, it can be efficiently located using the tree structure.
  3. Longest common substring: Suffix trees can efficiently find the longest common substring (LCS) between two or more strings. By comparing the paths from the root to the deepest internal nodes that correspond to different strings, the LCS can be determined.
  4. Substring operations: Suffix trees allow various substring operations, such as finding all occurrences of a substring, counting the number of occurrences, and locating the first occurrence. These operations are typically performed by traversing the appropriate paths in the suffix tree.
  5. String compression: Suffix trees can be used to compress a given string by storing it as a suffix tree and then discarding the original string. The compressed representation saves space by eliminating repeated substrings and leveraging the shared prefixes in the tree structure.
  6. Text indexing: Suffix trees are also utilized for efficient text indexing. They enable rapid retrieval of substring information and support operations like indexing the occurrences of a substring or determining the number of times a particular substring appears in the text

Applications of Suffix Tree

  1. Pattern searching: Suffix trees are extensively used for efficient pattern matching and searching in strings. They enable fast substring queries and can find all occurrences of a pattern in a given text in linear time, making them highly efficient for applications like text editors, search engines, and DNA sequencing.
  2. Longest common substring: Suffix trees can efficiently find the longest common substring between multiple strings. This is particularly useful in bioinformatics for analyzing DNA and protein sequences to identify common regions or motifs.
  3. Text compression: Suffix trees play a crucial role in data compression algorithms like Burrows-Wheeler Transform (BWT) and the related algorithm used in the popular compression tool, gzip. Suffix trees help identify repeated substrings, which can be encoded in a more compact manner.
  4. Data mining: Suffix trees are employed in data mining tasks, such as finding frequent patterns or motifs in large datasets. By constructing a suffix tree of a collection of strings, it becomes possible to efficiently identify common patterns, which can be useful in various applications like DNA analysis, web log analysis, and natural language processing.
  5. Bioinformatics and genomics: Suffix trees are extensively used in DNA sequencing and analysis, where they help identify repeated sequences, locate genes, analyze genomic rearrangements, and compare DNA sequences. They aid in tasks like genome assembly, sequence alignment, and gene finding.
  6. Computational linguistics: Suffix trees are valuable in natural language processing (NLP) tasks. They can be used for efficient keyword searching, text indexing, and analyzing linguistic structures. Applications include information retrieval, text summarization, spelling correction, and sentiment analysis.
  7. Data compression: Suffix trees are utilized in various lossless compression algorithms, including the Lempel-Ziv-Welch (LZW) compression algorithm. Suffix trees facilitate efficient encoding and decoding of data by constructing a dictionary of repeated substrings.
  8. Network analysis: Suffix trees have been employed for analyzing network data, such as detecting patterns in network traffic, identifying network motifs, and finding repeated substructures in network topologies
Summary

In conclusion, we have explored Suffix Array and Suffix Tree, two powerful tools for string text processing. Both methods are based on the idea of a suffixes array of strings that contain all suffixes from a given text. We discussed that in the case of a Suffix Array, all substrings present in the target text need to be sorted, compared, and indexed which can be extremely time-consuming. On the other hand, in Suffix Tree much faster searches are possible since each node represents a substring that is already sorted. To sum up, both approaches are incredibly useful and can make your data look like magic when used correctly. So don't wait any longer get exploring Suffix Arrays and Suffix Trees today!

Share
About Author
Amit Kumar Ghosh (SDE and Mentor)

A passionate professional with over 6 years of experience in development and training. He is passionate about learning new technologies and sharing his experience with professionals. He is an expert in C/C++, Java, Python and DSA.
Accept cookies & close this