Browse Articles

Hashing in Data Structure: Types and Functions

Amit Kumar Ghosh  18 min read
21 Sep 2023
Intermediate
331 Views

Hashing in Data Structures: An Overview

Do you want to learn about how data is stored and organized? Are you interested in the concepts behind efficient algorithms? Have you ever heard of hashing but aren't sure how it works or why it's important? Hashing is an integral part of any data structure, so learning about it can enhance your understanding of these powerful organizational tools. In this article, we'll explain some key concepts in hashing and help you understand why it has become such a popular technique for organizing large sets of data.

What is Hashing?

Data can be hashed into a shorter, fixed-length value for quicker access using a key or set of characters. This is how key-value pairs are stored in hash tables. A hash function generates hash values by mapping keys to the size of the table. In order to prevent reverse conversion to the original key, these values are produced through a one-way mathematical method. 

What is Hashing in Data Structures?

Hashing in data structures is a technique used to process and manage large amounts of data efficiently. It involves converting key values into indices to access data and store it in a hash table. Hashing in data structures, which can be found in databases and search engines, enables quick and effective data retrieval, decreases the need for storage, and is essential for working with huge datasets in data science and technology.

How does Hashing in Data Structures Work?

Hashing

  1. Input: The hashing process begins with an input, which can be any type of data, such as a password or a file.
  2. Hash Function: A hash function is a mathematical algorithm that takes the input and produces a unique hash value. The hash function performs complex calculations on the input and generates a fixed-size output. The resulting hash value is typically represented as a sequence of alphanumeric characters.
  3. Deterministic: A hash function is deterministic, meaning that for the same input, it will always produce the same hash value. This property is crucial for data integrity and consistency.
  4. Fixed Size: Hash functions produce hash values of a fixed size, regardless of the size of the input. This fixed-size representation allows for efficient storage and comparison of data.
  5. Collision Resistance: A good hash function aims to minimize the occurrence of hash collisions, where two different inputs produce the same hash value. While it's technically possible to have collisions due to the infinite input space and finite hash space, modern hash functions are designed to have an extremely low probability of collisions.
  6. Irreversibility: Hash functions are designed to be one-way functions, meaning that it's computationally infeasible to retrieve the original input from the hash value. This property adds a layer of security and is commonly used in password hashing, digital signatures, and data integrity checks.
  7. Uniqueness: Hash functions should strive to produce unique hash values for different inputs. While there can be a chance of collisions, a good hash function minimizes the likelihood of such occurrences

What is the 'Key' in Hashing?

The raw data that must be hashed in a hash table is called the Hashing Key. The function that converts the hash key into the hash value is carried out by the hashing algorithm. The hash value is the result of running the hash key through the hashing algorithm. The various types of hashing keys are: 

  • Public Key: Frequently used in cryptography, bitcoin transfers, and online security; slower because it is public; used for data encryption. works for general security when paired with a corresponding private key.
  • Private key: Commonly referred to as "symmetric," a private key is a shared piece of information that is used for both encryption and decryption.
  • SSH Public Key: For remote communication and authentication, SSH uses both public and private keys that are available to remote servers and stakeholders. 

What is a Hash Table?

An associative data structure called a hash table holds information as associated keys. Data from hash tables is stored in an array format, where each value is given a distinct index. Knowing the desired data's index allows us to rapidly retrieve it. For a result, regardless of data quantity, inserting and searching data is very quick. Elements are kept in an array and produced as an index in a data structure using hashing techniques in a hash table. 

What is the Hash Function in Data Structure?

A hash function in the data structure is a mathematical function that takes an input (or "message") and produces a fixed-size string of characters, which is typically a sequence of alphanumeric characters. The output, known as the hash value or hash code, is unique to the input data, meaning even a small change in the input will result in a significantly different hash value. Some of the key properties of a hash function include:

  • Deterministic: For a given input, the hash function always produces the same output.
  • Fast Computation: Hash functions are designed to quickly compute the hash value for any given input, regardless of its size.
  • Pre-image Resistance: It should be computationally infeasible to determine the original input data from its hash value.
  • Collision Resistance: It should be highly unlikely for two different inputs to produce the same hash value.
  • Fixed Output Size: The hash function produces a fixed-length output, regardless of the size of the input data.

Commonly used hash functions include MD5 (Message Digest 5), SHA-1 (Secure Hash Algorithm 1), SHA-256, and SHA-3. However, MD5 and SHA-1 are now considered weak for cryptographic purposes due to vulnerabilities that have been discovered over time. More secure hash functions like SHA-256 and SHA-3 are recommended for cryptographic applications.

Use cases of Hashing

  • Password Storage: Hash functions are commonly used to securely store passwords. Instead of storing the actual passwords, the system stores their hash values. When a user enters a password, it is hashed and compared with the stored hash value for authentication.
  • Data Integrity: Hashing is used to ensure data integrity by generating hash values for files or messages. By comparing the hash values before and after transmission or storage, it's possible to detect if any changes or tampering occurred.
  • Data Retrieval: Hashing is used in data structures like hash tables, which provide efficient data retrieval based on key-value pairs. The hash value serves as an index to store and retrieve data quickly.
  • Digital Signatures: Hash functions are an integral part of digital signatures. They are used to generate a unique hash value for a message, which is then encrypted with the signer's private key. This allows for verification of the authenticity and integrity of the message using the signer's public key.

Example

Python

 import hashlib

 def sha256(input):

     hash_object = hashlib.sha256(input.encode('utf-8'))

     hex_digest = hash_object.hexdigest()

     return hex_digest

 def main():

     message = "Hello, world!"

     hash_value = sha256(message)

     print("Message: " + message)

     print("SHA-256 Hash: " + hash_value)

 if __name__ == "__main__":

     main()

Java

import java.nio.charset.StandardCharsets;

import java.security.MessageDigest;

import java.security.NoSuchAlgorithmException;

public class Main {

    public static void main(String[] args) {

        String message = "Hello, world!";

        String hashValue = sha256(message);

        System.out.println("Message: " + message);

        System.out.println("SHA-256 Hash: " + hashValue);

    }

    public static String sha256(String input) {

        try {

            MessageDigest digest = MessageDigest.getInstance("SHA-256");

            byte[] hash = digest.digest(input.getBytes(StandardCharsets.UTF_8));

            StringBuilder hexDigest = new StringBuilder();

            for (byte b : hash) {

                String hex = String.format("%02x", b);

                hexDigest.append(hex);

            }

            return hexDigest.toString();

        } catch (NoSuchAlgorithmException e) {

            e.printStackTrace();

            return null;

        }

    }

}

C++

#include <iostream>

#include <iomanip>

#include <sstream>

#include <openssl/sha.h>

std::string sha256(const std::string &input) {

    unsigned char hash[SHA256_DIGEST_LENGTH];

    SHA256_CTX sha256;

    SHA256_Init(&sha256);

    SHA256_Update(&sha256, input.c_str(), input.size());

    SHA256_Final(hash, &sha256);

    std::stringstream ss;

    for (int i = 0; i < SHA256_DIGEST_LENGTH; i++) {

        ss << std::hex << std::setw(2) << std::setfill('0') << static_cast<int>(hash[i]);

    }

    return ss.str();

}

int main() {

    std::string message = "Hello, world!";

    std::string hashValue = sha256(message);

    std::cout << "Message: " << message << std::endl;

    std::cout << "SHA-256 Hash: " << hashValue << std::endl;

    return 0;

}

Explanation

The "Hello, world!" message's SHA-256 hash value is calculated in this example and the result is printed.

Output:

Message: Hello, world!
SHA-256Hash: 7f83b1657ff1fc53b92dc18148a1d65dfc2d4b1fa3d677284addd200126d9069

Types of Hash Functions

The primary types of hash functions are:

  • Division Method.
  • Mid Square Method.
  • Folding Method.
  • Multiplication Method.  

Division Method

The easiest and quickest way to create a hash value is through division. The k-value is divided by M in this hash function, and the result is used. 

Formula:

h(K) = k mod M

(where k = key value and M = the size of the hash table)  

Advantages:

  • This method is effective for all values of M.
  • The division strategy only requires one operation, thus it is quite quick. 

Disadvantages:

  • Since the hash table maps consecutive keys to successive hash values, this could result in poor performance.
  • There are times when exercising extra caution while selecting M's value is necessary. 

Example:

k = 1987

M = 13

h(1987) = 1987 mod 13

h(1987) = 4

Mid Square Method

The following steps are required to calculate this hash method:

  • k*k, or square the value of k
  • Using the middle r digits, calculate the hash value. 

Formula:

h(K) = h(k x k)

(where k = key value)

Advantages:

  • This technique works well because most or all of the digits in the key value affect the result. All of the necessary digits participate in a process that results in the middle digits of the squared result.
  • The result is not dominated by the top or bottom digits of the initial key value. 

Disadvantages: 

  • The size of the key is one of the limitations of this system; if the key is large, its square will contain twice as many digits.
  • Probability of collisions occurring repeatedly. 

Example:

k = 60

Therefore,

k = k x k

k = 60 x 60

k = 3600

Thus,

h(60) = 60

Folding Method

The process involves two steps:

  • With the exception of the last component, which may have fewer digits than the other parts, the key-value k should be divided into a predetermined number of pieces, such as k1, k2, k3,..., kn, each having the exact same amount of digits.
  • Add each element individually. The hash value is calculated without taking into account the final carry, if any. 

Formula:

k = k1, k2, k3, k4, ….., kn

s = k1+ k2 + k3 + k4 +….+ kn 

h(K)= s

(Where, s = addition of the parts of key k)

Advantages: 

  • Creates a simple hash value by precisely splitting the key value into equal-sized segments.
  • Without regard to distribution in a hash table.

Disadvantages:

  • When there are too many collisions, efficiency can occasionally suffer. 

Example:

k = 12345

k1 = 67; k2 = 89; k3 = 12

Therefore,

s = k1 + k2 + k3

s = 67 + 89 + 12

s = 168

Multiplication Method

  • Determine a constant value. A, where (0, A, 1)
  • Add A to the key value and multiply.
  • Consider kA's fractional portion.
  • Multiply the outcome of the preceding step by M, the hash table's size. 

Formula:

h(K) = floor (M (kA mod 1))

(Where, M = size of the hash table, k = key value and A = constant value)

Advantages:

  • Any number between 0 and 1 can be applied to it, however, some values seem to yield better outcomes than others. 

Disadvantages: 

  • The multiplication method is often appropriate when the table size is a power of two since multiplication hashing makes it possible to quickly compute the index by key. 

Example:

k = 5678

A = 0.6829

M = 200

Now, calculating the new value of h(5678):

h(5678) = floor[200(5678 x 0.6829 mod 1)]

h(5678) = floor[200(3881.5702 mod 1)]

h(5678) = floor[200(0.5702)]

h(5678) = floor[114.04]

h(5678) = 114

So, with the updated values, h(5678) is 114.

What is a Hash Collision?

When a hash algorithm generates the same hash value for two separate input values, this is known as a hash collision. However, it's crucial to note that collisions are not an issue; rather, they constitute a key component of hashing algorithms. Because various hashing methods used in data structures convert each input into a fixed-length code regardless of its length, collisions happen. The hashing algorithms will eventually yield repeating hashes since there are an infinite number of inputs and a finite number of outputs. 

Types of Hashing in Data Structures

The two main hashing methods used in a data structure are:

  • Open hashing/separate chaining/closed addressing
  • Open addressing/closed hashing 

Open hashing/separate chaining/closed addressing

A typical collision handling technique called "separate chaining" links components with the same hash using linked lists. It is also known as closed addressing and employs arrays of linked lists to successfully prevent hash collisions.

Advantages:

  • Implementation is simple and easy
  • We can add more keys to the table because the hash table has a lot of empty places.
  • Less sensitive than average to changing load factors
  • Typically utilized when there is uncertainty on the number and frequency of keys to be used in the hash table. 

Disadvantages:

  • Space is wasted
  • The length of the chain lengthens the search period.
  • Comparatively worse cache performance to closed hashing. 

Closed hashing (Open addressing)

Instead of using linked lists, open addressing stores each entry in the array itself. The hash value is not used to locate objects. In order to insert, it first verifies the array beginning from the hashed index and then searches for an empty slot using probing sequences. The probe sequence, with changing gaps between subsequent probes, is the process of progressing through entries. There are three methods for dealing with collisions in closed hashing:

Linear Probing 

Linear probing includes inspecting the hash table sequentially from the very beginning. If the site requested is already occupied, a different one is searched. The distance between probes in linear probing is typically fixed (often set to a value of 1). 

Formula

index = key % hashTableSize 

Sequence

index = ( hash(n) % T)

(hash(n) + 1) % T

(hash(n) + 2) % T

(hash(n) + 3) % T … and so on. 

Quadratic Probing

The distance between subsequent probes or entry slots is the only difference between linear and quadratic probing. You must begin traversing until you find an available hashed index slot for an entry record if the slot is already taken. By adding each succeeding value of any arbitrary polynomial in the original hashed index, the distance between slots is determined. 

Formula:

index = index % hashTableSize 

Sequence:

index = ( hash(n) % T)

(hash(n) + 1 x 1) % T

(hash(n) + 2 x 2) % T

(hash(n) + 3 x 3) % T … and so on 

Double-Hashing 

The time between probes is determined by yet another hash function. Double hashing is an optimized technique for decreasing clustering. The increments for the probing sequence are computed using an extra hash function. 

Formula

(first hash(key) + i * secondHash(key)) % size of the table 

Sequence

index = hash(x) % S

(hash(x) + 1*hash2(x)) % S

(hash(x) + 2*hash2(x)) % S

(hash(x) + 3*hash2(x)) % S … and so on 

FAQs

1. What is hashing in data structures?

The process of hashing in data structures reduces data to a fixed-length value, making it simpler to find or maintain.

2. What are the two hashing techniques in data structure?

Separate chaining, as well as open addressing, are two types of hashing used in data structures.

3. What is the hash function with an example?

A mathematical process known as a hash function converts data into a fixed-length value. SHA-256 is an example.

4. What is the concept of hashing?

Data is mapped to fixed-size values with the help of a hash function to enable efficient storage & retrieval.

5. What is the main purpose of hashing?

Hashing is mostly used to find and access data rapidly in data structures like hash tables.

6. How many types of hashing are there?

Separate chaining and open addressing are the two primary types of hashing, each of which includes several subtypes.

Summary

For effective organization, data structures include hashing, which entails turning data into fixed-length values. Separate chaining and open addressing are the two primary hashing methods. Data is transformed into distinct fixed-length codes by hash methods like SHA-256. Hashing is useful for password storage, data integrity, and digital signatures and speeds up data retrieval while requiring less storage. Division, mid square, folding, and multiplication procedures are only a few of the several hash function kinds. 

Share
About Author
Amit Kumar Ghosh (SDE and Mentor)

A passionate professional with over 6 years of experience in development and training. He is passionate about learning new technologies and sharing his experience with professionals. He is an expert in C/C++, Java, Python and DSA.
Accept cookies & close this