Jaccard Distance
The Jaccard distance measures how different two sets are, unlike the Jaccard coefficient, which measures how similar they are. It is computed by subtracting the Jaccard coefficient from one or, equivalently, by dividing the difference between the sizes of the union and the intersection of the two sets by the size of the union:
$$d_J(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|} = \frac{|A \triangle B|}{|A \cup B|}$$
Where:
- $J(A, B) = |A \cap B| / |A \cup B|$ is the Jaccard coefficient of sets A and B.
- $|A \cap B|$ is the cardinality (size) of the intersection of sets A and B.
- $|A \cup B|$ is the cardinality (size) of the union of sets A and B.
- $|A \triangle B|$ is the cardinality (size) of the symmetric difference of sets A and B, which contains the elements that are in either set but not in their intersection.
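As a quick sanity check, the two ways of computing the distance described above (one minus the coefficient, and symmetric difference over union) can be compared on a small pair of sets. The sketch below is only illustrative: the function names and the sample sets are made up for this example, and it assumes the union is non-empty.

```python
from fractions import Fraction


def jaccard_coefficient(a, b):
    """Return |A ∩ B| / |A ∪ B| as an exact fraction (assumes a non-empty union)."""
    return Fraction(len(a & b), len(a | b))


def distance_from_coefficient(a, b):
    """Jaccard distance computed as one minus the Jaccard coefficient."""
    return 1 - jaccard_coefficient(a, b)


def distance_from_symmetric_difference(a, b):
    """Jaccard distance computed as |A △ B| / |A ∪ B|."""
    return Fraction(len(a ^ b), len(a | b))


# Hypothetical sample sets, chosen only for illustration
a = {1, 2, 3, 4}
b = {3, 4, 5}

d1 = distance_from_coefficient(a, b)
d2 = distance_from_symmetric_difference(a, b)
print(d1, d2, d1 == d2)  # 3/5 3/5 True
```

Exact fractions are used here so that the equality check is not affected by floating-point rounding.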
The Jaccard distance is often used to compute an n × n distance matrix for clustering and multidimensional scaling of n sample sets. It is a metric on the collection of all finite sets.
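A pairwise distance matrix of this kind could be built along the following lines. This is a minimal sketch in plain Python rather than a reference implementation: the helper names and the sample sets are hypothetical, and treating two empty sets as being at distance 0 is a convention assumed here to avoid dividing by zero.

```python
def jaccard_distance(a, b):
    """Return |A △ B| / |A ∪ B| for two sets."""
    union = a | b
    if not union:
        # Assumed convention: two empty sets are identical, so distance 0
        return 0.0
    return len(a ^ b) / len(union)


def pairwise_jaccard_distances(samples):
    """Build the symmetric n x n matrix of Jaccard distances."""
    n = len(samples)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # The distance is symmetric, so only the upper triangle is computed
        for j in range(i + 1, n):
            d = jaccard_distance(samples[i], samples[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


# Hypothetical sample sets, chosen only for illustration
samples = [
    {"a", "b", "c"},
    {"b", "c", "d"},
    {"x", "y"},
]

matrix = pairwise_jaccard_distances(samples)
for row in matrix:
    print(row)
```

The diagonal is zero because every set is at distance 0 from itself, and disjoint sets such as the first and third samples are at the maximum distance of 1.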
Example 1:
Python3
def jaccard_distance(set1, set2):
    # Symmetric difference of the two sets
    symmetric_difference = set1.symmetric_difference(set2)
    # Union of the two sets
    union = set1.union(set2)
    return len(symmetric_difference) / len(union)


# The repeated "Geeks" is stored only once, because sets drop duplicates
set_a = {"Geeks", "for", "Geeks", "NLP", "DSc"}
set_b = {"Geek", "for", "Geeks", "DSc.", "ML", "DSA"}

distance = jaccard_distance(set_a, set_b)
print("Jaccard distance:", distance)
Output:
Jaccard distance: 0.75
Example 2:
Suppose two people, A and B, go shopping in a department store that stocks five items. Let A = [1, 1, 1, 0, 1] and B = [1, 1, 0, 0, 1] be binary vectors recording whether each item was picked (1) or not (0). The Jaccard score then measures how similar their purchases are, and the Jaccard distance, a measure of dissimilarity, is calculated as 1 minus the Jaccard score:
Python
import numpy as np
from sklearn.metrics import jaccard_score

# Items picked by person A (passed to jaccard_score as the predicted labels)
y_pred = np.array([1, 1, 1, 0, 1])
# Items picked by person B (passed as the true labels)
y_true = np.array([1, 1, 0, 0, 1])

# Jaccard index: items both picked / items picked by at least one of them
jaccard_index = jaccard_score(y_true, y_pred)
# Jaccard distance is the complement of the index
jaccard_distance = 1 - jaccard_index

print("Jaccard Index:", jaccard_index)
print("Jaccard Distance:", jaccard_distance)
Output:
Jaccard Index: 0.75
Jaccard Distance: 0.25
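To see where 0.75 and 0.25 come from, the same two baskets can be turned into sets of item positions and compared with plain set operations. This is a sketch without scikit-learn; the helper `picked_items` and the variable names are hypothetical and exist only for this illustration.

```python
def picked_items(basket):
    """Return the positions of the items marked 1 in a binary basket vector."""
    return {index for index, flag in enumerate(basket) if flag == 1}


basket_a = [1, 1, 1, 0, 1]  # person A
basket_b = [1, 1, 0, 0, 1]  # person B

items_a = picked_items(basket_a)  # {0, 1, 2, 4}
items_b = picked_items(basket_b)  # {0, 1, 4}

# 3 items picked by both, 4 items picked by at least one of them
jaccard_index = len(items_a & items_b) / len(items_a | items_b)
jaccard_distance = 1 - jaccard_index

print("Jaccard Index:", jaccard_index)        # Jaccard Index: 0.75
print("Jaccard Distance:", jaccard_distance)  # Jaccard Distance: 0.25
```

The item that neither person picked (position 3) appears in neither set, so it has no effect on the result.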
Conclusion
The Jaccard similarity coefficient is a useful tool for checking how similar two sets are, with applications ranging from text analysis to recommendation systems to data deduplication. Once you know the formula, Python's built-in set operations or scikit-learn's jaccard_score make it quick to compute Jaccard similarity and distance in your own data analysis.
How to Calculate Jaccard Similarity in Python
In data science, measuring the similarity between two sets is a common and important task. Jaccard similarity is one of the most widely used similarity measures in machine learning, natural language processing, and recommendation systems. This article explains what Jaccard similarity is, why it is important, and how to compute it with Python.