Last Updated on 2021-06-19 by Clay
Cosine similarity is a common method for measuring text similarity. The basic concept is simple: it is based on the angle between two vectors.
The larger the angle, the less similar the two vectors are.
The smaller the angle, the more similar the two vectors are.
Suppose there are three vectors A, B, and C. If the angle between C and B is the smallest, we say that C and B are the most similar.
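The relationship between angle and similarity can be checked numerically. The following is a minimal sketch, not part of the original post: the helper names `cosine_similarity` and `angle_degrees` and the 2-D vectors `ref`, `near`, and `far` are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def angle_degrees(u, v):
    """Angle between u and v in degrees, derived from the cosine similarity."""
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.degrees(math.acos(cos))


# Hypothetical vectors: `near` points closer to `ref` than `far` does.
ref = [1, 0]
near = [1, 1]
far = [0, 1]

for name, other in (("near", near), ("far", far)):
    print(f"ref vs {name}: "
          f"angle={angle_degrees(ref, other):.1f} deg, "
          f"cos_sim={cosine_similarity(ref, other):.3f}")
```

With these inputs, `near` sits at roughly 45 degrees from `ref` and `far` at 90 degrees, so `near` gets the higher similarity score.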
So how do we calculate cosine similarity? The formula is given at the top; here we implement it directly in code.
Code
To calculate the cosine similarity, we need the dot product of A and B and the lengths (norms) of A and B.
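For reference, the standard definition of cosine similarity for two n-dimensional vectors can be written as follows, where θ is the angle between A and B. The index i and the dimension n are notation introduced here for illustration.

```latex
\text{similarity}(A, B) = \cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
```

The numerator is the dot product and the two square roots in the denominator are the vector lengths.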
Python Script:
from sklearn.metrics.pairwise import cosine_similarity

# Vectors
vec_a = [1, 2, 3, 4, 5]
vec_b = [1, 3, 5, 7, 9]

# Dot and norm
dot = sum(a*b for a, b in zip(vec_a, vec_b))
norm_a = sum(a*a for a in vec_a) ** 0.5
norm_b = sum(b*b for b in vec_b) ** 0.5

# Cosine similarity
cos_sim = dot / (norm_a*norm_b)

# Results
print('My version:', cos_sim)
print('Scikit-Learn:', cosine_similarity([vec_a], [vec_b]))
Output:
My version: 0.9972413740548081
Scikit-Learn: [[0.99724137]]
The first part of the code implements the cosine similarity formula by hand, and the last line calls the Scikit-Learn function directly. As you can see, the two scores are the same up to the precision that is printed.
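The nested brackets in the Scikit-Learn output appear because `cosine_similarity` compares every row of its first argument with every row of its second and returns a matrix of scores. A rough pure-Python sketch of that pairwise behavior is shown here; `cosine` and `pairwise_cosine` are hypothetical helper names and are not part of Scikit-Learn.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_cosine(X, Y):
    """Nested list where entry [i][j] is the similarity of X[i] and Y[j]."""
    return [[cosine(x, y) for y in Y] for x in X]


vec_a = [1, 2, 3, 4, 5]
vec_b = [1, 3, 5, 7, 9]

# One vector on each side gives a 1 x 1 result, hence the double brackets.
result = pairwise_cosine([vec_a], [vec_b])
print(result)

# Index into the matrix to get the single score as a plain float.
print(result[0][0])
```

Passing several vectors on each side would yield one row per vector in `X` and one column per vector in `Y`.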