Library  "SimilarityMeasures"
Similarity measures are statistical methods used to quantify the distance between different data sets
or strings. There are various types of similarity measures, including those that compare:
- data points (SSD, Euclidean, Manhattan, Minkowski, Chebyshev, Correlation, Cosine, Canberra, MAE, MSE, Lorentzian, Intersection, Penrose Shape, Meehl),
- strings (Edit(Levenshtein), Lee, Hamming, Jaro),
- probability distributions (Mahalanobis, Fidelity, Bhattacharyya, Hellinger),
- sets (Kumar Hassebrook, Jaccard, Sorensen, Chi Square, Kulczynsky).
---
These measures are used in various fields such as data analysis, machine learning, and pattern recognition. They
help to compare and analyze similarities and differences between different data sets or strings, which
can be useful for making predictions, classifications, and decisions.
---
References:
en.wikipedia.org/wiki/Similarity_measure
cran.r-project.org/web/packages/SimilarityMeasures/index.html
numerics.mathdotnet.com/Distance
github.com/ngmarchant/comparator
github.com/drostlab/philentropy/blob/7bdefc99f6a7016ad3f90f963d784608edfe74fb/src/distances.h
github.com/scipy/scipy/blob/v1.11.2/scipy/spatial/distance.py
Encyclopedia of Distances, doi.org/10.1007/978-3-662-52844-0
ssd(p, q)
Sum of squared difference for N dimensions.
Parameters:
p (float[]): `array<float>` Vector with first numeric distribution.
q (float[]): `array<float>` Vector with second numeric distribution.
Returns: The squared Euclidean distance between `p` and `q`.
euclidean(p, q)
Euclidean distance for N dimensions.
Parameters:
p (float[]): `array<float>` Vector with first numeric distribution.
q (float[]): `array<float>` Vector with second numeric distribution.
Returns: The straight-line (Euclidean) distance between `p` and `q`.
manhattan(p, q)
Manhattan distance for N dimensions.
Parameters:
p (float[]): `array<float>` Vector with first numeric distribution.
q (float[]): `array<float>` Vector with second numeric distribution.
Returns: Sum of absolute differences between both points.
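The three measures above differ only in how the per-component differences are aggregated. A minimal Python sketch of the textbook formulas follows; it is not the library's Pine code, and the function names simply mirror the signatures listed here.

```python
import math

def ssd(p, q):
    """Sum of squared differences (squared Euclidean distance)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def euclidean(p, q):
    """Straight-line distance: square root of the SSD."""
    return math.sqrt(ssd(p, q))

def manhattan(p, q):
    """Sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

p = [1.0, 2.0, 3.0]
q = [4.0, 6.0, 3.0]
print(ssd(p, q), euclidean(p, q), manhattan(p, q))  # 25.0 5.0 7.0
```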
minkowski(p, q, p_value)
Minkowski distance for N dimensions.
Parameters:
p (float[]): `array<float>` Vector with first numeric distribution.
q (float[]): `array<float>` Vector with second numeric distribution.
p_value (float): `float` P value (order of the norm), default=1.0 (1: Manhattan, 2: Euclidean); does not support Chebyshev.
Returns: Measure of distance in the normed vector space.
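As a sketch of how the order parameter selects the metric, the standard Minkowski formula can be written in Python as below. This illustrates the general definition and is not the library's implementation; the default of 1.0 follows the parameter description above.

```python
def minkowski(p, q, p_value=1.0):
    """Minkowski distance: (sum |a - b|^p_value)^(1 / p_value)."""
    total = sum(abs(a - b) ** p_value for a, b in zip(p, q))
    return total ** (1.0 / p_value)

p = [1.0, 2.0, 3.0]
q = [4.0, 6.0, 3.0]
print(minkowski(p, q, 1.0))  # same value as the Manhattan distance
print(minkowski(p, q, 2.0))  # same value as the Euclidean distance
```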
chebyshev(p, q)
Chebyshev distance for N dimensions.
Parameters:
p (float[]): `array<float>` Vector with first numeric distribution.
q (float[]): `array<float>` Vector with second numeric distribution.
Returns: Measure of maximum absolute difference.
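The Chebyshev distance is the limit of the Minkowski distance as the order grows, which is why it gets its own function rather than a `p_value`. The Python sketch below shows both; the helper `minkowski_order` is hypothetical and only there for the comparison.

```python
def chebyshev(p, q):
    """Largest absolute difference over all components."""
    return max(abs(a - b) for a, b in zip(p, q))

def minkowski_order(p, q, order):
    """Hypothetical helper: Minkowski distance of a given order."""
    return sum(abs(a - b) ** order for a, b in zip(p, q)) ** (1.0 / order)

p = [1.0, 2.0, 3.0]
q = [4.0, 6.0, 3.0]
print(chebyshev(p, q))              # 4.0
print(minkowski_order(p, q, 50.0))  # approaches the Chebyshev value
```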
correlation(p, q)
Correlation distance for N dimensions.
Parameters:
p (float[]): `array<float>` Vector with first numeric distribution.
q (float[]): `array<float>` Vector with second numeric distribution.
Returns: Measure of linear correlation between both vectors (Pearson correlation as of v2).
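The v2 release note names the Pearson correlation as the quantity this function is meant to compute. A minimal Python sketch of that coefficient follows, assuming both vectors have equal length and non-zero variance; it is an illustration of the formula, not the library's code.

```python
import math

def pearson(p, q):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(p)
    mean_p = sum(p) / n
    mean_q = sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_q = sum((b - mean_q) ** 2 for b in q)
    return cov / math.sqrt(var_p * var_q)

print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # perfectly correlated
print(pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))  # perfectly anti-correlated
```

A correlation distance is commonly derived from this value as `1 - r`; whether the library exposes that form is not stated here.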
cosine(p, q)
Cosine distance between provided vectors.
Parameters:
p (float[]): `array<float>` 1D Vector.
q (float[]): `array<float>` 1D Vector.
Returns: The Cosine distance between vectors `p` and `q`.
---
angiogenesis.dkfz.de/oncoexpress/software/cs_clust/cluster.htm
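A minimal Python sketch of the cosine distance, assuming the common convention `1 - cosine similarity` (the convention used by SciPy) and non-zero input vectors. The library's exact convention is not stated on this page.

```python
import math

def cosine_distance(p, q):
    """1 minus the cosine of the angle between p and q."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)

print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_distance([1.0, 2.0], [2.0, 4.0]))  # parallel vectors, close to 0
```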
camberra(p, q)
Canberra distance for N dimensions.
Parameters:
p (float[]): `array<float>` Vector with first numeric distribution.
q (float[]): `array<float>` Vector with second numeric distribution.
Returns: Weighted measure of absolute differences between both points.
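The weighting mentioned above divides each absolute difference by the sum of the absolute component values. A Python sketch of that formula follows; skipping terms where both components are zero is an assumption borrowed from SciPy's convention, not something this page specifies.

```python
def canberra(p, q):
    """Sum of |a - b| / (|a| + |b|), skipping 0/0 terms."""
    total = 0.0
    for a, b in zip(p, q):
        denom = abs(a) + abs(b)
        if denom > 0.0:  # assumption: a 0/0 term contributes nothing
            total += abs(a - b) / denom
    return total

print(canberra([1.0, 2.0], [3.0, 2.0]))  # 0.5
```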
mae(p, q)
Mean absolute error is a normalized version of the sum of absolute differences (Manhattan).
Parameters:
p (float[]): `array<float>` Vector with first numeric distribution.
q (float[]): `array<float>` Vector with second numeric distribution.
Returns: Mean absolute error of vectors `p` and `q`.
mse(p, q)
Mean squared error is a normalized version of the sum of squared differences.
Parameters:
p (float[]): `array<float>` Vector with first numeric distribution.
q (float[]): `array<float>` Vector with second numeric distribution.
Returns: Mean squared error of vectors `p` and `q`.
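The normalization in both measures is a division by the number of components. A short Python sketch of the textbook definitions, assuming equal-length, non-empty vectors:

```python
def mae(p, q):
    """Mean absolute error: Manhattan distance divided by N."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)

def mse(p, q):
    """Mean squared error: sum of squared differences divided by N."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)

p = [1.0, 2.0, 3.0]
q = [4.0, 6.0, 3.0]
print(mae(p, q), mse(p, q))
```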
lorentzian(p, q)
Lorentzian distance between provided vectors.
Parameters:
p (float[]): `array<float>` Vector with first numeric distribution.
q (float[]): `array<float>` Vector with second numeric distribution.
Returns: Lorentzian distance of vectors `p` and `q`.
---
angiogenesis.dkfz.de/oncoexpress/software/cs_clust/cluster.htm
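A minimal Python sketch of the Lorentzian distance as it is usually defined, the sum of `ln(1 + |a - b|)` over all components. The logarithm dampens the influence of large single-component differences. This is the general formula, not a copy of the library's code.

```python
import math

def lorentzian(p, q):
    """Sum of natural logs of (1 + absolute difference)."""
    return sum(math.log(1.0 + abs(a - b)) for a, b in zip(p, q))

print(lorentzian([0.0, 0.0], [0.0, math.e - 1.0]))  # close to 1
```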
intersection(p, q)
Intersection distance between provided vectors.
Parameters:
p (float[]): `array<float>` Vector with first numeric distribution.
q (float[]): `array<float>` Vector with second numeric distribution.
Returns: Intersection distance of vectors `p` and `q`.
---
angiogenesis.dkfz.de/oncoexpress/software/cs_clust/cluster.htm
penrose(p, q)
Penrose Shape distance between provided vectors.
Parameters:
p (float[]): `array<float>` Vector with first numeric distribution.
q (float[]): `array<float>` Vector with second numeric distribution.
Returns: Penrose shape distance of vectors `p` and `q`.
---
angiogenesis.dkfz.de/oncoexpress/software/cs_clust/cluster.htm
meehl(p, q)
Meehl distance between provided vectors.
Parameters:
p (float[]): `array<float>` Vector with first numeric distribution.
q (float[]): `array<float>` Vector with second numeric distribution.
Returns: Meehl distance of vectors `p` and `q`.
---
angiogenesis.dkfz.de/oncoexpress/software/cs_clust/cluster.htm
edit(x, y)
Edit (aka Levenshtein) distance for indexed strings.
Parameters:
x (int[]): `array<int>` Indexed array.
y (int[]): `array<int>` Indexed array.
Returns: Number of deletions, insertions, or substitutions required to transform source string into target string.
---
generated description:
The Edit distance is a measure of dissimilarity used to compare two strings. It is defined as the minimum number of
operations (insertions, deletions, or substitutions) required to transform one string into another. The operations
are performed on the characters of the strings; in the Levenshtein variant each operation has a cost of 1, while
other variants weight the operations differently.
The Edit distance is widely used in applications such as spell checking, text similarity, and machine
translation. It can also be used to find the closest match to a string among a set of candidates.
---
github.com/disha2sinha/Data-Structures-and-Algorithms/blob/master/Dynamic Programming/EditDistance.cpp
red-gate.com/simple-talk/blogs/string-comparisons-in-sql-edit-distance-and-the-levenshtein-algorithm/
planetcalc.com/1721/
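The minimum number of operations described above is usually computed with dynamic programming. The Python sketch below keeps only two rows of the table and uses unit costs (the Levenshtein variant); converting characters with `ord` is an illustrative way to obtain the indexed arrays the function expects.

```python
def edit_distance(x, y):
    """Levenshtein distance between two indexed arrays, unit costs."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, xi in enumerate(x, start=1):
        curr = [i]  # deleting the first i items of x
        for j, yj in enumerate(y, start=1):
            cost = 0 if xi == yj else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

source = [ord(c) for c in "kitten"]
target = [ord(c) for c in "sitting"]
print(edit_distance(source, target))  # 3
```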
lee(x, y, dsize)
Lee distance between two indexed strings of equal length.
Parameters:
x (int[]): `array<int>` Indexed array.
y (int[]): `array<int>` Indexed array.
dsize (int): `int` Dictionary size.
Returns: Distance between two strings by accounting for dictionary size.
---
johndcook.com/blog/2020/03/29/lee-distance-codes-and-music/
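The Lee distance treats the alphabet as a circle of `dsize` symbols, so each position contributes the shorter way around. A Python sketch of the standard definition, assuming symbols are integers in `0 .. dsize - 1`:

```python
def lee(x, y, dsize):
    """Sum over positions of min(|a - b|, dsize - |a - b|)."""
    total = 0
    for a, b in zip(x, y):
        diff = abs(a - b)
        total += min(diff, dsize - diff)
    return total

print(lee([3, 1, 4, 0], [2, 5, 4, 3], 6))  # 1 + 2 + 0 + 3 = 6
```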
hamming(x, y)
Hamming distance between two indexed strings of equal length.
Parameters:
x (int[]): `array<int>` Indexed array.
y (int[]): `array<int>` Indexed array.
Returns: Number of positions at which the two sequences differ.
---
en.wikipedia.org/wiki/Hamming_distance
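A short Python sketch of the Hamming distance, counting mismatching positions in two equal-length indexed arrays. The `ord` conversion is only an illustrative way to build the arrays.

```python
def hamming(x, y):
    """Number of positions where x and y hold different symbols."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(1 for a, b in zip(x, y) if a != b)

first = [ord(c) for c in "karolin"]
second = [ord(c) for c in "kathrin"]
print(hamming(first, second))  # 3
```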
jaro(x, y)
Jaro similarity between two indexed strings.
Parameters:
x (int[]): `array<int>` Indexed array.
y (int[]): `array<int>` Indexed array.
Returns: Measure of two strings' similarity: the higher the value, the more similar the strings are.
The score is normalized such that `0` equates to no similarities and `1` is an exact match.
---
rosettacode.org/wiki/Jaro_similarity
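The Jaro score combines the number of matching symbols `m` (found within a sliding window) and the number of transpositions `t`. The Python sketch below follows the standard definition `(m/|x| + m/|y| + (m - t)/m) / 3`; it is an illustration and may differ in edge cases from the library's code.

```python
def jaro(x, y):
    """Jaro similarity in [0, 1] between two indexed arrays."""
    if not x and not y:
        return 1.0
    if not x or not y:
        return 0.0
    window = max(max(len(x), len(y)) // 2 - 1, 0)
    x_used = [False] * len(x)
    y_used = [False] * len(y)
    matches = 0
    for i, xi in enumerate(x):
        lo = max(0, i - window)
        hi = min(len(y), i + window + 1)
        for j in range(lo, hi):
            if not y_used[j] and xi == y[j]:
                x_used[i] = y_used[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched symbols that appear in a different order.
    out_of_order = 0
    k = 0
    for i, xi in enumerate(x):
        if x_used[i]:
            while not y_used[k]:
                k += 1
            if xi != y[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len(x) + matches / len(y) + (matches - t) / matches) / 3.0

a = [ord(c) for c in "MARTHA"]
b = [ord(c) for c in "MARHTA"]
print(round(jaro(a, b), 4))  # 0.9444
```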
mahalanobis(p, q, VI)
Mahalanobis distance between two vectors with population inverse covariance matrix.
Parameters:
p (float[]): `array<float>` 1D Vector.
q (float[]): `array<float>` 1D Vector.
VI (matrix<float>): `matrix<float>` Inverse of the covariance matrix.
Returns: The Mahalanobis distance between vectors `p` and `q`.
---
people.revoledu.com/kardi/tutorial/Similarity/MahalanobisDistance.html
stat.ethz.ch/R-manual/R-devel/library/stats/html/mahalanobis.html
docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.mahalanobis.html
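The Mahalanobis distance is the square root of the quadratic form `(p - q) VI (p - q)^T`, the same formula SciPy documents. A dependency-free Python sketch follows, with `vi` given as a list of rows; with the identity matrix it reduces to the Euclidean distance.

```python
import math

def mahalanobis(p, q, vi):
    """sqrt((p - q) * VI * (p - q)^T) with VI as a list of rows."""
    d = [a - b for a, b in zip(p, q)]
    n = len(d)
    quad = sum(d[i] * vi[i][j] * d[j] for i in range(n) for j in range(n))
    return math.sqrt(quad)

identity = [[1.0, 0.0], [0.0, 1.0]]
print(mahalanobis([1.0, 2.0], [4.0, 6.0], identity))  # 5.0
```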
fidelity(p, q)
Fidelity distance between provided vectors.
Parameters:
p (float[]): `array<float>` 1D Vector.
q (float[]): `array<float>` 1D Vector.
Returns: The Bhattacharyya Coefficient between vectors `p` and `q`.
---
en.wikipedia.org/wiki/Fidelity_of_quantum_states
bhattacharyya(p, q)
Bhattacharyya distance between provided vectors.
Parameters:
p (float[]): `array<float>` 1D Vector.
q (float[]): `array<float>` 1D Vector.
Returns: The Bhattacharyya distance between vectors `p` and `q`.
---
en.wikipedia.org/wiki/Bhattacharyya_distance
hellinger(p, q)
Hellinger distance between provided vectors.
Parameters:
p (float[]): `array<float>` 1D Vector.
q (float[]): `array<float>` 1D Vector.
Returns: The Hellinger distance between vectors `p` and `q`.
---
en.wikipedia.org/wiki/Hellinger_distance
jamesmccaffrey.wordpress.com/2021/06/07/the-hellinger-distance-between-two-probability-distributions-using-python/
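Fidelity, Bhattacharyya and Hellinger are all built on the same quantity, the Bhattacharyya coefficient `BC = sum(sqrt(p_i * q_i))`. The Python sketch below assumes `p` and `q` are discrete probability distributions (non-negative, summing to 1) and uses the normalizations given on the Wikipedia pages cited above; other sources scale the Hellinger distance differently, and the library's scaling is not stated here.

```python
import math

def fidelity(p, q):
    """Bhattacharyya coefficient: sum of sqrt(p_i * q_i)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def bhattacharyya(p, q):
    """-ln(BC); infinite when the distributions do not overlap."""
    bc = fidelity(p, q)
    return math.inf if bc == 0.0 else -math.log(bc)

def hellinger(p, q):
    """sqrt(1 - BC), clamped at 0 against rounding error."""
    return math.sqrt(max(0.0, 1.0 - fidelity(p, q)))

same = [0.5, 0.5]
print(fidelity(same, same), hellinger(same, same))   # 1.0 0.0
print(hellinger([1.0, 0.0], [0.0, 1.0]))             # 1.0
```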
kumar_hassebrook(p, q)
Kumar Hassebrook distance between provided vectors.
Parameters:
p (float[]): `array<float>` 1D Vector.
q (float[]): `array<float>` 1D Vector.
Returns: The Kumar Hassebrook distance between vectors `p` and `q`.
---
github.com/drostlab/philentropy/blob/7bdefc99f6a7016ad3f90f963d784608edfe74fb/src/distances.h
jaccard(p, q)
Jaccard distance between provided vectors.
Parameters:
p (float[]): `array<float>` 1D Vector.
q (float[]): `array<float>` 1D Vector.
Returns: The Jaccard distance between vectors `p` and `q`.
---
github.com/drostlab/philentropy/blob/7bdefc99f6a7016ad3f90f963d784608edfe74fb/src/distances.h
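In the philentropy source cited above, the Kumar Hassebrook measure is `sum(p*q) / (sum(p^2) + sum(q^2) - sum(p*q))` and the Jaccard distance is one minus that value. The Python sketch below follows those definitions, assuming at least one vector is non-zero; whether the Pine library matches them term for term is an assumption.

```python
def kumar_hassebrook(p, q):
    """sum(p*q) / (sum(p^2) + sum(q^2) - sum(p*q))."""
    pq = sum(a * b for a, b in zip(p, q))
    pp = sum(a * a for a in p)
    qq = sum(b * b for b in q)
    return pq / (pp + qq - pq)

def jaccard(p, q):
    """One minus the Kumar Hassebrook value."""
    return 1.0 - kumar_hassebrook(p, q)

print(kumar_hassebrook([1.0, 2.0], [1.0, 2.0]))  # 1.0 for identical vectors
print(jaccard([1.0, 0.0], [0.0, 1.0]))           # 1.0 for non-overlapping vectors
```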
sorensen(p, q)
Sorensen distance between provided vectors.
Parameters:
p (float[]): `array<float>` 1D Vector.
q (float[]): `array<float>` 1D Vector.
Returns: The Sorensen distance between vectors `p` and `q`.
---
people.revoledu.com/kardi/tutorial/Similarity/BrayCurtisDistance.html
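The Sorensen distance (also known as Bray-Curtis, as in the tutorial cited above) divides the sum of absolute differences by the sum of all components. A minimal Python sketch, assuming non-negative inputs with a non-zero total:

```python
def sorensen(p, q):
    """sum(|a - b|) / sum(a + b)."""
    numerator = sum(abs(a - b) for a, b in zip(p, q))
    denominator = sum(a + b for a, b in zip(p, q))
    return numerator / denominator

print(sorensen([1.0, 2.0], [3.0, 2.0]))  # 2 / 8 = 0.25
```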
chi_square(p, q, eps)
Chi Square distance between provided vectors.
Parameters:
p (float[]): `array<float>` 1D Vector.
q (float[]): `array<float>` 1D Vector.
eps (float)
Returns: The Chi Square distance between vectors `p` and `q`.
---
uw.pressbooks.pub/appliedmultivariatestatistics/chapter/distance-measures/
stats.stackexchange.com/questions/184101/comparing-two-histograms-using-chi-square-distance
itl.nist.gov/div898/handbook/eda/section3/eda35f.htm
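Several variants of the chi-square distance exist. The Python sketch below uses the symmetric form `sum((a - b)^2 / (a + b))` often applied to histograms. The role of `eps` is not documented on this page; treating it as a small constant added to the denominator to avoid division by zero is an assumption, as is the default value.

```python
def chi_square(p, q, eps=1e-10):
    """Symmetric chi-square distance; eps guards empty bins (assumed role)."""
    return sum((a - b) ** 2 / (a + b + eps) for a, b in zip(p, q))

print(chi_square([1.0, 2.0], [3.0, 2.0]))  # close to 1.0
print(chi_square([0.0, 1.0], [0.0, 1.0]))  # 0.0, no division error on the empty bin
```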
kulczynsky(p, q, eps)
Kulczynsky distance between provided vectors.
Parameters:
p (float[]): `array<float>` 1D Vector.
q (float[]): `array<float>` 1D Vector.
eps (float)
Returns: The Kulczynsky distance between vectors `p` and `q`.
---
github.com/drostlab/philentropy/blob/7bdefc99f6a7016ad3f90f963d784608edfe74fb/src/distances.h
Release Notes
v2 - Update to V6. Corrected an issue with the Correlation function where it was calculating the dissimilarity (squared Pearson distance) and not the Pearson correlation.
Pine library
In true TradingView spirit, the author has published this Pine code as an open-source library so that other Pine programmers from our community can reuse it. Cheers to the author! You may use this library privately or in other open-source publications, but reuse of this code in publications is governed by House Rules.
Disclaimer
The information and publications are not meant to be, and do not constitute, financial, investment, trading, or other types of advice or recommendations supplied or endorsed by TradingView. Read more in the Terms of Use.

