For every item i in a grocery store, a set si is used to represent the IDs of transactions in which i is purchased. Assume that the data set to be analyzed contains hundreds of thousands of such transactions.
To analyze the proximity between any two of these sets si and sj, which measure, Jaccard or Hamming, would be more appropriate, and why?
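Jaccard is the more appropriate measure.
Each set si can be viewed as a binary vector with one entry per transaction: 1 if item i was purchased in that transaction, 0 otherwise. With hundreds of thousands of transactions, a typical item appears in only a small fraction of them, so these vectors are very sparse.
Hamming distance counts the positions where the two vectors differ, so every transaction that contains neither item counts as agreement. Since almost all transactions contain neither i nor j, nearly every pair of items would look close under Hamming, even two items that are never bought together.
Jaccard similarity, |si ∩ sj| / |si ∪ sj|, ignores the transactions that contain neither item and looks only at those containing at least one of them. It is 1 when the two sets are identical and 0 when they share no transactions.
So Jaccard measures how often the two items are actually purchased together, which is the proximity we care about here.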
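To make the comparison concrete, here is a minimal sketch that computes both measures on sets of transaction IDs. The item names, the set contents, and the total of 100,000 transactions are hypothetical values chosen for illustration, and the "Hamming similarity" used here (the fraction of transactions on which the two items agree) is just one simple way to put Hamming distance on a 0 to 1 scale.

```python
def jaccard_similarity(a, b):
    """|a & b| / |a | b|; transactions containing neither item are ignored."""
    union = a | b
    if not union:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def hamming_distance(a, b):
    """Number of transactions that contain exactly one of the two items."""
    return len(a ^ b)


def hamming_similarity(a, b, n_transactions):
    """Fraction of all transactions on which the items agree (0-0 or 1-1)."""
    return 1 - hamming_distance(a, b) / n_transactions


# Hypothetical data: 100,000 transactions, each item bought in few of them.
N = 100_000
milk = set(range(0, 1000))           # bought in transactions 0..999
bread = set(range(500, 1500))        # shares 500 transactions with milk
caviar = set(range(50_000, 50_010))  # never bought together with milk

print("Jaccard milk/bread: ", jaccard_similarity(milk, bread))
print("Jaccard milk/caviar:", jaccard_similarity(milk, caviar))
print("Hamming milk/bread: ", hamming_similarity(milk, bread, N))
print("Hamming milk/caviar:", hamming_similarity(milk, caviar, N))
```

With this data, Jaccard gives about 0.33 for milk and bread and 0 for milk and caviar, while the Hamming-based similarity is about 0.99 for both pairs, because the transactions containing neither item dominate the count.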