Programming Challenge Description:
Find the most similar pair using cosine
Given a sequence of passages, find the most similar pair using the cosine similarity measure. If there are ties (more than one pair have the same similarity score), return the one with the leftmost passage. If there’s only 1 passage in a sequence, return it. Do not return similarity of a passage with itself.
How to represent a passage in the vocabulary space:
The vocabulary space includes all words encountered in any of the passages ignoring their case. Any non-alphanumeric characters are also ignored, spaces trimmed.
A passage is represented as a vector of its word frequencies in a vocabulary space.
For example, if we have a set of passages “I like ice cream”, “My sister likes ice cream too”, the vocabulary space would include “coordinates”: [I, like, likes, ice, cream, my, sister, too]. Note that (1) we do not ask to bring different wordforms o a common lexeme (as “like” and “likes”) so they are considered different words; (2) we consider complex words as separate lexemes (e.g., “ice” and “cream” from “ice cream”). Then the first passage will be represented as [1, 1, 0, 1, 1, 0, 0, 0] and the second passage will be represented as [0, 0, 1, 1, 1, 1, 1, 1].
The cosine measure (click Attachment above for equation image):
The cosine between 2 vectors A and B equals the scalar product of the vectors divided by the product of vector length.
For example, in a 3 dimensional space for vector A=(1, 2, 2) and vector B=(1, 0, 0), the cosine measure is (1*1 + 2*0 + 2*0) / ((1^2 + 2^2 + 2^2)^(1/2) * (1^2 + 0^2 + 0^2)^(1/2)) = 1/3
The larger the cosine (reminder: cosine max value is 1), the more similar passages are.
Input:
A single line comprising the passages (strings) to be processed, delimited by | characters. The | characters are not considered part of any passage.
Output:
The comma-separated (no space) numbers of the most similar passages. For a single passage return 0.
Test 1
Test Input:
IBM cognitive computing|IBM "cognitive" computing is a revolution| ibm cognitive computing|'IBM Cognitive Computing' is a revolution?
Expected Output
0,2
Test 2
Test Input
The cat is on the mat | The cat likes the mat | The dog is on the mat | The dog chews the mat
Expected Output
0,2