Comparison
Overview
A comparison operator evaluates two inputs and computes the similarity based on a user-defined distance measure and a user-defined threshold.
The distance measure always outputs 0 for a perfect match, and a higher value for an imperfect match. Only distance values between 0 and threshold will result in a positive similarity score. Therefore it is important to know how the distance measures work and what the range of their output values is in order to set a threshold value sensibly.
Parameters
Parameter | Description |
---|---|
required | If required is true, the parent aggregation only yields a confidence value if the given inputs have values for both instances. |
weight | Weight of this comparison. The weight is used by some aggregations such as the weighted average aggregation. |
threshold | The maximum distance. For normalized distance measures, the threshold should be between 0.0 and 1.0. |
distanceMeasure | The distance measure to use. For a list of available distance measures, see below. |
Inputs | The two inputs for the comparison. |
Examples
XML
<Compare metric="levenshteinDistance" threshold="2.0" required="true">
  <TransformInput function="lowerCase">
    <Input path="?a/rdfs:label"/>
  </TransformInput>
  <TransformInput function="lowerCase">
    <Input path="?b/rdfs:label"/>
  </TransformInput>
</Compare>
Scala API
Comparison(
  id = "labels",
  required = false,
  weight = 1,
  threshold = 2.0,
  metric = LevenshteinDistance(),
  inputs = PathInput(path = Path.parse("?a/rdfs:label")) ::
           PathInput(path = Path.parse("?b/rdfs:label")) :: Nil
)
Threshold
The threshold is used to convert the computed distance to a confidence between -1.0 and 1.0. Links are generated for confidences above 0, and higher confidence values imply a higher similarity between the compared entities.
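The conversion from distance to confidence can be pictured with a small sketch. The exact formula is not given here, so the linear mapping below is an assumption chosen to agree with the description: a distance of 0 yields 1.0, a distance equal to the threshold yields 0.0, and larger distances fall towards -1.0. The function name `confidence` is hypothetical.

```python
def confidence(distance: float, threshold: float) -> float:
    """Map a distance to a confidence in [-1.0, 1.0].

    Assumption: a linear mapping, clamped at -1.0. This is an
    illustration, not necessarily the formula the framework uses.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    if threshold <= 0:
        # Assumption: with a zero threshold only a perfect match counts.
        return 1.0 if distance == 0 else -1.0
    # distance 0 -> 1.0, distance == threshold -> 0.0,
    # distance >= 2 * threshold -> -1.0
    return max(-1.0, 1.0 - distance / threshold)
```

Under this assumption, a Levenshtein distance of 1 with a threshold of 2.0 gives a confidence of 0.5, while the same distance with a threshold of 0.5 gives -1.0. This is why the threshold has to be chosen with the output range of the distance measure in mind.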
Distance Measures
Character-Based Distance Measures
Character-based distance measures compare strings on the character level. They are well suited for handling typographical errors.
Measure | Description | Normalized |
---|---|---|
levenshteinDistance | Levenshtein distance. The minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. | No |
levenshtein | The Levenshtein distance normalized to the interval [0,1]. | Yes |
jaro | Jaro distance metric. Simple distance metric originally developed to compare person names. | Yes |
jaroWinkler | Jaro-Winkler distance measure. The Jaro-Winkler distance metric is designed for, and best suited to, short strings such as person names. | Yes |
equality | 0 if strings are equal, 1 otherwise. | Yes |
inequality | 1 if strings are equal, 0 otherwise. | Yes |
Example:
<Compare metric="levenshteinDistance" threshold="2">
<Input path="?a/rdfs:label" />
<Input path="?b/gn:name" />
</Compare>
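To get a feeling for the values that `levenshteinDistance` and `levenshtein` produce, the following sketch computes both. It is a textbook dynamic-programming version, not the framework's own code. Dividing by the length of the longer string is an assumption about how the normalized variant reaches the interval [0,1], and the function names are hypothetical.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    or substitutions needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Levenshtein distance scaled to [0, 1].

    Assumption: normalization divides by the longer string's length.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty strings are a perfect match
    return levenshtein_distance(a, b) / longest
```

For instance, `levenshtein_distance("Berlin", "Berlln")` is 1, which lies within the threshold of 2 used in the example above.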
Token-Based Distance Measures
While character-based distance measures work well for typographical errors, there are a number of tasks where token-based distance measures are better suited:
- Strings where parts are reordered, e.g. “John Doe” and “Doe, John”
- Texts consisting of multiple words
Measure | Description | Normalized |
---|---|---|
jaccard | Jaccard distance coefficient. | Yes |
dice | Dice distance coefficient. | Yes |
softjaccard | Soft Jaccard similarity coefficient. Same as the Jaccard distance, but values within a Levenshtein distance of ‘maxDistance’ are considered equivalent. | Yes |
Example:
<Compare metric="jaccard" threshold="0.2">
  <TransformInput function="tokenize">
    <Input path="?a/rdfs:label" />
  </TransformInput>
  <TransformInput function="tokenize">
    <Input path="?b/gn:name" />
  </TransformInput>
</Compare>
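The sketch below shows how the `jaccard` and `dice` measures behave on token sets, using the standard set-based definitions. The helper `tokenize` is a hypothetical stand-in that lowercases and splits on non-word characters. The framework's `tokenize` transformation may split differently, and the handling of two empty token sets is likewise an assumption.

```python
import re


def tokenize(text: str) -> set[str]:
    """Hypothetical tokenizer: lowercase, then split on non-word characters."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B|; 0.0 means identical token sets."""
    if not a and not b:
        return 0.0  # assumption: two empty sets count as a perfect match
    return 1.0 - len(a & b) / len(a | b)


def dice_distance(a: set[str], b: set[str]) -> float:
    """1 - 2|A ∩ B| / (|A| + |B|); 0.0 means identical token sets."""
    if not a and not b:
        return 0.0  # assumption: two empty sets count as a perfect match
    return 1.0 - 2 * len(a & b) / (len(a) + len(b))
```

With these definitions, `jaccard_distance(tokenize("John Doe"), tokenize("Doe, John"))` is 0.0, because reordering the parts does not change the token set. A character-based measure would report a large distance for the same pair.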
Special Purpose Distance Measures
A number of distance measures are available that are designed to compare specific types of data, e.g., numeric values.
Measure | Description | Normalized |
---|---|---|
num | Computes the numeric difference between two numbers. Parameters: minValue, maxValue (the minimum and maximum values which occur in the data source). | No |
date | Computes the distance between two dates. Returns the difference in days. | No |
dateTime | Computes the distance between two date time values. Returns the difference in seconds. | No |
wgs84 | Computes the geographical distance between two points. Parameters: unit (the unit in which the distance is measured; allowed values: “meter” or “m”, “kilometer” or “km”). Author: Konrad Höffner | No |
Example:
<Compare metric="wgs84" threshold="50">
<Input path="?a/wgs84:geometry" />
<Input path="?b/wgs84:geometry" />
<Param name="unit" value="km"/>
</Compare>
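As a rough guide to the values a geographical measure returns, the sketch below computes the great-circle distance between two points with the haversine formula. This is an assumption for illustration: the text does not state which formula `wgs84` uses or how the geometry values are encoded, so the function takes plain latitude and longitude in decimal degrees. The function name and the Earth radius constant are also choices of this sketch.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters (assumption)


def wgs84_distance(lat1: float, lon1: float,
                   lat2: float, lon2: float,
                   unit: str = "km") -> float:
    """Great-circle distance between two points (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    meters = 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(h)))
    if unit in ("meter", "m"):
        return meters
    if unit in ("kilometer", "km"):
        return meters / 1000.0
    raise ValueError(f"unsupported unit: {unit!r}")
```

Two points roughly 27 km apart, such as approximate coordinates for Berlin (52.52, 13.405) and Potsdam (52.3906, 13.0645), would fall within the 50 km threshold used in the example above.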