Comparison

Version 22, last updated by aweilandt at 2014-06-26

Overview

A comparison operator evaluates two inputs and computes the similarity based on a user-defined distance measure and a user-defined threshold.

The distance measure always outputs 0 for a perfect match and a higher value for an imperfect match. Only distance values between 0 and the threshold result in a positive similarity score. It is therefore important to know how each distance measure works and what range of values it outputs in order to choose a sensible threshold.

Parameters

Parameter | Description
required | If required is true, the parent aggregation yields a confidence value only if the given inputs have values for both instances.
weight | Weight of this comparison. The weight is used by some aggregations, such as the weighted average aggregation.
threshold | The maximum distance. For normalized distance measures, the threshold should be between 0.0 and 1.0.
distanceMeasure | The distance measure to use (the metric attribute in the examples below). For a list of available distance measures, see below.
inputs | The two inputs for the comparison.
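
To make the roles of required and weight more concrete, here is a minimal Scala sketch of a weighted average aggregation. It is not the framework's implementation: the names Result and weightedAverage are hypothetical, and the handling of missing values for non-required comparisons (they are simply skipped) is an assumption.

```scala
object AggregationSketch {
  // Hypothetical result of one comparison: None means an input had no value.
  final case class Result(confidence: Option[Double], weight: Int, required: Boolean)

  // Weighted average over the available confidences.
  // Yields None if a required comparison has no value, or if nothing is available.
  def weightedAverage(results: Seq[Result]): Option[Double] = {
    if (results.exists(r => r.required && r.confidence.isEmpty)) None
    else {
      // Assumption: non-required comparisons without a value are skipped.
      val available = results.collect { case Result(Some(c), w, _) => (c, w) }
      val totalWeight = available.map(_._2).sum
      if (totalWeight == 0) None
      else Some(available.map { case (c, w) => c * w }.sum / totalWeight)
    }
  }
}
```

In this sketch, a comparison with weight 3 contributes three times as much to the result as one with weight 1.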

Examples

XML

<Compare metric="levenshteinDistance" threshold="2.0" required="true">
  <TransformInput function="lowerCase">
    <Input path="?a/rdfs:label"/>
  </TransformInput>
  <TransformInput function="lowerCase">
    <Input path="?b/rdfs:label"/>
  </TransformInput>
</Compare>

Scala API

Comparison(
  id = "labels",
  required = false,
  weight = 1,
  threshold = 2.0,
  metric = LevenshteinDistance(),
  inputs = PathInput(path = Path.parse("?a/rdfs:label")) ::
           PathInput(path = Path.parse("?b/rdfs:label")) :: Nil
)

Threshold

The threshold is used to convert the computed distance to a confidence between -1.0 and 1.0. Links are generated for confidences above 0, and higher confidence values imply a higher similarity between the compared entities.
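
The following Scala sketch illustrates one way such a conversion could work. The linear formula and the name confidence are assumptions chosen to match the properties stated on this page (a distance of 0 gives 1.0, a distance equal to the threshold gives 0.0, and the result never drops below -1.0); the actual implementation may differ.

```scala
object ThresholdSketch {
  // Hypothetical helper, not part of the documented API.
  // Assumes a linear mapping: distance 0 -> 1.0, distance == threshold -> 0.0,
  // and any distance of at least twice the threshold is clipped to -1.0.
  def confidence(distance: Double, threshold: Double): Double = {
    require(distance >= 0.0, "distance must be non-negative")
    require(threshold > 0.0, "threshold must be positive in this sketch")
    math.max(-1.0, 1.0 - distance / threshold)
  }
}
```

Under these assumptions, with a threshold of 2.0 a distance of 1.0 yields a confidence of 0.5, a distance of 2.0 yields 0.0 (no link), and a distance of 5.0 yields -1.0.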


Distance Measures

Character-Based Distance Measures

Character-based distance measures compare strings on the character level. They are well suited for handling typographical errors.

Measure | Description | Normalized
levenshteinDistance | Levenshtein distance. The minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. | No
levenshtein | The Levenshtein distance normalized to the interval [0,1]. | Yes
jaro | Jaro distance metric. Simple distance metric originally developed to compare person names. | Yes
jaroWinkler | Jaro-Winkler distance measure. Designed for and best suited to short strings such as person names. | Yes
equality | 0 if strings are equal, 1 otherwise. | Yes
inequality | 1 if strings are equal, 0 otherwise. | Yes

Example:

<Compare metric="levenshteinDistance" threshold="2">
  <Input path="?a/rdfs:label" />
  <Input path="?b/gn:name" />
</Compare>
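
To help judge what a threshold such as 2 means for levenshteinDistance, the following Scala sketch computes the edit distance with the standard dynamic-programming algorithm. It is a stand-alone illustration, not the framework's code. The normalization used in levenshtein (dividing by the length of the longer string) is an assumption about how the [0,1] variant could be derived.

```scala
object LevenshteinSketch {
  // Edit distance with insertion, deletion and substitution, each costing 1.
  // Only the previous row of the dynamic-programming table is kept.
  def levenshteinDistance(a: String, b: String): Int = {
    var prev = Array.range(0, b.length + 1) // distance from "" to each prefix of b
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i // distance from the first i characters of a to ""
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(math.min(curr(j - 1) + 1, prev(j) + 1), prev(j - 1) + cost)
      }
      prev = curr
    }
    prev(b.length)
  }

  // Assumption: normalize by the length of the longer string to reach [0, 1].
  def levenshtein(a: String, b: String): Double = {
    val maxLen = math.max(a.length, b.length)
    if (maxLen == 0) 0.0 else levenshteinDistance(a, b).toDouble / maxLen
  }
}
```

For example, "kitten" and "sitting" have an edit distance of 3, so they would not fall within a threshold of 2, whereas "Berlin" and "Berlim" (distance 1) would.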

Token-Based Distance Measures

While character-based distance measures work well for typographical errors, there are a number of tasks where token-based distance measures are better suited:

  • Strings where parts are reordered, e.g., “John Doe” and “Doe, John”
  • Texts consisting of multiple words

Measure | Description | Normalized
jaccard | Jaccard distance coefficient. | Yes
dice | Dice distance coefficient. | Yes
softjaccard | Soft Jaccard distance coefficient. Same as the Jaccard distance, but values within a Levenshtein distance of ‘maxDistance’ are considered equivalent. | Yes

Example:

<Compare metric="jaccard" threshold="0.2">
  <TransformInput function="tokenize">
    <Input path="?a/rdfs:label" />
  </TransformInput>
  <TransformInput function="tokenize">
    <Input path="?b/gn:name" />
  </TransformInput>
</Compare>
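
The Scala sketch below shows how Jaccard and Dice distances behave on token sets, using the standard set-based definitions. It is an illustration only: the tokens helper is a simplified, hypothetical stand-in for the tokenize transformation, and treating two empty sets as distance 0 is an assumption.

```scala
object TokenDistanceSketch {
  // Simplified stand-in for tokenization: split on anything that is not a letter or digit.
  def tokens(s: String): Set[String] =
    s.split("[^\\p{L}\\p{N}]+").filter(_.nonEmpty).toSet

  // Jaccard distance: 1 - |A intersect B| / |A union B|
  def jaccard(a: Set[String], b: Set[String]): Double = {
    val unionSize = (a union b).size
    if (unionSize == 0) 0.0 else 1.0 - (a intersect b).size.toDouble / unionSize
  }

  // Dice distance: 1 - 2 * |A intersect B| / (|A| + |B|)
  def dice(a: Set[String], b: Set[String]): Double = {
    val total = a.size + b.size
    if (total == 0) 0.0 else 1.0 - 2.0 * (a intersect b).size / total
  }
}
```

Because the token sets of “John Doe” and “Doe, John” are identical, both distances are 0 for that pair, which is why token-based measures handle reordering well.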

Special Purpose Distance Measures

A number of distance measures are available that are designed to compare specific types of data, e.g., numeric values.

Measure | Description | Normalized
num | Computes the numeric difference between two numbers. Parameters: minValue, maxValue (the minimum and maximum values which occur in the data source). | No
date | Computes the distance between two dates. Returns the difference in days. | No
dateTime | Computes the distance between two date time values. Returns the difference in seconds. | No
wgs84 | Computes the geographical distance between two points. Parameters: unit (the unit in which the distance is measured; allowed values: “meter” or “m”, “kilometer” or “km”). Author: Konrad Höffner. | No
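
Since the date and dateTime measures are not normalized, the threshold is expressed in days or seconds respectively. The Scala sketch below illustrates the units using java.time. It assumes ISO-8601 input (for example 2014-06-26 and 2014-06-26T12:00:00Z), which this page does not specify, and the function names are hypothetical.

```scala
import java.time.{LocalDate, OffsetDateTime}
import java.time.temporal.ChronoUnit

object DateDistanceSketch {
  // Absolute difference in whole days between two ISO-8601 dates (yyyy-MM-dd).
  def dateDistance(a: String, b: String): Long =
    math.abs(ChronoUnit.DAYS.between(LocalDate.parse(a), LocalDate.parse(b)))

  // Absolute difference in seconds between two ISO-8601 date-times with an offset.
  def dateTimeDistance(a: String, b: String): Long =
    math.abs(ChronoUnit.SECONDS.between(OffsetDateTime.parse(a), OffsetDateTime.parse(b)))
}
```

For instance, 2014-06-26 and 2014-07-01 are 5 days apart, so a date comparison of these values would need a threshold greater than 5 to produce a positive score.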

Example:

<Compare metric="wgs84" threshold="50">
  <Input path="?a/wgs84:geometry" />
  <Input path="?b/wgs84:geometry" />
  <Param name="unit" value="km"/>
</Compare>
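
To give a feel for the magnitude of geographical distances when picking a threshold, the following Scala sketch computes a great-circle distance with the haversine formula. This page does not state which formula wgs84 uses, so the haversine formula, the Earth radius constant and the function signature are all assumptions made for illustration.

```scala
object GeoDistanceSketch {
  // Mean Earth radius in meters; an approximation (assumption of this sketch).
  private val EarthRadiusMeters = 6371000.0

  // Great-circle distance between two (latitude, longitude) points given in degrees.
  // The unit values mirror the ones listed for the wgs84 measure.
  def haversine(lat1: Double, lon1: Double, lat2: Double, lon2: Double, unit: String): Double = {
    val dLat = math.toRadians(lat2 - lat1)
    val dLon = math.toRadians(lon2 - lon1)
    val h = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
        math.pow(math.sin(dLon / 2), 2)
    val meters = 2 * EarthRadiusMeters * math.asin(math.sqrt(h))
    unit match {
      case "meter" | "m"      => meters
      case "kilometer" | "km" => meters / 1000.0
      case other              => throw new IllegalArgumentException(s"Unsupported unit: $other")
    }
  }
}
```

With this approximation, one degree of longitude at the equator is roughly 111 km, so a threshold of 50 with unit km only scores points positively when they are less than 50 km apart.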
