Package 'reclin2' reference manual

Title:	Record Linkage Toolkit
Description:	Functions to assist in performing probabilistic record linkage and deduplication: generating pairs, comparing records, em-algorithm for estimating m- and u-probabilities (I. Fellegi & A. Sunter (1969) <doi:10.1080/01621459.1969.10501049>, T.N. Herzog, F.J. Scheuren, & W.E. Winkler (2007), "Data Quality and Record Linkage Techniques", ISBN:978-0-387-69502-0), forcing one-to-one matching. Can also be used for pre- and post-processing for machine learning methods for record linkage. Focus is on memory, CPU performance and flexibility.
Authors:	Jan van der Laan [aut, cre]
Maintainer:	Jan van der Laan <[email protected]>
License:	GPL-3
Version:	0.5.0
Built:	2025-03-05 04:47:24 UTC
Source:	https://github.com/djvanderlaan/reclin2

Add a variable from one of the data sets to pairs

Description

Add a variable from one of the data sets to pairs

Usage

add_from_x(pairs, variable, new_variable = variable, ...)

add_from_y(pairs, variable, new_variable = variable, ...)

Arguments

`pairs`	`data.table` with pairs. Should contain the columns `.x` and `.y`.
`variable`	name of the variable that should be added
`new_variable`	optional variable name of the new variable in `pairs`. When omitted `variable` is used.
`...`	other parameters are passed on to `compare_vars`. Especially `inplace`, `x` and `y` might be of interest.

Value

Returns the pairs with the column added. When inplace = TRUE pairs is returned invisibly and the original pairs is modified.
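
A minimal usage sketch, assuming the signatures listed under Usage and the `linkexample1` and `linkexample2` data sets that ship with the package. The column names `id_x` and `id_y` are chosen here purely for illustration.

```r
library(reclin2)
data("linkexample1", "linkexample2")

pairs <- pair_blocking(linkexample1, linkexample2, "postcode")

# Copy the id column of each data set into the pairs under a new name
pairs <- add_from_x(pairs, "id", "id_x")
pairs <- add_from_y(pairs, "id", "id_y")

# id_x and id_y can now be compared directly, e.g. to flag true matches
head(pairs)
```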

Call a function on each of the worker nodes and pass it the pairs

Description

Call a function on each of the worker nodes and pass it the pairs

Usage

cluster_call(pairs, fun, ...)

Arguments

`pairs`	an object of type `cluster_pairs`, as created for example by `cluster_pair`.
`fun`	a function to call on each of the worker nodes. See details on the arguments of this function.
`...`	additional arguments are passed on to `fun`.

Details

The function will have to accept the following arguments as its first three arguments:

pairs: the data.table with the pairs of the worker node.
x: a data.table with the portion of x present on the worker node.
y: a data.table with y.

Value

The function returns a list containing, for each worker, the result of the function call. When the functions return NULL, the result is returned invisibly. Because the result is sent back to the main node, make sure you do not accidentally return all pairs. If you do not want to return anything, end your function with NULL.

Examples

# Generate some pairs
library(parallel)
data("linkexample1", "linkexample2")
cl <- makeCluster(2)

pairs <- cluster_pair(cl, linkexample1, linkexample2)
compare_pairs(pairs, c("lastname", "firstname", "address", "sex"))

# Add a new column to pairs
cluster_call(pairs, function(pairs, ...) {
  pairs[, name := firstname & lastname]
  # we don't want to return the pairs; so make sure to return something
  # else
  NULL
})

# Get the number of pairs on each node
lengths <- cluster_call(pairs, function(pairs, ...) {
  nrow(pairs)
})
lengths <- unlist(lengths)
lengths

# Cleanup
stopCluster(cl)

Collect pairs from cluster nodes

Description

Collect pairs from cluster nodes

Usage

cluster_collect(pairs, select = NULL, clear = FALSE)

Arguments

`pairs`	an object of type `cluster_pairs`, as created for example by `cluster_pair`.
`select`	the name of a logical column that is used to select the pairs that will be collected
`clear`	remove the pairs from the cluster nodes

Value

Returns an object of type pairs, which is a data.table. This object can be used as a regular (non-cluster) set of pairs.

Examples

library(parallel)
data("linkexample1", "linkexample2")
cl <- makeCluster(2)


pairs <- cluster_pair(cl, linkexample1, linkexample2)
local_pairs <- cluster_collect(pairs, clear = FALSE)

compare_pairs(pairs, c("lastname", "firstname", "address", "sex"))
model <- problink_em(~ lastname + firstname + address + sex, data = pairs)
predict(model, pairs, type = "mpost", add = TRUE, binary = TRUE)
# Select pairs with mpost > 0.5
select_threshold(pairs, "selected", "mpost", 0.5)
# Collect the selected pairs
local_pairs <- cluster_collect(pairs, "selected")

stopCluster(cl)

Call a function on each of the worker nodes to modify the pairs on the node

Description

Call a function on each of the worker nodes to modify the pairs on the node

Usage

cluster_modify_pairs(pairs, fun, ..., new_name = NULL)

Arguments

`pairs`	an object of type `cluster_pairs`, as created for example by `cluster_pair`.
`fun`	a function to call on each of the worker nodes. See details on the arguments of this function.
`...`	additional arguments are passed on to `fun`.
`new_name`	name of new object to assign the pairs to on the cluster nodes.

Details

The function will have to accept the following arguments as its first three arguments:

pairs: the data.table with the pairs of the worker node.
x: a data.table with the portion of x present on the worker node.
y: a data.table with y.

The function should return either a data.table with the new pairs or NULL. When a data.table is returned, it replaces the existing pairs when new_name is missing; otherwise it is stored as a new set of pairs under the name new_name. When the function returns NULL, it is assumed that the function modified the pairs by reference (e.g. using pairs[, new_var := new_val]). Note that this also means that new_name is ignored in that case.

Value

Will return a cluster_pairs object. When new_name is not given it will return the input pairs invisibly. Otherwise it will return a new cluster_pairs object.

Examples

# Generate some pairs
library(parallel)
data("linkexample1", "linkexample2")
cl <- makeCluster(2)

pairs <- cluster_pair(cl, linkexample1, linkexample2)
compare_pairs(pairs, c("lastname", "firstname", "address", "sex"))

# Create a new set of pairs containing a random sample of the original
# pairs.
sample <- cluster_modify_pairs(pairs, new_name = "sample", function(pairs, ...) {
  sel <- sample(nrow(pairs), round(nrow(pairs)*0.1))
  pairs[sel, ]
})

# Cleanup
stopCluster(cl)

Generate all possible pairs using multiple processes

Description

Generates all combinations of records from x and y.

Usage

cluster_pair(cluster, x, y, deduplication = FALSE, name = "default")

Arguments

`cluster`	a cluster object as created by `makeCluster` from `parallel` or from the `snow` package.
`x`	first `data.frame`
`y`	second `data.frame`. Ignored when `deduplication = TRUE`.
`deduplication`	generate pairs from only `x`. Ignore `y`. This is useful for deduplication of `x`.
`name`	the name of the resulting object to create locally on the different R processes.

Details

Generating (all) pairs of the records of two data sets, is usually the first step when linking the two data sets.

x is split into length(cluster) parts, which are distributed over the worker nodes. y is copied to each of the nodes. pair is then called on each node. The pairs are stored on the nodes in the global object reclin_env, in the variable given by name. The pairs can then be further processed using functions such as compare_pairs and tabulate_patterns. The function cluster_collect collects the pairs from each of the nodes.
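
The distribution scheme described above can be sketched in base R. This is only an illustration: the worker nodes are simulated sequentially with `lapply`, and the round-robin assignment of rows to nodes is an assumption of this sketch, not necessarily how the package splits x.

```r
# Toy data; in practice x and y are the data sets to link
x <- data.frame(id = 1:6)
y <- data.frame(id = 1:5)
n_nodes <- 2  # stands in for length(cluster)

# Assign each row of x to one node (round-robin; an assumption of this sketch)
node_of_row <- rep_len(seq_len(n_nodes), nrow(x))
x_parts <- split(seq_len(nrow(x)), node_of_row)

# Each "node" pairs its part of x with the complete y.
# lapply() runs sequentially here; on a real cluster every part
# would live on its own worker process.
pairs_per_node <- lapply(x_parts, function(rows) {
  expand.grid(.x = rows, .y = seq_len(nrow(y)))
})

# Number of pairs held by each node
n_pairs <- vapply(pairs_per_node, nrow, integer(1))
n_pairs

# Together the nodes hold every combination of x and y exactly once
sum(n_pairs) == nrow(x) * nrow(y)
```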

Value

An object of type cluster_pairs, which is a list containing the cluster and the name of the pairs object on the cluster nodes. For the pairs objects created on the nodes, see the documentation of pair.

Examples

library(parallel)
data("linkexample1", "linkexample2")
cl <- makeCluster(2)

pairs <- cluster_pair(cl, linkexample1, linkexample2)
stopCluster(cl)

Generate pairs using simple blocking using multiple processes

Description

Generates all combinations of records from x and y where the blocking variables are equal.

Usage

cluster_pair_blocking(
  cluster,
  x,
  y,
  on,
  deduplication = FALSE,
  name = "default"
)

Arguments

`cluster`	a cluster object as created by `makeCluster` from `parallel` or from the `snow` package.
`x`	first `data.frame`
`y`	second `data.frame`. Ignored when `deduplication = TRUE`.
`on`	the variables defining the blocks or strata for which all pairs of `x` and `y` will be generated.
`deduplication`	generate pairs from only `x`. Ignore `y`. This is useful for deduplication of `x`.
`name`	the name of the resulting object to create locally on the different R processes.

Details

Generating (all) pairs of the records of two data sets is usually the first step when linking the two data sets. However, this often results in too many pairs. Therefore, blocking is usually applied.

x is split into length(cluster) parts, which are distributed over the worker nodes. y is copied to each of the nodes. pair_blocking is then called on each node. The pairs are stored on the nodes in the global object reclin_env, in the variable given by name. The pairs can then be further processed using functions such as compare_pairs and tabulate_patterns. The function cluster_collect collects the pairs from each of the nodes.

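Value

An object of type cluster_pairs, which is a list containing the cluster and the name of the pairs object on the cluster nodes. For the pairs objects created on the nodes, see the documentation of pair_blocking.
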
Examples

library(parallel)
data("linkexample1", "linkexample2")
cl <- makeCluster(2)

pairs <- cluster_pair_blocking(cl, linkexample1, linkexample2, "postcode")
stopCluster(cl)

Generate pairs with a minimal similarity using multiple processes

Description

Generates all combinations of records from x and y whose similarity score is at least minsim.

Usage

cluster_pair_minsim(
  cluster,
  x,
  y,
  on,
  minsim = 0,
  on_blocking = character(0),
  comparators = list(default_comparator),
  default_comparator = cmp_identical(),
  keep_simsum = TRUE,
  deduplication = FALSE,
  name = "default"
)

Arguments

`cluster`	a cluster object as created by `makeCluster` from `parallel` or `makeCluster` from `snow`.
`x`	first `data.frame`
`y`	second `data.frame`. Ignored when `deduplication = TRUE`.
`on`	the variables on which the pairs of records from `x` and `y` are compared.
`minsim`	minimal similarity score.
`on_blocking`	variables for which the pairs have to match.
`comparators`	named list of functions with which the variables are compared. Each function should accept two vectors and should return either a vector or a `data.table` with multiple columns.
`default_comparator`	variables for which no comparison function is defined in `comparators` are compared with the function `default_comparator`.
`keep_simsum`	add a variable `simsum` to the result with the similarity score of the pair.
`deduplication`	generate pairs from only `x`. Ignore `y`. This is useful for deduplication of `x`.
`name`	the name of the resulting object to create locally on the different R processes.

Details

Generating (all) pairs of the records of two data sets is usually the first step when linking the two data sets. However, this often results in too many pairs. pair_minsim will only keep pairs with a similarity score equal to or larger than minsim. The similarity score is calculated by summing the results of the comparators for all variables of on.
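
The similarity filter can be illustrated with a small base R sketch. It assumes exact equality as the comparator for every variable, so each variable contributes 0 or 1 to the score; the toy data and the column name `simsum` are illustrative choices, and this is not the package's implementation.

```r
x <- data.frame(
  postcode = c("1234", "1234", "5678"),
  address  = c("Main St 1", "High St 2", "Main St 1"),
  stringsAsFactors = FALSE
)
y <- data.frame(
  postcode = c("1234", "9999"),
  address  = c("Main St 1", "Main St 1"),
  stringsAsFactors = FALSE
)

# All combinations of row numbers from x and y
p <- expand.grid(.x = seq_len(nrow(x)), .y = seq_len(nrow(y)))

# Sum the per-variable comparison results (TRUE counts as 1)
p$simsum <- (x$postcode[p$.x] == y$postcode[p$.y]) +
  (x$address[p$.x] == y$address[p$.y])

# Keep only the pairs that reach the minimal similarity
minsim <- 1
kept <- p[p$simsum >= minsim, ]
kept
```

With `minsim = 1` a pair is kept as soon as either the postcode or the address agrees; raising `minsim` to 2 would require both to agree.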

x is split into length(cluster) parts, which are distributed over the worker nodes. y is copied to each of the nodes. pair_minsim is then called on each node. The pairs are stored on the nodes in the global object reclin_env, in the variable given by name. The pairs can then be further processed using functions such as compare_pairs and tabulate_patterns. The function cluster_collect collects the pairs from each of the nodes.

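Value

An object of type cluster_pairs, which is a list containing the cluster and the name of the pairs object on the cluster nodes. For the pairs objects created on the nodes, see the documentation of pair_minsim.
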
Examples

library(parallel)
data("linkexample1", "linkexample2")
cl <- makeCluster(2)

# Either address or postcode has to match to keep a pair
pairs <- cluster_pair_minsim(cl, linkexample1, linkexample2, 
   on = c("postcode", "address"), minsim = 1)
stopCluster(cl)

Comparison functions

Description

Comparison functions

Usage

cmp_identical()

cmp_jarowinkler(threshold = 0.95)

jaro_winkler(threshold = 0.8)

cmp_lcs(threshold = 0.8)

lcs(threshold = 0.8)

cmp_jaccard(threshold = 0.8)

jaccard(threshold = 0.8)

Arguments

`threshold`	threshold to use for the Jaro-Winkler string distance when creating a binary result.

Details

A comparison function should accept two arguments, both vectors. When the function is called with both arguments, it should compare the elements in the first vector to those in the second. When called in this way, both vectors have the same length. What the function should return depends on the methods used to score the pairs. Usually the comparison functions return a similarity score, with a value of 0 indicating complete difference and a value > 0 indicating similarity (often a value of 1 will indicate perfect similarity).

Some methods, such as problink_em, can handle similarity scores, but also need binary values (0/FALSE = complete dissimilarity; 1/TRUE = complete similarity). In order to allow for this the comparison function is called with one argument.

When the comparison is called with one argument, it is passed the result of a previous comparison. The function should translate that result to a binary (TRUE/FALSE or 1/0) result. The result should not contain missing values.

The jaro_winkler, lcs and jaccard functions use the corresponding methods from stringdist except that they are transformed from a distance to a similarity score.
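
The two calling conventions described above can be sketched with a user-defined comparator. `cmp_numeric_close`, `max_diff` and the 0.5 default threshold are hypothetical names and values introduced for this illustration; they are not part of the package.

```r
# Hypothetical comparator factory for numeric variables (not part of the package)
cmp_numeric_close <- function(max_diff = 10, threshold = 0.5) {
  function(x, y) {
    if (!missing(y)) {
      # Two arguments: similarity in [0, 1]; 1 means identical values
      pmax(0, 1 - abs(x - y) / max_diff)
    } else {
      # One argument: x holds earlier similarity scores; convert them to
      # TRUE/FALSE and map missing values to FALSE
      !is.na(x) & x > threshold
    }
  }
}

cmp <- cmp_numeric_close(max_diff = 10)
score <- cmp(c(30, 41, NA), c(32, 60, 20))
score       # similarity scores: 0.8, 0, NA
cmp(score)  # binary result: TRUE, FALSE, FALSE
```

Dispatching on `missing(y)` keeps both behaviours in one closure, so the same object can be handed to methods that need scores and to methods that need binary values.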

Value

The functions return a comparison function (see details).

Warning

The functions identical, jaro_winkler, lcs and jaccard are deprecated and will be removed in future versions of the package. Instead use the functions cmp_identical, cmp_jarowinkler, cmp_lcs and cmp_jaccard.

Examples

cmp <- cmp_identical()
x <- cmp(c("john", "mary", "susan", "jack"), 
         c("johan", "mary", "susanna", NA))
# Applying the comparison function to the result of the comparison results 
# in a logical result, with NA's and values of FALSE set to FALSE
cmp(x)

cmp <- cmp_jarowinkler(0.95)
x <- cmp(c("john", "mary", "susan", "jack"), 
         c("johan", "mary", "susanna", NA))
# Applying the comparison function to the result of the comparison results 
# in a logical result, with NA's and values below the threshold FALSE
cmp(x)

Compare pairs on a set of variables common in both data sets

Description

Compare pairs on a set of variables common in both data sets

Usage

## S3 method for class 'cluster_pairs'
compare_pairs(
  pairs,
  on,
  comparators = list(default_comparator),
  default_comparator = cmp_identical(),
  new_name = NULL,
  ...
)

compare_pairs(
  pairs,
  on,
  comparators = list(default_comparator),
  default_comparator = cmp_identical(),
  ...
)

## S3 method for class 'pairs'
compare_pairs(
  pairs,
  on,
  comparators = list(default_comparator),
  default_comparator = cmp_identical(),
  x = attr(pairs, "x"),
  y = attr(pairs, "y"),
  inplace = FALSE,
  ...
)

Arguments

`pairs`	`data.table` with pairs. Should contain the columns `.x` and `.y`.
`on`	character vector of variables that should be compared.
`comparators`	named list of functions with which the variables are compared. Each function should accept two vectors and should return either a vector or a `data.table` with multiple columns.
`default_comparator`	variables for which no comparison function is defined in `comparators` are compared with the function `default_comparator`.
`new_name`	name of new object to assign the pairs to on the cluster nodes.
`...`	Ignored for now
`x`	`data.table` with one half of the pairs.
`y`	`data.table` with the other half of the pairs.
`inplace`	logical indicating whether `pairs` should be modified in place. When pairs is large this can be more efficient.

Details

It is assumed the variables in on are present in both x and y. Variables with the same names are added to pairs. When the comparator returns a data.table multiple columns are added to pairs. The names of these columns are variable pasted together with the names of the data.table returned by comparator (separated by "_").
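
A short usage sketch, assuming the `pairs` method listed under Usage and the example data shipped with the package. The threshold of 0.9 is an arbitrary choice for illustration.

```r
library(reclin2)
data("linkexample1", "linkexample2")

pairs <- pair_blocking(linkexample1, linkexample2, "postcode")

# Names are compared with a Jaro-Winkler similarity; address and sex
# fall back to the default comparator (exact equality)
pairs <- compare_pairs(
  pairs,
  on = c("lastname", "firstname", "address", "sex"),
  comparators = list(
    lastname  = cmp_jarowinkler(0.9),
    firstname = cmp_jarowinkler(0.9)
  )
)

# One comparison column per variable in `on` has been added
names(pairs)
```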

Value

Returns the data.table pairs with one or more columns added in case of compare_pairs.pairs.

In case of compare_pairs.cluster_pairs, compare_pairs.pairs is called on each cluster node and the resulting pairs are assigned to new_name in the environment reclin_env. When new_name is not given (or equal to NULL) the original pairs on the nodes are overwritten.

Compare pairs on given variables

Description

Compare pairs on given variables

Usage

## S3 method for class 'cluster_pairs'
compare_vars(
  pairs,
  variable,
  on_x = variable,
  on_y = on_x,
  comparator = cmp_identical(),
  new_name = NULL,
  ...
)

compare_vars(
  pairs,
  variable,
  on_x = variable,
  on_y = on_x,
  comparator = cmp_identical(),
  ...
)

## S3 method for class 'pairs'
compare_vars(
  pairs,
  variable,
  on_x = variable,
  on_y = on_x,
  comparator = cmp_identical(),
  x = attr(pairs, "x"),
  y = attr(pairs, "y"),
  inplace = FALSE,
  ...
)

Arguments

`pairs`	`data.table` with pairs. Should contain the columns `.x` and `.y`.
`variable`	character vector with the name of the resulting column that is added to pairs.
`on_x`	character vector with the column names from `x` on which to compare.
`on_y`	character vector with the column names from `y` on which to compare.
`comparator`	function with which the variables are compared. When `on_x` and `on_y` have length 1, this function should accept two vectors. Otherwise it will receive two `data.tables`. Function should either return a vector or a `data.table` with multiple columns.
`new_name`	name of new object to assign the pairs to on the cluster nodes.
`...`	Passed on to the comparator function.
`x`	`data.table` with one half of the pairs.
`y`	`data.table` with the other half of the pairs.
`inplace`	logical indicating whether `pairs` should be modified in place. When pairs is large this can be more efficient.

Details

When comparator returns a data.table multiple columns are added to pairs. The names of these columns are variable pasted together with the names of the data.table returned by comparator (separated by "_").
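
The simplest case, a comparator that returns a single vector, adds one column named after `variable`. The sketch below assumes the `pairs` method listed under Usage; the column name `truth` is chosen for illustration.

```r
library(reclin2)
data("linkexample1", "linkexample2")

pairs <- pair_blocking(linkexample1, linkexample2, "postcode")

# Compare the id columns of x and y with the default comparator and
# store the result in a new column called truth
pairs <- compare_vars(pairs, "truth", on_x = "id", on_y = "id")

# truth is TRUE where both records carry the same id
table(pairs$truth)
```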

Value

Returns the data.table pairs with one or more columns added.

Deduplication using equivalence groups

Description

Deduplication using equivalence groups

Usage

deduplicate_equivalence(pairs, variable, selection, x = attr(pairs, "x"))

Arguments

`pairs`	a `pairs` object, such as generated by `pair_blocking`
`variable`	name of the variable to create in `x` that will contain the group labels.
`selection`	a logical variable with the same length as `pairs` has rows, or the name of such a variable in `pairs`. Pairs are only selected when `selection` is `TRUE`. When missing it is assumed all pairs are selected.
`x`	the first data set; when missing `attr(pairs, "x")` is used.

Value

Returns x with a variable containing the group labels. Records with the same group label (should) correspond to the same entity.
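
The idea of group labels can be sketched in base R, assuming that two records belong to the same group whenever they are connected by a chain of selected pairs. `equivalence_labels` is a hypothetical helper written for this illustration; it is not the package's implementation.

```r
# Hypothetical helper: label records so that records connected through
# selected pairs (directly or via other records) share a label
equivalence_labels <- function(n, from, to) {
  label <- seq_len(n)  # start with every record in its own group
  repeat {
    changed <- FALSE
    for (i in seq_along(from)) {
      lowest <- min(label[from[i]], label[to[i]])
      if (label[from[i]] != lowest || label[to[i]] != lowest) {
        # Give both records of the pair the smaller of their two labels
        label[from[i]] <- lowest
        label[to[i]] <- lowest
        changed <- TRUE
      }
    }
    if (!changed) break  # stop once a full pass changes nothing
  }
  label
}

# Selected pairs: 1-2, 2-3 and 4-5 among five records
equivalence_labels(5, from = c(1, 2, 4), to = c(2, 3, 5))
# 1 1 1 4 4: records 1, 2, 3 form one group, records 4 and 5 another
```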

Get a subset of pairs to inspect

Description

Get a subset of pairs to inspect

Usage

get_inspect_pairs(
  pairs,
  variable,
  threshold,
  position = NULL,
  n = 11,
  x = attr(pairs, "x"),
  y = attr(pairs, "y")
)

Arguments

`pairs`	`data.table` with pairs.
`variable`	name of variable to base the selection on; should be a variable with the similarity score of the pairs.
`threshold`	the threshold around which to select pairs. Used when position is not given.
`position`	select pairs around this position (based on order of `variable`), e.g. `position = 1` will select the pairs with the highest similarity score.
`n`	number of pairs to select. Pairs are selected symmetrically around the threshold.
`x`	`data.table` with one half of the pairs.
`y`	`data.table` with the other half of the pairs.

Value

Returns a list with the following elements:

pairs: the selected pairs.
x: the records from x corresponding to the pairs.
y: the records from y corresponding to the pairs.
position: the position of the selected pairs.
index: the index of the pairs in pairs.

Greedy one-to-one matching of pairs

Description

Greedy one-to-one matching of pairs

Usage

greedy(x, y, weight, n = 1L, m = 1L, include_ties = FALSE)

Arguments

`x`	id's of lhs of pairs; converted to integer
`y`	id's of rhs of pairs; converted to integer
`weight`	numeric vector with weight of pair
`n`	an integer. Each element of x can be linked to at most n elements of y.
`m`	an integer. Each element of y can be linked to at most m elements of x.
`include_ties`	when pairs for a given record have an equal weight, should all pairs be included.

Details

Pairs with the highest weight are selected as long as neither the lhs nor the rhs is already selected in a pair with a higher weight. When include_ties is TRUE, all pairs are included when multiple pairs for a given record have an equal weight.
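
The selection rule can be sketched in base R for the default case n = m = 1 without tie handling. `greedy_sketch` is a hypothetical helper written for this illustration and is not the package's implementation.

```r
# Hypothetical helper: greedy one-to-one selection, highest weight first
greedy_sketch <- function(x, y, weight) {
  selected <- logical(length(x))
  used_x <- x[0]  # identifiers of x already matched
  used_y <- y[0]  # identifiers of y already matched
  for (i in order(weight, decreasing = TRUE)) {
    if (!(x[i] %in% used_x) && !(y[i] %in% used_y)) {
      selected[i] <- TRUE
      used_x <- c(used_x, x[i])
      used_y <- c(used_y, y[i])
    }
  }
  selected
}

x <- c(1, 1, 2, 2)
y <- c(1, 2, 1, 2)
w <- c(0.9, 0.8, 0.85, 0.1)
greedy_sketch(x, y, w)
# TRUE FALSE FALSE TRUE: the pair with weight 0.9 is taken first, which
# blocks the pairs with weights 0.85 and 0.8
```

The example also shows the trade-off of a greedy rule: the selected pairs have a total weight of 1.0, although selecting the second and third pair would give 1.65.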

Value

A logical vector with the same length as x.

Use the selected pairs to generate a linked data set

Description

Use the selected pairs to generate a linked data set

Usage

link(
  pairs,
  selection = NULL,
  all = FALSE,
  all_x = all,
  all_y = all,
  x = attr(pairs, "x"),
  y = attr(pairs, "y"),
  suffixes = c(".x", ".y"),
  keep_from_pairs = c(".x", ".y")
)

Arguments

`pairs`	a `pairs` object, such as generated by `pair_blocking`
`selection`	a logical variable with the same length as `pairs` has rows, or the name of such a variable in `pairs`. Pairs are only selected when `selection` is `TRUE`. When missing `attr(pairs, "selection")` is used when available.
`all`	return all records from `x` and `y`; even those that don't match.
`all_x`	return all records from `x`.
`all_y`	return all records from `y`.
`x`	the first data set; when missing `attr(pairs, "x")` is used.
`y`	the second data set; when missing `attr(pairs, "y")` is used.
`suffixes`	a character vector of length 2 specifying the suffixes to be used for making unique the names of columns in the result.
`keep_from_pairs`	character vector with names of variables in `pairs` that should be included in the output.

Details

Uses the selected pairs to link the two data sets to each other. Renames variables that are in both data sets.

Value

Returns a data.table containing records from x and y and pairs. Columns that occur both in x and y gain a suffix indicating from which data set they are.
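
A minimal usage sketch, assuming the signature listed under Usage and the example data shipped with the package. For illustration the selection is derived from the error-free `id` column; in a real linkage it would come from a scoring and selection step.

```r
library(reclin2)
data("linkexample1", "linkexample2")

pairs <- pair_blocking(linkexample1, linkexample2, "postcode")

# Logical vector with one element per pair: TRUE when both records
# carry the same id (illustrative stand-in for a real selection)
sel <- linkexample1$id[pairs$.x] == linkexample2$id[pairs$.y]

# Build the linked data set from the selected pairs only
linked <- link(pairs, selection = sel)
linked
```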

Tiny example dataset for probabilistic linkage

Description

Contains fictional records of 7 persons.

Format

Two data frames with respectively 6 and 5 records and 6 columns.

Details

id: the id of the person; this contains no errors and can be used to validate the linkage.
lastname: the last name of the person; contains errors.
firstname: the first name of the person; contains errors.
address: the address; contains errors.
sex: the sex; contains errors and missing values.
postcode: the postcode; contains no errors.

Force n to m matching on a set of pairs

Description

Force n to m matching on a set of pairs

Usage

match_n_to_m(x, y, w, n = 1, m = 1)

Arguments

`x`	a vector of identifiers for each x in each pair. This vector should have a unique value for each element in x.
`y`	a vector of identifiers for each y in each pair. This vector should have a unique value for each element in y.
`w`	a vector with weights for each pair. The algorithm will try to maximise the total weight of the selected pairs.
`n`	an integer. Each element of x can be linked to at most n elements of y.
`m`	an integer. Each element of y can be linked to at most m elements of x.

Details

The algorithm will try to select pairs in such a way that each element of x is matched to at most n elements of y and that each element of y is matched to at most m elements of x. It tries to select elements in such a way that the total weight w of the selected elements is maximised.
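
What "maximising the total weight" means can be shown on a tiny example by exhaustive search for the one-to-one case (n = m = 1). `best_one_to_one` is a hypothetical helper for illustration only; it enumerates every subset of pairs, which is feasible only for a handful of pairs, and it is not the algorithm the package uses.

```r
# Hypothetical helper: exhaustive search over all subsets of pairs
best_one_to_one <- function(x, y, w) {
  k <- length(x)
  best <- logical(k)
  best_w <- -Inf
  for (mask in 0:(2^k - 1)) {
    # Decode the bits of mask into a logical selection of pairs
    sel <- (mask %/% 2^(0:(k - 1))) %% 2 == 1
    # Skip selections that use an element of x or y more than once
    if (anyDuplicated(x[sel]) > 0 || anyDuplicated(y[sel]) > 0) next
    total <- sum(w[sel])
    if (total > best_w) {
      best_w <- total
      best <- sel
    }
  }
  best
}

x <- c(1, 1, 2, 2)
y <- c(1, 2, 1, 2)
w <- c(0.9, 0.8, 0.85, 0.1)
best_one_to_one(x, y, w)
# FALSE TRUE TRUE FALSE: total weight 1.65, whereas starting with the
# heaviest pair (0.9) would leave only the pair with weight 0.1
```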

Value

A logical vector with the same length as x indicating the selected records.

Examples

d <- data.frame(x=c(1,1,1,2,2,3,3), y=c(1,2,3,4,5,6,7), w=1:7)
# One-to-one matching:
d[match_n_to_m(d$x, d$y, d$w), ]

# N-to-one matching:
d[match_n_to_m(d$x, d$y, d$w, n=999), ]

# One-to-m matching:
d[match_n_to_m(d$x, d$y, d$w, m=999), ]

# N-to-M matching, e.g. select all pairs
d[match_n_to_m(d$x, d$y, d$w, n=999, m=999), ]

Merge two sets of pairs into one

Description

Merge two sets of pairs into one

Usage

## S3 method for class 'cluster_pairs'
merge_pairs(
  pairs1,
  pairs2,
  name = paste(pairs1$name, pairs2$name, sep = "+"),
  ...
)

## S3 method for class 'cluster_pairs'
rbind(...)

merge_pairs(pairs1, pairs2, ...)

## S3 method for class 'pairs'
merge_pairs(pairs1, pairs2, ...)

## S3 method for class 'pairs'
rbind(...)

Arguments

`pairs1`	the first set of pairs
`pairs2`	the second set of pairs
`name`	name of new object to assign the pairs to on the cluster nodes.
`...`	for `rbind`, the `pairs` or `cluster_pairs` objects that need to be combined; for `merge_pairs`, these are passed on to other methods.

Details

The function will give an error when the two sets of pairs have different values for attr(pairs1, "x") and attr(pairs1, "y"). When these attributes are missing the code will not generate an error; the user is then responsible for ensuring that the indices in pairs1 and pairs2 refer to the same datasets.

Value

Returns a pairs or cluster_pairs object where both sets of pairs are combined. Duplicate pairs are removed.

In case of merge_pairs.cluster_pairs, merge_pairs.pairs is called on each cluster node and the resulting pairs are assigned to name in the environment reclin_env.

Generate all possible pairs

Description

Generates all combinations of records from x and y.

Usage

pair(x, y, deduplication = FALSE, add_xy = TRUE)

Arguments

`x`	first `data.frame`
`y`	second `data.frame`. Ignored when `deduplication = TRUE`.
`deduplication`	generate pairs from only `x`. Ignore `y`. This is useful for deduplication of `x`.
`add_xy`	add `x` and `y` as attributes to the returned pairs. This makes calling some subsequent operations that need `x` and `y` (such as `compare_pairs`) easier.

Details

Generating (all) pairs of the records of two data sets, is usually the first step when linking the two data sets.

Value

A data.table with two columns, .x and .y, is returned. Columns .x and .y are row numbers from data.frames x and y respectively.

Examples

data("linkexample1", "linkexample2")
pairs <- pair(linkexample1, linkexample2)

Generate pairs using simple blocking

Description

Generates all combinations of records from x and y where the blocking variables are equal.

Usage

pair_blocking(x, y, on, deduplication = FALSE, add_xy = TRUE)

Arguments

`x`	first `data.frame`
`y`	second `data.frame`. Ignored when `deduplication = TRUE`.
`on`	the variables defining the blocks or strata for which all pairs of `x` and `y` will be generated.
`deduplication`	generate pairs from only `x`. Ignore `y`. This is useful for deduplication of `x`.
`add_xy`	add `x` and `y` as attributes to the returned pairs. This makes calling some subsequent operations that need `x` and `y` (such as `compare_pairs`) easier.

Value

A data.table with two columns, .x and .y, is returned. Columns .x and .y are row numbers from data.frames x and y respectively.
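
The effect of blocking can be sketched in base R as a join on the blocking variable: only records that agree on that variable form a pair. The data and the use of merge are illustrative assumptions; the package may generate the pairs differently.

```r
# Made-up data with a single blocking variable.
x <- data.frame(postcode = c("A", "A", "B"), stringsAsFactors = FALSE)
y <- data.frame(postcode = c("A", "B", "B", "C"), stringsAsFactors = FALSE)

# Remember the original row numbers before joining.
x$.x <- seq_len(nrow(x))
y$.y <- seq_len(nrow(y))

# Join on the blocking variable; only records with an equal postcode pair up.
pairs <- merge(x[, c(".x", "postcode")], y[, c(".y", "postcode")],
  by = "postcode")
pairs <- pairs[order(pairs$.x, pairs$.y), c(".x", ".y")]

# 4 pairs remain instead of the 3 * 4 = 12 pairs without blocking.
print(pairs)
```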

Examples

data("linkexample1", "linkexample2")
pairs <- pair_blocking(linkexample1, linkexample2, "postcode")

Generate pairs with a minimal similarity

Description

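Generates all combinations of records from x and y for which the similarity score on the variables in on is at least minsim. When on_blocking is given, the pairs also have to match on those blocking variables.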

Usage

pair_minsim(
  x,
  y,
  on,
  minsim = 0,
  on_blocking = character(0),
  comparators = list(default_comparator),
  default_comparator = cmp_identical(),
  keep_simsum = TRUE,
  deduplication = FALSE,
  add_xy = TRUE
)

Arguments

`x`	first `data.frame`
`y`	second `data.frame`. Ignored when `deduplication = TRUE`.
`on`	the variables on which the pairs of records from `x` and `y` are compared.
`minsim`	minimal similarity score.
`on_blocking`	variables for which the pairs have to match.
`comparators`	named list of functions with which the variables are compared. Each function should accept two vectors and should return either a vector or a `data.table` with multiple columns.
`default_comparator`	variables for which no comparison function is defined using `comparators` are compared with the function `default_comparator`.
`keep_simsum`	add a variable `simsum` to the result with the similarity score of the pair.
`deduplication`	generate pairs from only `x`. Ignore `y`. This is useful for deduplication of `x`.
`add_xy`	add `x` and `y` as attributes to the returned pairs. This makes calling some subsequent operations that need `x` and `y` (such as `compare_pairs`) easier.

Details

Missing values in the variables on which the pairs are compared count as a similarity of 0.
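
The following base R sketch illustrates the rule above with exact-match comparison: the similarity of a pair is the number of agreeing variables, and a missing value contributes 0. The data, the helper agree and the column name simsum are assumptions made for this illustration only.

```r
# Made-up data; the second record of x has a missing postcode.
x <- data.frame(postcode = c("1234", NA), address = c("A st", "B st"),
  stringsAsFactors = FALSE)
y <- data.frame(postcode = c("1234", "5678"), address = c("Z st", "B st"),
  stringsAsFactors = FALSE)
pairs <- expand.grid(.x = seq_len(nrow(x)), .y = seq_len(nrow(y)))

# Hypothetical helper: 1 when both values are equal, 0 otherwise;
# a comparison involving a missing value counts as 0.
agree <- function(a, b) {
  s <- a == b
  s[is.na(s)] <- FALSE
  as.numeric(s)
}

# Sum the per-variable similarities and keep pairs reaching minsim.
pairs$simsum <- agree(x$postcode[pairs$.x], y$postcode[pairs$.y]) +
  agree(x$address[pairs$.x], y$address[pairs$.y])
minsim <- 1
kept <- pairs[pairs$simsum >= minsim, ]
print(kept)
```

In this toy example the pair (2, 2) is kept because the addresses agree, even though the postcode of the x record is missing.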

Value

A data.table with the columns .x and .y is returned. Columns .x and .y are row numbers from data.frames x and y respectively. When keep_simsum = TRUE the similarity score of each pair is included as an additional column.

Examples

data("linkexample1", "linkexample2")
pairs <- pair_minsim(linkexample1, linkexample2, 
   on = c("postcode", "address"), minsim = 1)
# Either address or postcode has to match to keep a pair

data("linkexample1", "linkexample2")
pairs <- pair_minsim(linkexample1, linkexample2, on_blocking = "postcode",
   on = c("lastname", "firstname", "address"), minsim = 2)
# Postcode has to match; of lastname, firstname and address at least two have
# to match (i.e. one mismatch is allowed).

Calculate weights and probabilities for pairs

Description

Calculate weights and probabilities for pairs

Usage

## S3 method for class 'problink_em'
predict(
  object,
  pairs = newdata,
  newdata = NULL,
  type = c("weights", "mpost", "probs", "all"),
  binary = FALSE,
  add = FALSE,
  comparators,
  inplace = FALSE,
  new_name = NULL,
  ...
)

Arguments

`object`	an object of type `problink_em` as produced by `problink_em`.
`pairs`	an object with pairs for which to calculate weights.
`newdata`	an alternative name for the `pairs` argument. Specify `newdata` or `pairs`.
`type`	a character vector of length one specifying what to calculate. See Value for more information.
`binary`	convert comparison vectors to binary vectors using the comparison function in comparators.
`add`	add the predictions to the original pairs object.
`comparators`	a list of comparison functions (see `compare_pairs`). When missing `attr(pairs, 'comparators')` is used.
`inplace`	logical indicating whether `pairs` should be modified in place. When pairs is large this can be more efficient.
`new_name`	name of new object to assign the pairs to on the cluster nodes (only relevant when pairs is of type `cluster_pairs`).
`...`	unused.

Value

When pairs is of type pairs, returns a data.table with either the .x and .y columns from pairs (when add = FALSE) or all columns of pairs. To these columns are added:

In case of type = "weights" a column weights with the calculated weights.
In case of type = "mpost" a column mpost with the calculated posterior probabilities (probability that a pair is a match given the comparison vector).
In case of type = "probs" the columns mprob and uprob with the m- and u-probabilities and mpost and upost with the posterior m- and u-probabilities.
In case of type = "all" all of the above.
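
To show how these quantities relate, the sketch below computes them for a single binary comparison vector using the standard Fellegi-Sunter formulas. The m- and u-probabilities, the match probability p and the comparison vector are invented; the use of the natural logarithm and the definition of upost as 1 - mpost are assumptions of this sketch and may differ from what predict returns.

```r
# Hypothetical parameters for two comparison variables.
m <- c(lastname = 0.9, firstname = 0.8)  # P(agree | match)
u <- c(lastname = 0.1, firstname = 0.2)  # P(agree | non-match)
p <- 0.1                                 # prior probability of a match

# Binary comparison vector: lastname agrees, firstname does not.
g <- c(lastname = 1, firstname = 0)

# Probability of observing g for matches and for non-matches.
mprob <- prod(m^g * (1 - m)^(1 - g))
uprob <- prod(u^g * (1 - u)^(1 - g))

# Log-likelihood ratio weight (natural log assumed here).
weight <- log(mprob / uprob)

# Posterior probabilities via Bayes' rule.
mpost <- p * mprob / (p * mprob + (1 - p) * uprob)
upost <- 1 - mpost
```

With these numbers mprob is 0.18 and uprob is 0.08, so the weight is positive, yet the posterior match probability is only 0.2 because matches are rare a priori.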

Calculate EM-estimates of m- and u-probabilities

Description

Calculate EM-estimates of m- and u-probabilities

Usage

problink_em(
  formula,
  data,
  patterns,
  mprobs0 = list(0.95),
  uprobs0 = list(0.02),
  p0 = 0.05,
  tol = 1e-05,
  mprob_max = 0.999,
  uprob_min = 1e-04
)

Arguments

`formula`	a formula object with the variables for which to calculate the m- and u-probabilities. Should be of the form `~ var1 + var2`.
`data`	data set with pairs on which to estimate the model. Alternatively one can use the `patterns` argument.
`patterns`	table of patterns (as output by `tabulate_patterns`).
`mprobs0`, `uprobs0`	initial values of the m- and u-probabilities. These should be lists with numeric values. The names of the elements in the list should correspond to the names in `by_x` in `compare_pairs`.
`p0`	the initial estimate of the probability that a pair is a match.
`tol`	when the change in the m and u-probabilities is smaller than `tol` the algorithm is stopped.
`mprob_max`	maximum values of the estimated m-probabilities. Values equal to one can lead to numerical instabilities.
`uprob_min`	minimum values of the estimated u-probabilities. Values equal to zero can lead to numerical instabilities.

Value

Returns an object of type problink_em. This is a list containing the estimated mprobs, uprobs and overall linkage probability p. It also contains the table of comparison patterns.
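
For readers unfamiliar with the EM algorithm in this setting, the following base R sketch runs a textbook EM iteration on a small table of binary comparison patterns. The pattern counts are invented, the starting values and bounds simply mirror the defaults listed under Usage, and the code is a simplified illustration rather than the implementation used by problink_em.

```r
# Made-up table of comparison patterns (1 = agree, 0 = disagree) with counts.
patterns <- data.frame(v1 = c(0, 1, 0, 1), v2 = c(0, 0, 1, 1),
  n = c(80, 6, 6, 8))
g <- as.matrix(patterns[, c("v1", "v2")])

# Starting values, analogous to mprobs0, uprobs0 and p0.
m <- c(v1 = 0.95, v2 = 0.95)
u <- c(v1 = 0.02, v2 = 0.02)
p <- 0.05

for (iter in 1:200) {
  # E-step: posterior probability that each pattern stems from a match.
  pm <- apply(g, 1, function(r) prod(m^r * (1 - m)^(1 - r)))
  pu <- apply(g, 1, function(r) prod(u^r * (1 - u)^(1 - r)))
  post <- p * pm / (p * pm + (1 - p) * pu)

  # M-step: weighted agreement frequencies among matches and non-matches.
  wm <- patterns$n * post
  wu <- patterns$n * (1 - post)
  m_new <- colSums(g * wm) / sum(wm)
  u_new <- colSums(g * wu) / sum(wu)

  # Keep estimates away from 1 and 0 (analogous to mprob_max, uprob_min).
  m_new <- pmin(m_new, 0.999)
  u_new <- pmax(u_new, 1e-4)
  p <- sum(wm) / sum(patterns$n)

  # Stop when the m- and u-probabilities change less than the tolerance.
  change <- max(abs(c(m_new - m, u_new - u)))
  m <- m_new
  u <- u_new
  if (change < 1e-5) break
}
print(list(mprobs = m, uprobs = u, p = p))
```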

References

Fellegi, I. and A. Sunter (1969). "A Theory for Record Linkage", Journal of the American Statistical Association. 64 (328): pp. 1183-1210. doi:10.2307/2286061.

Herzog, T.N., F.J. Scheuren and W.E. Winkler (2007). Data Quality and Record Linkage Techniques, Springer.

Examples

data("linkexample1", "linkexample2")
pairs <- pair_blocking(linkexample1, linkexample2, "postcode")
pairs <- compare_pairs(pairs, c("lastname", "firstname", "address", "sex"))
model <- problink_em(~ lastname + firstname + address + sex, data = pairs)
summary(model)

Score pairs based on a number of comparison vectors

Description

Score pairs based on a number of comparison vectors

Usage

## S3 method for class 'cluster_pairs'
score_simple(
  pairs,
  variable,
  on,
  w1 = 1,
  w0 = 0,
  wna = 0,
  new_name = NULL,
  ...
)

score_simple(pairs, variable, on, w1 = 1, w0 = 0, wna = 0, ...)

## S3 method for class 'pairs'
score_simple(
  pairs,
  variable,
  on,
  w1 = 1,
  w0 = 0,
  wna = 0,
  inplace = FALSE,
  ...
)

Arguments

`pairs`	a `pairs` object, such as generated by `pair_blocking`
`variable`	the name of the new variable to create in pairs. This will contain the score of each pair.
`on`	character vector of variables on which the score should be based.
`w1`	a vector or list with weights for agreement for each of the variables. It can be a numeric vector of length 1, in which case the same weight is used for all variables; a numeric vector of length equal to the length of `on`, in which case the weights correspond one-to-one to the variables in `on`; a named numeric vector where the names correspond to those in `on`, where omitted elements are assigned a weight of 1; or a named list with numeric values. See details for more information.
`w0`	a vector or list with weights for non-agreement for each of the variables. See details for more information. For the format see `w1`.
`wna`	a vector or list with weights for missing values (`NA`) for each of the variables. See details for more information. For the format see `w1`.
`new_name`	name of new object to assign the pairs to on the cluster nodes.
`...`	ignored
`inplace`	logical indicating whether `pairs` should be modified in place. When pairs is large this can be more efficient.

Details

The individual contribution of a variable x to the total score is given by x * w1 + (1-x) * w0 in case of non-NA values and wna in case of NA. This assumes that a value of 1 corresponds to complete agreement and a value of 0 to complete non-agreement. In case of complete agreement a variable contributes w1 to the total score and in case of complete non-agreement it contributes w0 to the total score.
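
The formula can be checked by hand with a few lines of base R. The helper contribution and the comparison values below are made up for the illustration; the weights are the ones used in the examples further on.

```r
# Hypothetical helper implementing x * w1 + (1 - x) * w0, with wna for NA.
contribution <- function(x, w1 = 1, w0 = 0, wna = 0) {
  ifelse(is.na(x), wna, x * w1 + (1 - x) * w0)
}

# Made-up comparison results for three pairs (1 = agree, 0 = disagree).
comparison <- data.frame(firstname = c(1, 0, NA), lastname = c(1, 1, 0.5))

# Total score is the sum of the contributions of the variables.
score <- contribution(comparison$firstname, w1 = 2, w0 = -1) +
  contribution(comparison$lastname, w1 = 3, w0 = -0.5)
print(score)
```

The third pair shows both special cases: the missing firstname contributes wna = 0, and the partial agreement of 0.5 on lastname contributes 0.5 * 3 + 0.5 * -0.5 = 1.25.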

Value

Returns the data.table pairs with the column variable added in case of score_simple.pairs.

In case of score_simple.cluster_pairs, score_simple.pairs is called on each cluster node and the resulting pairs are assigned to new_name in the environment reclin_env. When new_name is not given (or equal to NULL) the original pairs on the nodes are overwritten.

Examples

data("linkexample1", "linkexample2")
pairs <- pair_blocking(linkexample1, linkexample2, "postcode")
compare_pairs(pairs, on = c("firstname", "lastname", "sex"), inplace = TRUE)

score_simple(pairs, "score", on = c("firstname", "lastname", "sex"))

# Change the default weights
score_simple(pairs, "score", on = c("firstname", "lastname", "sex"), 
  w1 = 2, w0 = -1, wna = NA)

# Use a named vector; omitted elements from w1 get a weight of 1; those from
# w0 and wna a weight of 0.
score_simple(pairs, "score", on = c("firstname", "lastname", "sex"), 
  w1 = c("firstname" = 2, "lastname" = 3), 
  w0 = c("firstname" = -1, "lastname" = -0.5))

# Use a named list; omitted elements from w1 get a weight of 1; those from
# w0 and wna a weight of 0.
score_simple(pairs, "score", on = c("firstname", "lastname", "sex"), 
  w1 = list("firstname" = 2, "lastname" = 3), 
  w0 = list("firstname" = -1, "lastname" = -0.5))

Select matching pairs enforcing one-to-one linkage

Description

Select matching pairs enforcing one-to-one linkage

Usage

## S3 method for class 'cluster_pairs'
select_greedy(
  pairs,
  variable,
  score,
  threshold,
  preselect = NULL,
  id_x = NULL,
  id_y = NULL,
  ...
)

## S3 method for class 'cluster_pairs'
select_n_to_m(
  pairs,
  variable,
  score,
  threshold,
  preselect = NULL,
  id_x = NULL,
  id_y = NULL,
  ...
)

select_greedy(
  pairs,
  variable,
  score,
  threshold,
  preselect = NULL,
  id_x = NULL,
  id_y = NULL,
  ...
)

## S3 method for class 'pairs'
select_greedy(
  pairs,
  variable,
  score,
  threshold,
  preselect = NULL,
  id_x = NULL,
  id_y = NULL,
  x = attr(pairs, "x"),
  y = attr(pairs, "y"),
  inplace = FALSE,
  include_ties = FALSE,
  n = 1L,
  m = 1L,
  ...
)

select_n_to_m(
  pairs,
  variable,
  score,
  threshold,
  preselect = NULL,
  id_x = NULL,
  id_y = NULL,
  ...
)

## S3 method for class 'pairs'
select_n_to_m(
  pairs,
  variable,
  score,
  threshold,
  preselect = NULL,
  id_x = NULL,
  id_y = NULL,
  x = attr(pairs, "x"),
  y = attr(pairs, "y"),
  inplace = FALSE,
  ...
)

Arguments

`pairs`	a `pairs` object, such as generated by `pair_blocking`
`variable`	the name of the new variable to create in pairs. This will be a logical variable with a value of `TRUE` for the selected pairs.
`score`	name of the score/weight variable of the pairs. When not given and `attr(pairs, "score")` is defined, that is used.
`threshold`	the threshold to apply. Pairs with a score above the threshold are selected.
`preselect`	a logical variable with the same length as `pairs` has rows, or the name of such a variable in `pairs`. Pairs are only selected when `preselect` is `TRUE`. This interacts with `threshold` (pairs have to be selected with both conditions).
`id_x`	an integer vector with the same length as the number of rows in `pairs`, or the name of a column in `x`. This vector should identify unique objects in `x`. When not specified it is assumed that each element in `x` is unique.
`id_y`	an integer vector with the same length as the number of rows in `pairs`, or the name of a column in `y`. This vector should identify unique objects in `y`. When not specified it is assumed that each element in `y` is unique.
`...`	Used to pass additional arguments to methods
`x`	`data.table` with one half of the pairs.
`y`	`data.table` with the other half of the pairs.
`inplace`	logical indicating whether `pairs` should be modified in place. When pairs is large this can be more efficient.
`include_ties`	when pairs for a given record have an equal weight, whether all of these pairs should be included.
`n`	an integer. Each element of x can be linked to at most n elements of y.
`m`	an integer. Each element of y can be linked to at most m elements of x.

Details

Both methods force one-to-one matching. select_greedy uses a greedy algorithm that selects the first pair with the highest weight. select_n_to_m tries to optimise the total weight of all of the selected pairs. In general this will result in a better selection. However, select_n_to_m uses much more memory and is much slower and, therefore, can only be used when the number of possible pairs is not too large.

Note that when include_ties = TRUE the same record can still be selected more than once. In that case the pairs will have an equal weight.
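
The difference between the two strategies can be seen in a tiny, made-up example. The base R sketch below implements a plain greedy selection (highest score first, skipping records that are already linked) and compares its total score with that of an alternative one-to-one selection. It is a simplified illustration of the idea and not the code used by select_greedy or select_n_to_m.

```r
# Made-up pairs with scores; all four combinations of two x and two y records.
pairs <- data.frame(.x = c(1, 1, 2, 2), .y = c(1, 2, 1, 2),
  score = c(0.9, 0.8, 0.85, 0.1))
threshold <- 0.5

# Greedy: walk through the pairs from high to low score and select a pair
# when neither of its records has been used yet.
greedy <- logical(nrow(pairs))
used_x <- numeric(0)
used_y <- numeric(0)
for (i in order(pairs$score, decreasing = TRUE)) {
  if (pairs$score[i] <= threshold) break
  if (!(pairs$.x[i] %in% used_x) && !(pairs$.y[i] %in% used_y)) {
    greedy[i] <- TRUE
    used_x <- c(used_x, pairs$.x[i])
    used_y <- c(used_y, pairs$.y[i])
  }
}
greedy_total <- sum(pairs$score[greedy])

# An alternative one-to-one selection: (1, 2) and (2, 1).
alternative <- c(FALSE, TRUE, TRUE, FALSE)
alternative_total <- sum(pairs$score[alternative])

print(c(greedy = greedy_total, alternative = alternative_total))
```

Greedy selection picks the pair (1, 1) first, which blocks both other candidates above the threshold, giving a total of 0.9. The alternative selection reaches 1.65, which is the kind of improvement an optimising method aims for.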

Value

Returns the pairs with the variable given by variable added. This is a logical variable indicating which pairs are selected as matches.

Examples

data("linkexample1", "linkexample2")
pairs <- pair_blocking(linkexample1, linkexample2, "postcode")
pairs <- compare_pairs(pairs, c("lastname", "firstname", "address", "sex"))
model <- problink_em(~ lastname + firstname + address + sex, data = pairs)
pairs <- predict(model, pairs, type = "mpost", add = TRUE, binary = TRUE)

# Select pairs with a mpost > 0.5 and force one-to-one linkage
pairs <- select_n_to_m(pairs, "ntom", "mpost", 0.5)
pairs <- select_greedy(pairs, "greedy", "mpost", 0.5)
table(pairs$ntom, pairs$greedy)

# The same example as above using a cluster;
library(parallel)
cl <- makeCluster(2)

pairs <- cluster_pair_blocking(cl, linkexample1, linkexample2, "postcode")
compare_pairs(pairs, c("lastname", "firstname", "address", "sex"))
model <- problink_em(~ lastname + firstname + address + sex, data = pairs)
predict(model, pairs, type = "mpost", add = TRUE, binary = TRUE)
# Select pairs with a mpost > 0.5 and force one-to-one linkage
# select_n_to_m and select_greedy only work on pairs that are local; 
# therefore we first collect the pairs
select_threshold(pairs, "selected", "mpost", 0.5)
local_pairs <- cluster_collect(pairs, "selected")
local_pairs <- select_n_to_m(local_pairs, "ntom", "mpost", 0.5)
local_pairs <- select_greedy(local_pairs, "greedy", "mpost", 0.5)
table(local_pairs$ntom, local_pairs$greedy)

stopCluster(cl)

Select matching pairs with a score above or equal to a threshold

Description

Select matching pairs with a score above or equal to a threshold

Usage

## S3 method for class 'cluster_pairs'
select_threshold(pairs, variable, score, threshold, new_name = NULL, ...)

select_threshold(pairs, variable, score, threshold, ...)

## S3 method for class 'pairs'
select_threshold(pairs, variable, score, threshold, inplace = FALSE, ...)

Arguments

`pairs`	a `pairs` object, such as generated by `pair_blocking`
`variable`	the name of the new variable to create in pairs. This will be a logical variable with a value of `TRUE` for the selected pairs.
`score`	name of the score/weight variable of the pairs. When not given and `attr(pairs, "score")` is defined, that is used.
`threshold`	the threshold to apply. Pairs with a score above or equal to the threshold are selected.
`new_name`	name of new object to assign the pairs to on the cluster nodes.
`...`	ignored
`inplace`	logical indicating whether `pairs` should be modified in place. When pairs is large this can be more efficient.

Value

Returns the pairs with the variable given by variable added. This is a logical variable indicating which pairs are selected as matches.

Examples

data("linkexample1", "linkexample2")
pairs <- pair_blocking(linkexample1, linkexample2, "postcode")
pairs <- compare_pairs(pairs, c("lastname", "firstname", "address", "sex"))
model <- problink_em(~ lastname + firstname + address + sex, data = pairs)
pairs <- predict(model, pairs, type = "mpost", add = TRUE, binary = TRUE)
# Select pairs with a mpost > 0.5
select_threshold(pairs, "selected", "mpost", 0.5, inplace = TRUE)

# Example using cluster;
# In general the syntax is exactly the same except for the first call to
# cluster_pair. Note that in general `inplace = TRUE` is implied when
# working with a cluster; therefore the assignment back to pairs can be
# omitted (also not a problem if it is not).
library(parallel)
data("linkexample1", "linkexample2")
cl <- makeCluster(2)

pairs <- cluster_pair(cl, linkexample1, linkexample2)
compare_pairs(pairs, c("lastname", "firstname", "address", "sex"))
model <- problink_em(~ lastname + firstname + address + sex, data = pairs)
predict(model, pairs, type = "mpost", add = TRUE, binary = TRUE)
# Select pairs with a mpost > 0.5
# Unlike the regular pairs: inplace = TRUE is implied here
select_threshold(pairs, "selected", "mpost", 0.5)
stopCluster(cl)

Deselect pairs that are linked to multiple records

Description

Deselect pairs that are linked to multiple records

Usage

## S3 method for class 'cluster_pairs'
select_unique(
  pairs,
  variable,
  preselect = NULL,
  n = 1,
  m = 1,
  id_x = NULL,
  id_y = NULL,
  ...
)

select_unique(
  pairs,
  variable,
  preselect = NULL,
  n = 1,
  m = 1,
  id_x = NULL,
  id_y = NULL,
  ...
)

## S3 method for class 'pairs'
select_unique(
  pairs,
  variable,
  preselect = NULL,
  n = 1,
  m = 1,
  id_x = NULL,
  id_y = NULL,
  x = attr(pairs, "x"),
  y = attr(pairs, "y"),
  inplace = FALSE,
  ...
)

Arguments

`pairs`	a `pairs` object, such as generated by `pair_blocking`
`variable`	the name of the new variable to create in pairs. This will be a logical variable with a value of `TRUE` for the selected pairs.
`preselect`	a logical variable with the same length as `pairs` has rows, or the name of such a variable in `pairs`. Pairs are only selected when `preselect` is `TRUE`.
`n`	do not select pairs with a y-record that is linked to more than `n` records.
`m`	do not select pairs with an x-record that is linked to more than `m` records.
`id_x`	an integer vector with the same length as the number of rows in `pairs`, or the name of a column in `x`. This vector should identify unique objects in `x`. When not specified it is assumed that each element in `x` is unique.
`id_y`	an integer vector with the same length as the number of rows in `pairs`, or the name of a column in `y`. This vector should identify unique objects in `y`. When not specified it is assumed that each element in `y` is unique.
`...`	Used to pass additional arguments to methods
`x`	`data.table` with one half of the pairs.
`y`	`data.table` with the other half of the pairs.
`inplace`	logical indicating whether `pairs` should be modified in place. When pairs is large this can be more efficient.

Details

This function can be used to remove pairs for which there is ambiguity. For example, when a record from x is linked to multiple records from y and we know that there are no duplicate records in y (records that belong to the same object), then we know that at least one of the two links is incorrect, but we cannot decide which of the two. In that case we may want to link neither of the two records. Running select_unique with m == 1 will remove both pairs.

In case one wants to select one of the records randomly: select_greedy will select the pair with the highest weight and in case of an equal weight the first. Adding a random component to the weights will ensure a random selection.
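
A minimal base R sketch of this deselection logic is given below. It counts, per x-record and per y-record, the number of preselected links and keeps only pairs for which neither count exceeds the limit. The data are invented, and the way n and m are applied follows the argument descriptions above; it is an illustration, not the package's implementation.

```r
# Made-up pairs; `select` plays the role of the preselect variable.
pairs <- data.frame(.x = c(1, 1, 2, 3), .y = c(1, 2, 3, 3),
  select = c(TRUE, TRUE, TRUE, FALSE))
n <- 1
m <- 1

# Number of preselected links for the x-record and y-record of each pair.
links_per_x <- ave(as.integer(pairs$select), pairs$.x, FUN = sum)
links_per_y <- ave(as.integer(pairs$select), pairs$.y, FUN = sum)

# Keep a preselected pair only when its records are not linked too often.
pairs$select_unique <- pairs$select & links_per_x <= m & links_per_y <= n
print(pairs)
```

Record 1 of x is linked to two records of y, so both of its pairs are deselected; the pair (2, 3) remains selected.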

Value

Returns the pairs with the variable given by variable added. This is a logical variable indicating which pairs are selected as matches.

Examples

data("linkexample1", "linkexample2")
pairs <- pair_blocking(linkexample1, linkexample2, "postcode")
compare_pairs(pairs, on = c("lastname", "firstname", "address", "sex"),
  default_comparator = jaro_winkler(0.9), inplace = TRUE)
score_simple(pairs, "score", 
  on = c("lastname", "firstname", "address", "sex"),
  w1 = list(lastname = 2), inplace = TRUE)
select_threshold(pairs, variable = "select", 
  score = "score", threshold = 4.0, inplace =  TRUE)
select_unique(pairs, variable = "select_unique", preselect = "select")

Summarise the results from `problink_em`

Description

Summarise the results from problink_em

Usage

## S3 method for class 'problink_em'
summary(object, ...)

Arguments

`object`	the `problink_em` object.
`...`	ignored.

Value

Returns the original object with a data.frame with the patterns and corresponding m-, u-probabilities and weights added.

Create a table of comparison patterns

Description

Create a table of comparison patterns

Usage

## S3 method for class 'cluster_pairs'
tabulate_patterns(pairs, on, comparators, complete = TRUE, ...)

tabulate_patterns(pairs, on, comparators, complete = TRUE, ...)

## S3 method for class 'pairs'
tabulate_patterns(pairs, on, comparators, complete = TRUE, ...)

Arguments

`pairs`	a `pairs` object, such as generated by `pair_blocking`
`on`	variables from `pairs` defining the comparison patterns. When missing `names(comparators)` is used.
`comparators`	a list with comparison functions for each of the columns. When missing or `NULL`, the function looks for columns in `pairs` with a `comparator` attribute.
`complete`	add patterns that do not occur in the dataset to the result (with `n = 0`).
`...`	passed on to other methods.

Details

Since comparison vectors can contain continuous numbers (usually between 0 and 1), this could result in a very large number of possible comparison vectors. Therefore, the comparison vectors are passed on to the comparators in order to threshold them. This usually results in values 0 or 1. Missing values are usually coded as 0. However, this all depends on the comparison functions used. For more information see the documentation on the comparison functions.
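
The thresholding and counting described above can be sketched in base R as follows. The comparison values and the threshold of 0.9 are invented for the illustration; in the package the thresholding is done by the comparison functions themselves.

```r
# Made-up comparison vectors with continuous similarities and a missing value.
comparison <- data.frame(lastname = c(1, 0.97, 0.4, NA, 1),
  postcode = c(1, 1, 0, 1, 1))

# Threshold to 0/1 (assumed threshold 0.9); missing values are coded as 0.
binary <- as.data.frame(lapply(comparison, function(v) {
  as.integer(!is.na(v) & v > 0.9)
}))

# Count how often each pattern occurs.
patterns <- aggregate(n ~ lastname + postcode, data = cbind(binary, n = 1L),
  FUN = sum)
print(patterns)
```

Only patterns that occur in the data appear in this sketch; the pattern with lastname = 1 and postcode = 0 is absent, which is the kind of pattern that complete = TRUE adds with n = 0.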

Value

Returns a data.frame with all unique comparison patterns that exist in pairs, with a column n added with the number of times each pattern occurs.

Examples

data("linkexample1", "linkexample2")
pairs <- pair_blocking(linkexample1, linkexample2, "postcode")
pairs <- compare_pairs(pairs, c("lastname", "firstname", "address", "sex"))
tabulate_patterns(pairs)

Spelling variations of a set of town names

Description

Contains spelling variations found in various files of a set of town/village names. Names were selected that contain 'rdam' or 'rdm'. The correct/official names are also given. This data set can be used as an example data set for deduplication.

Format

A data frame with 584 records and two columns.

Details

`name`	the name of the town/village as found in the files
`official_name`	the official/correct name
