Title: | Record Linkage Toolkit |
---|---|
Description: | Functions to assist in performing probabilistic record linkage and deduplication: generating pairs, comparing records, em-algorithm for estimating m- and u-probabilities (I. Fellegi & A. Sunter (1969) <doi:10.1080/01621459.1969.10501049>, T.N. Herzog, F.J. Scheuren, & W.E. Winkler (2007), "Data Quality and Record Linkage Techniques", ISBN:978-0-387-69502-0), forcing one-to-one matching. Can also be used for pre- and post-processing for machine learning methods for record linkage. Focus is on memory, CPU performance and flexibility. |
Authors: | Jan van der Laan [aut, cre] |
Maintainer: | Jan van der Laan <[email protected]> |
License: | GPL-3 |
Version: | 0.5.0 |
Built: | 2024-11-05 05:04:06 UTC |
Source: | https://github.com/djvanderlaan/reclin2 |
Add a variable from one of the data sets to pairs
add_from_x(pairs, variable, new_variable = variable, ...)
add_from_y(pairs, variable, new_variable = variable, ...)
pairs |
|
variable |
name of the variable that should be added |
new_variable |
optional variable name of the new variable in
|
... |
other parameters are passed on to |
Returns the pairs with the column added. When inplace = TRUE, pairs is returned invisibly and the original pairs is modified.
Call a function on each of the worker nodes and pass it the pairs
cluster_call(pairs, fun, ...)
pairs |
an object or type |
fun |
a function to call on each of the worker nodes. See details on the arguments of this function. |
... |
additional arguments are passed on to |
The function must accept the following as its first three arguments:
the data.table with the pairs of the worker node;
a data.table with the portion of x present on the worker node;
a data.table with y.
The function returns a list containing, for each worker, the result of the function call. When the functions return NULL, the result is returned invisibly. Because the result is returned to the main node, make sure you don't accidentally return all pairs. If you don't want to return anything, end your function with NULL.
# Generate some pairs
library(reclin2)
library(parallel)
data("linkexample1", "linkexample2")
cl <- makeCluster(2)
pairs <- cluster_pair(cl, linkexample1, linkexample2)
compare_pairs(pairs, c("lastname", "firstname", "address", "sex"))

# Add a new column to pairs
cluster_call(pairs, function(pairs, ...) {
  pairs[, name := firstname & lastname]
  # we don't want to return the pairs; so make sure to return something
  # else
  NULL
})

# Get the number of pairs on each node
lengths <- cluster_call(pairs, function(pairs, ...) {
  nrow(pairs)
})
lengths <- unlist(lengths)
lengths

# Cleanup
stopCluster(cl)
Collect pairs from cluster nodes
cluster_collect(pairs, select = NULL, clear = FALSE)
pairs |
an object or type |
select |
the name of a logical column that is used to select the pairs that will be collected |
clear |
remove the pairs from the cluster nodes |
Returns an object of type pairs, which is a data.table. This object can be used as a regular (non-cluster) set of pairs.
library(reclin2)
library(parallel)
data("linkexample1", "linkexample2")
cl <- makeCluster(2)
pairs <- cluster_pair(cl, linkexample1, linkexample2)
local_pairs <- cluster_collect(pairs, clear = FALSE)

compare_pairs(pairs, c("lastname", "firstname", "address", "sex"))
model <- problink_em(~ lastname + firstname + address + sex, data = pairs)
predict(model, pairs, type = "mpost", add = TRUE, binary = TRUE)
# Select pairs with a mpost > 0.5
select_threshold(pairs, "selected", "mpost", 0.5)
# Collect the selected pairs
local_pairs <- cluster_collect(pairs, "selected")

stopCluster(cl)
Call a function on each of the worker nodes to modify the pairs on the node
cluster_modify_pairs(pairs, fun, ..., new_name = NULL)
pairs |
an object or type |
fun |
a function to call on each of the worker nodes. See details on the arguments of this function. |
... |
additional arguments are passed on to |
new_name |
name of new object to assign the pairs to on the cluster nodes. |
The function must accept the following as its first three arguments:
the data.table with the pairs of the worker node;
a data.table with the portion of x present on the worker node;
a data.table with y.
The function should either return a data.table with the new pairs, or NULL. When a data.table is returned, these values replace the pairs when new_name is missing, or create new pairs in the environment under the name new_name. When the function returns NULL, it is assumed that the function modified the pairs by reference (e.g. using pairs[, new_var := new_val]). Note that this also means that new_name is ignored.
Returns a cluster_pairs object. When new_name is not given, it returns the input pairs invisibly. Otherwise it returns a new cluster_pairs object.
# Generate some pairs
library(reclin2)
library(parallel)
data("linkexample1", "linkexample2")
cl <- makeCluster(2)
pairs <- cluster_pair(cl, linkexample1, linkexample2)
compare_pairs(pairs, c("lastname", "firstname", "address", "sex"))

# Create a new set of pairs containing a random sample of the original
# pairs.
sample <- cluster_modify_pairs(pairs, new_name = "sample", function(pairs, ...) {
  sel <- sample(nrow(pairs), round(nrow(pairs)*0.1))
  pairs[sel, ]
})

# Cleanup
stopCluster(cl)
Generates all combinations of records from x and y.
cluster_pair(cluster, x, y, deduplication = FALSE, name = "default")
cluster |
a cluster object as created by |
x |
first |
y |
second |
deduplication |
generate pairs from only |
name |
the name of the resulting object to create locally on the different R processes. |
Generating (all) pairs of the records of two data sets is usually the first step when linking the two data sets.
x is split into length(cluster) parts, which are distributed over the worker nodes. y is copied to each of the nodes. On the nodes, pair is then called. The pairs are stored in the global object reclin_env on the nodes in the variable name. The pairs can then be further processed using functions such as compare_pairs and tabulate_patterns. The function cluster_collect collects the pairs from each of the nodes.
An object of type cluster_pairs, which is a list containing the cluster and the name of the pairs object on the cluster nodes. For the pairs objects created on the nodes, see the documentation of pair.
cluster_pair_blocking and cluster_pair_minsim are other methods to generate pairs.
library(reclin2)
library(parallel)
data("linkexample1", "linkexample2")
cl <- makeCluster(2)
pairs <- cluster_pair(cl, linkexample1, linkexample2)
stopCluster(cl)
Generates all combinations of records from x and y where the blocking variables are equal.
cluster_pair_blocking(
  cluster,
  x,
  y,
  on,
  deduplication = FALSE,
  name = "default"
)
cluster |
a cluster object as created by |
x |
first |
y |
second |
on |
the variables defining the blocks or strata for which
all pairs of |
deduplication |
generate pairs from only |
name |
the name of the resulting object to create locally on the different R processes. |
Generating (all) pairs of the records of two data sets is usually the first step when linking the two data sets. However, this often results in too large a number of pairs. Therefore, blocking is usually applied.
x is split into length(cluster) parts, which are distributed over the worker nodes. y is copied to each of the nodes. On the nodes, pair_blocking is then called. The pairs are stored in the global object reclin_env on the nodes in the variable name. The pairs can then be further processed using functions such as compare_pairs and tabulate_patterns. The function cluster_collect collects the pairs from each of the nodes.
An object of type cluster_pairs, which is a list containing the cluster and the name of the pairs object on the cluster nodes. For the pairs objects created on the nodes, see the documentation of pair.
cluster_pair and cluster_pair_minsim are other methods to generate pairs.
library(reclin2)
library(parallel)
data("linkexample1", "linkexample2")
cl <- makeCluster(2)
pairs <- cluster_pair_blocking(cl, linkexample1, linkexample2, "postcode")
stopCluster(cl)
Generates all combinations of records from x and y that have a similarity score of at least minsim.
cluster_pair_minsim(
  cluster,
  x,
  y,
  on,
  minsim = 0,
  on_blocking = character(0),
  comparators = list(default_comparator),
  default_comparator = cmp_identical(),
  keep_simsum = TRUE,
  deduplication = FALSE,
  name = "default"
)
cluster |
a cluster object as created by |
x |
first |
y |
second |
on |
the variables defining the blocks or strata for which
all pairs of |
minsim |
minimal similarity score. |
on_blocking |
variables for which the pairs have to match. |
comparators |
named list of functions with which the variables are compared.
This function should accept two vectors. Function should either return a vector
or a |
default_comparator |
variables for which no comparison function is defined using
|
keep_simsum |
add a variable |
deduplication |
generate pairs from only |
name |
the name of the resulting object to create locally on the different R processes. |
Generating (all) pairs of the records of two data sets is usually the first step when linking the two data sets. However, this often results in too large a number of pairs. pair_minsim will only keep pairs with a similarity score equal to or larger than minsim. The similarity score is calculated by summing the results of the comparators for all variables of on.
x is split into length(cluster) parts, which are distributed over the worker nodes. y is copied to each of the nodes. On the nodes, pair_minsim is then called. The pairs are stored in the global object reclin_env on the nodes in the variable name. The pairs can then be further processed using functions such as compare_pairs and tabulate_patterns. The function cluster_collect collects the pairs from each of the nodes.
An object of type cluster_pairs, which is a list containing the cluster and the name of the pairs object on the cluster nodes. For the pairs objects created on the nodes, see the documentation of pair.
cluster_pair and cluster_pair_blocking are other methods to generate pairs.
library(reclin2)
library(parallel)
data("linkexample1", "linkexample2")
cl <- makeCluster(2)
# Either address or postcode has to match to keep a pair
pairs <- cluster_pair_minsim(cl, linkexample1, linkexample2,
  on = c("postcode", "address"), minsim = 1)
stopCluster(cl)
Comparison functions
cmp_identical()
cmp_jarowinkler(threshold = 0.95)
jaro_winkler(threshold = 0.8)
cmp_lcs(threshold = 0.8)
lcs(threshold = 0.8)
cmp_jaccard(threshold = 0.8)
jaccard(threshold = 0.8)
threshold |
threshold to use for the Jaro-Winkler string distance when creating a binary result. |
A comparison function should accept two arguments, both vectors. When the function is called with both arguments, it should compare the elements in the first vector to those in the second. When called in this way, both vectors have the same length. What the function should return depends on the methods used to score the pairs. Usually the comparison functions return a similarity score, with a value of 0 indicating complete difference and a value > 0 indicating similarity (often a value of 1 will indicate perfect similarity).
Some methods, such as problink_em, can handle similarity scores, but also need binary values (0/FALSE = complete dissimilarity; 1/TRUE = complete similarity). To allow for this, the comparison function is also called with one argument. When the comparison function is called with one argument, it is passed the result of a previous comparison. The function should translate that result to a binary (TRUE/FALSE or 1/0) result. The result should not contain missing values.
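The two calling conventions described above can be sketched with a custom comparator. This is a minimal illustration in base R, not part of the package: the factory name `cmp_abs_diff` and its `max_diff` argument are hypothetical, and the scoring rule (a linear fall-off for numeric values) is just one possible choice.

```r
# Hypothetical comparator factory (not part of the package): compares two
# numeric vectors and scores them by their absolute difference.
cmp_abs_diff <- function(max_diff = 1) {
  function(x, y) {
    if (missing(y)) {
      # One argument: `x` holds the result of an earlier comparison.
      # Translate it to TRUE/FALSE; missing scores become FALSE so the
      # result contains no NA.
      return(!is.na(x) & x > 0)
    }
    # Two arguments: similarity in [0, 1]; 1 for equal values and 0 when
    # the values differ by max_diff or more. NA inputs give an NA score.
    pmax(0, 1 - abs(x - y) / max_diff)
  }
}

cmp <- cmp_abs_diff(max_diff = 10)
score <- cmp(c(30, 41, 55, NA), c(30, 45, 80, 20))
binary <- cmp(score)
```

A function like this could presumably be supplied through the comparators argument of compare_pairs, in the same way as the built-in cmp_* functions.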
The jaro_winkler, lcs and jaccard functions use the corresponding methods from stringdist, except that they are transformed from a distance to a similarity score.
The functions return a comparison function (see details).
The functions identical, jaro_winkler, lcs and jaccard are deprecated and will be removed in future versions of the package. Instead use the functions cmp_identical, cmp_jarowinkler, cmp_lcs and cmp_jaccard.
library(reclin2)
cmp <- cmp_identical()
x <- cmp(c("john", "mary", "susan", "jack"),
  c("johan", "mary", "susanna", NA))
# Applying the comparison function to the result of the comparison results
# in a logical result, with NA's and values of FALSE set to FALSE
cmp(x)

cmp <- cmp_jarowinkler(0.95)
x <- cmp(c("john", "mary", "susan", "jack"),
  c("johan", "mary", "susanna", NA))
# Applying the comparison function to the result of the comparison results
# in a logical result, with NA's and values below the threshold FALSE
cmp(x)
Compare pairs on a set of variables common in both data sets
## S3 method for class 'cluster_pairs'
compare_pairs(
  pairs,
  on,
  comparators = list(default_comparator),
  default_comparator = cmp_identical(),
  new_name = NULL,
  ...
)

compare_pairs(
  pairs,
  on,
  comparators = list(default_comparator),
  default_comparator = cmp_identical(),
  ...
)

## S3 method for class 'pairs'
compare_pairs(
  pairs,
  on,
  comparators = list(default_comparator),
  default_comparator = cmp_identical(),
  x = attr(pairs, "x"),
  y = attr(pairs, "y"),
  inplace = FALSE,
  ...
)
pairs |
|
on |
character vector of variables that should be compared. |
comparators |
named list of functions with which the variables are compared.
This function should accept two vectors. Function should either return a vector
or a |
default_comparator |
variables for which no comparison function is defined using
|
new_name |
name of new object to assign the pairs to on the cluster nodes. |
... |
Ignored for now |
x |
|
y |
|
inplace |
logical indicating whether |
It is assumed the variables in on are present in both x and y. Variables with the same names are added to pairs.
When the comparator returns a data.table, multiple columns are added to pairs. The names of these columns are the variable name pasted together with the names of the data.table returned by comparator (separated by "_").
Returns the data.table pairs with one or more columns added in case of compare_pairs.pairs.
In case of compare_pairs.cluster_pairs, compare_pairs.pairs is called on each cluster node and the resulting pairs are assigned to new_name in the environment reclin_env. When new_name is not given (or equal to NULL), the original pairs on the nodes are overwritten.
Compare pairs on given variables
## S3 method for class 'cluster_pairs'
compare_vars(
  pairs,
  variable,
  on_x = variable,
  on_y = on_x,
  comparator = cmp_identical(),
  new_name = NULL,
  ...
)

compare_vars(
  pairs,
  variable,
  on_x = variable,
  on_y = on_x,
  comparator = cmp_identical(),
  ...
)

## S3 method for class 'pairs'
compare_vars(
  pairs,
  variable,
  on_x = variable,
  on_y = on_x,
  comparator = cmp_identical(),
  x = attr(pairs, "x"),
  y = attr(pairs, "y"),
  inplace = FALSE,
  ...
)
pairs |
|
variable |
character vector with name of resulting column name that is added to pairs. |
on_x |
character vector with the column names from |
on_y |
character vector with the column names from |
comparator |
function with which the variables are compared. When |
new_name |
name of new object to assign the pairs to on the cluster nodes. |
... |
Passed on to the comparator function. |
x |
|
y |
|
inplace |
logical indicating whether |
When comparator returns a data.table, multiple columns are added to pairs. The names of these columns are the variable name pasted together with the names of the data.table returned by comparator (separated by "_").
Returns the data.table pairs with one or more columns added.
Deduplication using equivalence groups
deduplicate_equivalence(pairs, variable, selection, x = attr(pairs, "x"))
pairs |
a |
variable |
name of the variable to create in |
selection |
a logical variable with the same length as |
x |
the first data set; when missing |
Returns x with a variable containing the group labels. Records with the same group label (should) correspond to the same entity.
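The idea of equivalence groups can be illustrated with a small base R sketch: records connected through selected pairs, directly or through a chain of pairs, end up with the same label. The helper `equivalence_groups` below is hypothetical and only an assumption about the general technique (a simple union-find); it is not the package's implementation.

```r
# Hypothetical helper: label n records so that records connected by the
# selected pairs (from[k], to[k]) share one group label.
equivalence_groups <- function(n, from, to) {
  from <- as.integer(from)
  to <- as.integer(to)
  parent <- seq_len(n)
  # Follow parent links until a record that is its own parent is reached.
  find_root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }
  for (k in seq_along(from)) {
    a <- find_root(from[k])
    b <- find_root(to[k])
    # Merge the two groups; the smallest record index becomes the label.
    if (a != b) parent[max(a, b)] <- min(a, b)
  }
  vapply(seq_len(n), find_root, integer(1))
}

# Six records; pairs (1,2), (2,3) and (4,5) were selected as matches.
groups <- equivalence_groups(6, from = c(1, 2, 4), to = c(2, 3, 5))
```

Records 1, 2 and 3 form one group even though the pair (1,3) was never selected, which is the transitive behaviour that the "(should)" in the text hints at.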
Get a subset of pairs to inspect
get_inspect_pairs(
  pairs,
  variable,
  threshold,
  position = NULL,
  n = 11,
  x = attr(pairs, "x"),
  y = attr(pairs, "y")
)
pairs |
|
variable |
name of variable to base the selection on; should be a variable with the similarity score of the pairs. |
threshold |
the threshold around which to select pairs. Used when position is not given. |
position |
select pairs around this position (based on order of
|
n |
number of pairs to select. Pairs are selected symmetrically around the threshold. |
x |
|
y |
|
Returns a list with the elements pairs: the selected pairs; x: the records from x corresponding to the pairs; y: the records from y corresponding to the pairs; position: the position of the selected pairs; index: the index of the pairs in pairs.
Greedy one-to-one matching of pairs
greedy(x, y, weight, n = 1L, m = 1L, include_ties = FALSE)
x |
id's of lhs of pairs; converted to integer |
y |
id's of rhs of pairs; converted to integer |
weight |
numeric vector with weight of pair |
n |
an integer. Each element of x can be linked to at most n elements of y. |
m |
an integer. Each element of y can be linked to at most m elements of x. |
include_ties |
when pairs for a given record have an equal weight, should all pairs be included. |
Pairs with the highest weight are selected as long as neither the lhs nor the rhs is already selected in a pair with a higher weight. When include_ties is TRUE, all pairs are included when multiple pairs for a given record have an equal weight.
A logical vector with the same length as x.
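The greedy rule described above can be sketched in a few lines of base R for the one-to-one case. The function `greedy_sketch` is a hypothetical illustration under the assumption n = 1, m = 1 and no special handling of ties; it is not the package's implementation.

```r
# Hypothetical one-to-one greedy selection: walk through the pairs from
# highest to lowest weight and keep a pair only if both of its records
# are still unused.
greedy_sketch <- function(x, y, weight) {
  selected <- logical(length(x))
  used_x <- c()
  used_y <- c()
  for (i in order(weight, decreasing = TRUE)) {
    if (!(x[i] %in% used_x) && !(y[i] %in% used_y)) {
      selected[i] <- TRUE
      used_x <- c(used_x, x[i])
      used_y <- c(used_y, y[i])
    }
  }
  selected
}

x <- c(1, 1, 2, 2)
y <- c(1, 2, 1, 2)
w <- c(0.9, 0.8, 0.85, 0.1)
sel <- greedy_sketch(x, y, w)
```

In this toy data the greedy rule keeps the first and last pair (total weight 1.0), while the second and third pair together would have a total weight of 1.65. This is the trade-off of a greedy selection: it is simple and fast, but it does not necessarily maximise the total weight.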
Use the selected pairs to generate a linked data set
link(
  pairs,
  selection = NULL,
  all = FALSE,
  all_x = all,
  all_y = all,
  x = attr(pairs, "x"),
  y = attr(pairs, "y"),
  suffixes = c(".x", ".y"),
  keep_from_pairs = c(".x", ".y")
)
pairs |
a |
selection |
a logical variable with the same length as |
all |
return all records from |
all_x |
return all records from |
all_y |
return all records from |
x |
the first data set; when missing |
y |
the second data set; when missing |
suffixes |
a character vector of length 2 specifying the suffixes to be used for making unique the names of columns in the result. |
keep_from_pairs |
character vector with names of variables in |
Uses the selected pairs to link the two data sets to each other. Renames variables that are in both data sets.
Returns a data.table containing records from x and y and pairs. Columns that occur both in x and y gain a suffix indicating from which data set they are.
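What linking amounts to can be sketched in base R with plain data frames. The toy data, the `selected` column and the way suffixes are applied are assumptions made for this illustration only; the sketch covers just the default case where unmatched records are dropped.

```r
# Toy data sets and a set of pairs; .x and .y are row numbers in x and y.
x <- data.frame(id = 1:3, name = c("ann", "bob", "cas"))
y <- data.frame(id = c(10, 20), name = c("anne", "bob"))
pairs <- data.frame(.x = c(1, 2, 3), .y = c(1, 2, 2),
  selected = c(TRUE, TRUE, FALSE))

# Keep only the selected pairs and look up the records they point to.
sel <- pairs[pairs$selected, ]
linked <- cbind(
  setNames(x[sel$.x, ], paste0(names(x), ".x")),
  setNames(y[sel$.y, ], paste0(names(y), ".y"))
)
```

Here every column occurs in both data sets, so every column receives a suffix; columns that occur in only one of the data sets would keep their original name.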
Contains fictional records of 7 persons.
Two data frames with 6 and 5 records, respectively, and 6 columns.
id: the id of the person; this contains no errors and can be used to validate the linkage.
lastname: the last name of the person; contains errors.
firstname: the first name of the person; contains errors.
address: the address; contains errors.
sex: the sex; contains errors and missing values.
postcode: the postcode; contains no errors.
Force n to m matching on a set of pairs
match_n_to_m(x, y, w, n = 1, m = 1)
x |
a vector of identifiers for each x in each pair. This vector should have a unique value for each element in x. |
y |
a vector of identifiers for each y in each pair. This vector should have a unique value for each element in y. |
w |
a vector with weights for each pair. The algorithm will try to maximise the total weight of the selected pairs. |
n |
an integer. Each element of x can be linked to at most n elements of y. |
m |
an integer. Each element of y can be linked to at most m elements of x. |
The algorithm will try to select pairs in such a way that each element of x is matched to at most n elements of y and that each element of y is matched to at most m elements of x. It tries to select elements in such a way that the total weight w of the selected elements is maximised.
A logical vector with the same length as x indicating the selected records.
library(reclin2)
d <- data.frame(x=c(1,1,1,2,2,3,3), y=c(1,2,3,4,5,6,7), w=1:7)
# One-to-one matching:
d[match_n_to_m(d$x, d$y, d$w), ]

# N-to-one matching:
d[match_n_to_m(d$x, d$y, d$w, n=999), ]

# One-to-m matching:
d[match_n_to_m(d$x, d$y, d$w, m=999), ]

# N-to-M matching, e.g. select all pairs
d[match_n_to_m(d$x, d$y, d$w, n=999, m=999), ]
Merge two sets of pairs into one
## S3 method for class 'cluster_pairs'
merge_pairs(
  pairs1,
  pairs2,
  name = paste(pairs1$name, pairs2$name, sep = "+"),
  ...
)

## S3 method for class 'cluster_pairs'
rbind(...)

merge_pairs(pairs1, pairs2, ...)

## S3 method for class 'pairs'
merge_pairs(pairs1, pairs2, ...)

## S3 method for class 'pairs'
rbind(...)
pairs1 |
the first set of pairs |
pairs2 |
the second set of pairs |
name |
name of new object to assign the pairs to on the cluster nodes. |
... |
for |
The function will give an error when the two sets of pairs have different values for attr(pairs1, "x") and attr(pairs1, "y"). When these attributes are missing, the code will not generate an error; the user is then responsible for ensuring that the indices in pairs1 and pairs2 refer to the same datasets.
Returns a pairs or cluster_pairs object where both sets of pairs are combined. Duplicate pairs are removed.
In case of merge_pairs.cluster_pairs, merge_pairs.pairs is called on each cluster node and the resulting pairs are assigned to name in the environment reclin_env.
Generates all combinations of records from x and y.
pair(x, y, deduplication = FALSE, add_xy = TRUE)
x |
first |
y |
second |
deduplication |
generate pairs from only |
add_xy |
add |
Generating (all) pairs of the records of two data sets is usually the first step when linking the two data sets.
A data.table with two columns, .x and .y, is returned. Columns .x and .y are row numbers from the data.frames x and y, respectively.
pair_blocking and pair_minsim are other methods to generate pairs.
library(reclin2)
data("linkexample1", "linkexample2")
pairs <- pair(linkexample1, linkexample2)
Generates all combinations of records from x and y where the blocking variables are equal.
pair_blocking(x, y, on, deduplication = FALSE, add_xy = TRUE)
x |
first |
y |
second |
on |
the variables defining the blocks or strata for which
all pairs of |
deduplication |
generate pairs from only |
add_xy |
add |
Generating (all) pairs of the records of two data sets is usually the first step when linking the two data sets. However, this often results in too large a number of pairs. Therefore, blocking is usually applied.
A data.table with two columns, .x and .y, is returned. Columns .x and .y are row numbers from the data.frames x and y, respectively.
pair and pair_minsim are other methods to generate pairs.
library(reclin2)
data("linkexample1", "linkexample2")
pairs <- pair_blocking(linkexample1, linkexample2, "postcode")
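To illustrate why blocking reduces the number of pairs, the following base R sketch compares the full cross product with the pairs that agree on a blocking variable. The toy data and the use of merge are assumptions for this illustration; the package itself may generate the pairs differently.

```r
# Toy data with a single blocking variable.
x <- data.frame(postcode = c("A", "A", "B"))
y <- data.frame(postcode = c("A", "B", "B", "C"))
x$.x <- seq_len(nrow(x))
y$.y <- seq_len(nrow(y))

# All pairs: every record of x combined with every record of y.
all_pairs <- expand.grid(.x = x$.x, .y = y$.y)

# Blocked pairs: only combinations with an equal postcode.
blocked <- merge(x, y, by = "postcode")[, c(".x", ".y")]

n_all <- nrow(all_pairs)
n_blocked <- nrow(blocked)
```

The price of blocking is that true matches with an error in the blocking variable are never generated, so a variable with few errors is usually the better choice.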
Generates all combinations of records from x and y that have a similarity score of at least minsim.
pair_minsim(
  x,
  y,
  on,
  minsim = 0,
  on_blocking = character(0),
  comparators = list(default_comparator),
  default_comparator = cmp_identical(),
  keep_simsum = TRUE,
  deduplication = FALSE,
  add_xy = TRUE
)
x |
first |
y |
second |
on |
the variables defining on which the pairs of records from |
minsim |
minimal similarity score. |
on_blocking |
variables for which the pairs have to match. |
comparators |
named list of functions with which the variables are compared.
This function should accept two vectors. Function should either return a vector
or a |
default_comparator |
variables for which no comparison function is defined using
|
keep_simsum |
add a variable |
deduplication |
generate pairs from only |
add_xy |
add |
Generating (all) pairs of the records of two data sets is usually the first step when linking the two data sets. However, this often results in too large a number of pairs. pair_minsim will only keep pairs with a similarity score equal to or larger than minsim. The similarity score is calculated by summing the results of the comparators for all variables of on.
Missing values in the variables on which the pairs are compared count as a similarity of 0.
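The scoring rule above can be sketched in base R. The toy data, the helper `same` (an exact-match comparator) and the column name `simsum` are assumptions made for this illustration; the sketch builds all pairs first, which the actual function presumably avoids.

```r
x <- data.frame(postcode = c("1000", "2000"),
  address = c("Main 1", "Park 5"))
y <- data.frame(postcode = c("1000", "3000", NA),
  address = c("Main 1", "Park 5", "Main 1"))

# All combinations of row numbers from x and y.
pairs <- expand.grid(.x = seq_len(nrow(x)), .y = seq_len(nrow(y)))

# Exact-match comparator; a missing value counts as a similarity of 0.
same <- function(a, b) {
  s <- a == b
  s[is.na(s)] <- FALSE
  as.numeric(s)
}

# Similarity score: sum of the comparison results over all variables.
pairs$simsum <- same(x$postcode[pairs$.x], y$postcode[pairs$.y]) +
  same(x$address[pairs$.x], y$address[pairs$.y])

# Keep pairs with a score equal to or larger than minsim = 1.
kept <- pairs[pairs$simsum >= 1, ]
```

Unlike blocking on postcode, the pair with the missing postcode is still kept here because its address agrees.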
A data.table with two columns, .x and .y, is returned. Columns .x and .y are row numbers from the data.frames x and y, respectively.
pair and pair_blocking are other methods to generate pairs.
library(reclin2)
data("linkexample1", "linkexample2")
pairs <- pair_minsim(linkexample1, linkexample2,
  on = c("postcode", "address"), minsim = 1)
# Either address or postcode has to match to keep a pair

data("linkexample1", "linkexample2")
pairs <- pair_minsim(linkexample1, linkexample2, on_blocking = "postcode",
  on = c("lastname", "firstname", "address"), minsim = 2)
# Postcode has to match; from lastname, firstname, address there have to match
# two or more (i.e. one mismatch is allowed).
Calculate weights and probabilities for pairs
## S3 method for class 'problink_em'
predict(
  object,
  pairs = newdata,
  newdata = NULL,
  type = c("weights", "mpost", "probs", "all"),
  binary = FALSE,
  add = FALSE,
  comparators,
  inplace = FALSE,
  new_name = NULL,
  ...
)
object |
an object of type |
pairs |
a object with pairs for which to calculate weights. |
newdata |
an alternative name for the |
type |
a character vector of length one specifying what to calculate. See results for more information. |
binary |
convert comparison vectors to binary vectors using the comparison function in comparators. |
add |
add the predictions to the original pairs object. |
comparators |
a list of comparison functions (see |
inplace |
logical indicating whether |
new_name |
name of new object to assign the pairs to on the cluster
nodes (only relevant when pairs is of type |
... |
unused. |
When pairs is of type pairs, returns a data.table with either the .x and .y columns from pairs (when add = FALSE) or all columns of pairs. To these columns are added:
In case of type = "weights", a column weights with the calculated weights.
In case of type = "mpost", a column mpost with the calculated posterior probabilities (the probability that a pair is a match given its comparison vector).
In case of type = "probs", the columns mprob and uprob with the m- and u-probabilities, and mpost and upost with the posterior m- and u-probabilities.
In case of type = "all", all of the above.
When pairs is of type cluster_pairs, the predictions are calculated on each cluster node and the resulting pairs are assigned to new_name in the environment reclin_env. When new_name is not given (or equal to NULL) the original pairs on the nodes are overwritten.
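To make the output types above concrete, the following base-R sketch computes a Fellegi-Sunter style weight and the posterior probabilities for a single binary comparison vector. The m- and u-probabilities, the prior `p` and the use of the natural logarithm are illustrative assumptions; they are not values or internals taken from the package.

```r
# Hypothetical m- and u-probabilities for three comparison variables
mprobs <- c(lastname = 0.95, firstname = 0.90, sex = 0.98)
uprobs <- c(lastname = 0.02, firstname = 0.05, sex = 0.50)
p <- 0.05  # assumed prior probability that a pair is a match

# Binary comparison vector: 1 = agreement, 0 = disagreement
gamma <- c(lastname = 1, firstname = 1, sex = 0)

# Probability of observing gamma among matches (mprob) and non-matches (uprob)
mprob <- prod(ifelse(gamma == 1, mprobs, 1 - mprobs))
uprob <- prod(ifelse(gamma == 1, uprobs, 1 - uprobs))

# Weight: log-likelihood ratio of match versus non-match
weight <- log(mprob / uprob)

# Posterior probabilities via Bayes' rule
mpost <- p * mprob / (p * mprob + (1 - p) * uprob)
upost <- 1 - mpost
```

A positive weight means the comparison vector is more likely among matches than among non-matches; the posterior additionally takes the (here assumed) share of matches among all pairs into account.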
Calculate EM-estimates of m- and u-probabilities
    problink_em(
      formula,
      data,
      patterns,
      mprobs0 = list(0.95),
      uprobs0 = list(0.02),
      p0 = 0.05,
      tol = 1e-05,
      mprob_max = 0.999,
      uprob_min = 1e-04
    )
formula |
a formula object with the variables for which to calculate the
m- and u-probabilities. Should be of the form |
data |
data set with pairs on which to estimate the model. Alternatively
one can use the |
patterns |
table of patterns (as output by
|
mprobs0, uprobs0 |
initial values of the m- and u-probabilities. These
should be lists with numeric values. The names of the elements in the list
should correspond to the names in |
p0 |
the initial estimate of the probability that a pair is a match. |
tol |
when the change in the m and u-probabilities is smaller than |
mprob_max |
maximum value of the estimated m-probabilities. Values equal to one can lead to numerical instabilities. |
uprob_min |
minimum value of the estimated u-probabilities. Values equal to zero can lead to numerical instabilities. |
Returns an object of type problink_em. This is a list containing the estimated mprobs, uprobs and overall linkage probability p. It also contains the table of comparison patterns.
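As an illustration of what such an estimation involves, here is a minimal base-R sketch of an EM iteration for a two-class model with conditionally independent binary comparisons. It is not the package's implementation: the pattern counts are invented, and the starting values, tolerance and bounds merely mirror the defaults shown in the usage above.

```r
# Toy table of binary comparison patterns with counts (hypothetical data)
patterns <- data.frame(
  lastname  = c(1, 1, 1, 1, 0, 0, 0, 0),
  firstname = c(1, 1, 0, 0, 1, 1, 0, 0),
  sex       = c(1, 0, 1, 0, 1, 0, 1, 0),
  n         = c(45, 3, 4, 8, 6, 12, 450, 472)
)
X <- as.matrix(patterns[, c("lastname", "firstname", "sex")])
n <- patterns$n

# Starting values and bounds (mirroring the defaults in the usage section)
m <- rep(0.95, ncol(X)); u <- rep(0.02, ncol(X)); p <- 0.05
tol <- 1e-5; mprob_max <- 0.999; uprob_min <- 1e-4

# Probability of each pattern given per-variable agreement probabilities
pattern_prob <- function(X, prob) {
  apply(X, 1, function(x) prod(ifelse(x == 1, prob, 1 - prob)))
}

for (iter in seq_len(1000)) {
  # E-step: posterior probability that a pattern comes from a match
  pm <- p * pattern_prob(X, m)
  pu <- (1 - p) * pattern_prob(X, u)
  g <- pm / (pm + pu)
  # M-step: weighted share of agreement among matches and non-matches
  m_new <- pmin(colSums(X * n * g) / sum(n * g), mprob_max)
  u_new <- pmax(colSums(X * n * (1 - g)) / sum(n * (1 - g)), uprob_min)
  p <- sum(n * g) / sum(n)
  # Stop when the m- and u-probabilities change by less than tol
  converged <- max(abs(c(m_new - m, u_new - u))) < tol
  m <- m_new; u <- u_new
  if (converged) break
}
```

Working on the table of patterns rather than on individual pairs keeps each iteration cheap, because the number of distinct binary patterns is small compared with the number of pairs.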
Fellegi, I. and A. Sunter (1969). "A Theory for Record Linkage", Journal of the American Statistical Association. 64 (328): pp. 1183-1210. doi:10.2307/2286061.
Herzog, T.N., F.J. Scheuren and W.E. Winkler (2007). Data Quality and Record Linkage Techniques, Springer.
    data("linkexample1", "linkexample2")
    pairs <- pair_blocking(linkexample1, linkexample2, "postcode")
    pairs <- compare_pairs(pairs, c("lastname", "firstname", "address", "sex"))
    model <- problink_em(~ lastname + firstname + address + sex, data = pairs)
    summary(model)
Score pairs based on a number of comparison vectors
    ## S3 method for class 'cluster_pairs'
    score_simple(
      pairs,
      variable,
      on,
      w1 = 1,
      w0 = 0,
      wna = 0,
      new_name = NULL,
      ...
    )

    score_simple(pairs, variable, on, w1 = 1, w0 = 0, wna = 0, ...)

    ## S3 method for class 'pairs'
    score_simple(
      pairs,
      variable,
      on,
      w1 = 1,
      w0 = 0,
      wna = 0,
      inplace = FALSE,
      ...
    )
pairs |
a |
variable |
the name of the new variable to create in pairs. This will be a
numeric variable containing the score of each pair. |
on |
character vector of variables on which the score should be based. |
w1 |
a vector or list with weights for agreement for each of the
variables. It can either be a numeric vector of length 1 in which case the
same weight is used for all variables; A numeric vector of length equal to
the length of |
w0 |
a vector or list with weights for non-agreement for each of the
variables. See details for more information. For the format see |
wna |
a vector or list with weights for missing values (NA) for each of the
variables. See details for more information. For the format see |
new_name |
name of new object to assign the pairs to on the cluster nodes. |
... |
ignored |
inplace |
logical indicating whether |
The individual contribution of a variable x to the total score is given by x * w1 + (1-x) * w0 in case of non-NA values and wna in case of NA. This assumes that the value 1 corresponds to complete agreement and the value 0 to complete non-agreement. In case of complete agreement a variable contributes w1 to the total score and in case of complete non-agreement it contributes w0 to the total score.
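The formula above can be traced by hand with a few lines of base R. This is only a sketch of the arithmetic, using a single weight for all variables; the comparison values and weights are made up, and `contribution` is a hypothetical helper, not a function of the package.

```r
# Hypothetical comparison values for three pairs; NA means "could not compare"
cmp <- data.frame(
  firstname = c(1, 0, NA),
  lastname  = c(1, 1, 0)
)
w1 <- 2; w0 <- -1; wna <- 0

# Contribution of one variable: x * w1 + (1 - x) * w0, or wna when x is NA
contribution <- function(x, w1, w0, wna) {
  ifelse(is.na(x), wna, x * w1 + (1 - x) * w0)
}

# Total score per pair: sum of the contributions of all variables
score <- rowSums(sapply(cmp, contribution, w1 = w1, w0 = w0, wna = wna))
score
# 4 1 -1
```

The first pair agrees on both variables (2 + 2), the second disagrees on firstname (-1 + 2), and the third has a missing firstname and a disagreeing lastname (0 - 1).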
Returns the data.table pairs with the column variable added in case of score_simple.pairs.
In case of score_simple.cluster_pairs, score_simple.pairs is called on each cluster node and the resulting pairs are assigned to new_name in the environment reclin_env. When new_name is not given (or equal to NULL) the original pairs on the nodes are overwritten.
    data("linkexample1", "linkexample2")
    pairs <- pair_blocking(linkexample1, linkexample2, "postcode")
    compare_pairs(pairs, on = c("firstname", "lastname", "sex"), inplace = TRUE)
    score_simple(pairs, "score", on = c("firstname", "lastname", "sex"))

    # Change the default weights
    score_simple(pairs, "score", on = c("firstname", "lastname", "sex"),
      w1 = 2, w0 = -1, wna = NA)

    # Use a named vector; omitted elements from w1 get a weight of 1; those from
    # w0 and wna a weight of 0.
    score_simple(pairs, "score", on = c("firstname", "lastname", "sex"),
      w1 = c("firstname" = 2, "lastname" = 3),
      w0 = c("firstname" = -1, "lastname" = -0.5))

    # Use a named list; omitted elements from w1 get a weight of 1; those from
    # w0 and wna a weight of 0.
    score_simple(pairs, "score", on = c("firstname", "lastname", "sex"),
      w1 = list("firstname" = 2, "lastname" = 3),
      w0 = list("firstname" = -1, "lastname" = -0.5))
Select matching pairs enforcing one-to-one linkage
    ## S3 method for class 'cluster_pairs'
    select_greedy(
      pairs, variable, score, threshold,
      preselect = NULL, id_x = NULL, id_y = NULL, ...
    )

    ## S3 method for class 'cluster_pairs'
    select_n_to_m(
      pairs, variable, score, threshold,
      preselect = NULL, id_x = NULL, id_y = NULL, ...
    )

    select_greedy(
      pairs, variable, score, threshold,
      preselect = NULL, id_x = NULL, id_y = NULL, ...
    )

    ## S3 method for class 'pairs'
    select_greedy(
      pairs, variable, score, threshold,
      preselect = NULL, id_x = NULL, id_y = NULL,
      x = attr(pairs, "x"), y = attr(pairs, "y"),
      inplace = FALSE, include_ties = FALSE, n = 1L, m = 1L, ...
    )

    select_n_to_m(
      pairs, variable, score, threshold,
      preselect = NULL, id_x = NULL, id_y = NULL, ...
    )

    ## S3 method for class 'pairs'
    select_n_to_m(
      pairs, variable, score, threshold,
      preselect = NULL, id_x = NULL, id_y = NULL,
      x = attr(pairs, "x"), y = attr(pairs, "y"),
      inplace = FALSE, ...
    )
pairs |
a |
variable |
the name of the new variable to create in pairs. This will be a
logical variable with a value of |
score |
name of the score/weight variable of the pairs. When not given
and |
threshold |
the threshold to apply. Pairs with a score above the threshold are selected. |
preselect |
a logical variable with the same length as |
id_x |
an integer vector with the same length as the number of rows in
|
id_y |
an integer vector with the same length as the number of rows in
|
... |
Used to pass additional arguments to methods |
x |
|
y |
|
inplace |
logical indicating whether |
include_ties |
when pairs for a given record have an equal weight, should all pairs be included. |
n |
an integer. Each element of x can be linked to at most n elements of y. |
m |
an integer. Each element of y can be linked to at most m elements of x. |
Both methods force one-to-one matching. select_greedy uses a greedy algorithm that selects the first pair with the highest weight. select_n_to_m tries to optimise the total weight of all of the selected pairs. In general this will result in a better selection. However, select_n_to_m uses much more memory and is much slower and, therefore, can only be used when the number of possible pairs is not too large.
Note that when include_ties = TRUE the same record can still be selected more than once. In that case the pairs will have an equal weight.
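The greedy strategy described above can be sketched in base R as follows. This is a simplified illustration on invented pairs, not the package's code: it ignores `preselect`, `include_ties`, `n` and `m`, and assumes a pair must score strictly above the threshold.

```r
# Hypothetical candidate pairs: record ids in x and y plus a score
pairs <- data.frame(
  .x    = c(1, 1, 2, 2, 3),
  .y    = c(1, 2, 1, 2, 2),
  score = c(0.9, 0.8, 0.85, 0.3, 0.6)
)
threshold <- 0.5

# Visit pairs from highest to lowest score; keep a pair only if neither of
# its records has been used by an earlier (higher scoring) pair
ord <- order(pairs$score, decreasing = TRUE)
used_x <- c(); used_y <- c()
selected <- logical(nrow(pairs))
for (i in ord) {
  if (pairs$score[i] <= threshold) break
  if (!(pairs$.x[i] %in% used_x) && !(pairs$.y[i] %in% used_y)) {
    selected[i] <- TRUE
    used_x <- c(used_x, pairs$.x[i])
    used_y <- c(used_y, pairs$.y[i])
  }
}
pairs$greedy <- selected
```

In this toy data the greedy pass keeps rows 1 and 5 (total score 1.5), whereas keeping rows 2 and 3 would give a total of 1.65. That gap is the kind of improvement an optimising selection such as select_n_to_m aims for, at the cost of more memory and time.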
Returns the pairs with the variable given by variable added. This is a logical variable indicating which pairs are selected as matches.
    data("linkexample1", "linkexample2")
    pairs <- pair_blocking(linkexample1, linkexample2, "postcode")
    pairs <- compare_pairs(pairs, c("lastname", "firstname", "address", "sex"))
    model <- problink_em(~ lastname + firstname + address + sex, data = pairs)
    pairs <- predict(model, pairs, type = "mpost", add = TRUE, binary = TRUE)

    # Select pairs with a mpost > 0.5 and force one-to-one linkage
    pairs <- select_n_to_m(pairs, "ntom", "mpost", 0.5)
    pairs <- select_greedy(pairs, "greedy", "mpost", 0.5)
    table(pairs$ntom, pairs$greedy)

    # The same example as above using a cluster
    library(parallel)
    cl <- makeCluster(2)
    pairs <- cluster_pair_blocking(cl, linkexample1, linkexample2, "postcode")
    compare_pairs(pairs, c("lastname", "firstname", "address", "sex"))
    model <- problink_em(~ lastname + firstname + address + sex, data = pairs)
    predict(model, pairs, type = "mpost", add = TRUE, binary = TRUE)

    # Select pairs with a mpost > 0.5 and force one-to-one linkage
    # select_n_to_m and select_greedy only work on pairs that are local;
    # therefore we first collect the pairs
    select_threshold(pairs, "selected", "mpost", 0.5)
    local_pairs <- cluster_collect(pairs, "selected")
    local_pairs <- select_n_to_m(local_pairs, "ntom", "mpost", 0.5)
    local_pairs <- select_greedy(local_pairs, "greedy", "mpost", 0.5)
    table(local_pairs$ntom, local_pairs$greedy)
    stopCluster(cl)
Select matching pairs with a score above or equal to a threshold
    ## S3 method for class 'cluster_pairs'
    select_threshold(pairs, variable, score, threshold, new_name = NULL, ...)

    select_threshold(pairs, variable, score, threshold, ...)

    ## S3 method for class 'pairs'
    select_threshold(pairs, variable, score, threshold, inplace = FALSE, ...)
pairs |
a |
variable |
the name of the new variable to create in pairs. This will be a
logical variable with a value of |
score |
name of the score/weight variable of the pairs. When not given
and |
threshold |
the threshold to apply. Pairs with a score above or equal to the threshold are selected. |
new_name |
name of new object to assign the pairs to on the cluster nodes. |
... |
ignored |
inplace |
logical indicating whether |
Returns the pairs with the variable given by variable added. This is a logical variable indicating which pairs are selected as matches.
    data("linkexample1", "linkexample2")
    pairs <- pair_blocking(linkexample1, linkexample2, "postcode")
    pairs <- compare_pairs(pairs, c("lastname", "firstname", "address", "sex"))
    model <- problink_em(~ lastname + firstname + address + sex, data = pairs)
    pairs <- predict(model, pairs, type = "mpost", add = TRUE, binary = TRUE)

    # Select pairs with a mpost > 0.5
    select_threshold(pairs, "selected", "mpost", 0.5, inplace = TRUE)

    # Example using cluster;
    # In general the syntax is exactly the same except for the first call
    # to cluster_pair. Note that in general `inplace = TRUE` is implied when
    # working with a cluster; therefore the assignment back to pairs can be
    # omitted (also not a problem if it is not).
    library(parallel)
    data("linkexample1", "linkexample2")
    cl <- makeCluster(2)
    pairs <- cluster_pair(cl, linkexample1, linkexample2)
    compare_pairs(pairs, c("lastname", "firstname", "address", "sex"))
    model <- problink_em(~ lastname + firstname + address + sex, data = pairs)
    predict(model, pairs, type = "mpost", add = TRUE, binary = TRUE)

    # Select pairs with a mpost > 0.5
    # Unlike the regular pairs: inplace = TRUE is implied here
    select_threshold(pairs, "selected", "mpost", 0.5)
    stopCluster(cl)
Deselect pairs that are linked to multiple records
    ## S3 method for class 'cluster_pairs'
    select_unique(
      pairs, variable, preselect = NULL, n = 1, m = 1,
      id_x = NULL, id_y = NULL, ...
    )

    select_unique(
      pairs, variable, preselect = NULL, n = 1, m = 1,
      id_x = NULL, id_y = NULL, ...
    )

    ## S3 method for class 'pairs'
    select_unique(
      pairs, variable, preselect = NULL, n = 1, m = 1,
      id_x = NULL, id_y = NULL,
      x = attr(pairs, "x"), y = attr(pairs, "y"),
      inplace = FALSE, ...
    )
pairs |
a |
variable |
the name of the new variable to create in pairs. This will be a
logical variable with a value of |
preselect |
a logical variable with the same length as |
n |
do not select pairs with a y-record that is linked to more than
|
m |
do not select pairs with an x-record that is linked to more than
|
id_x |
an integer vector with the same length as the number of rows in
|
id_y |
an integer vector with the same length as the number of rows in
|
... |
Used to pass additional arguments to methods |
x |
|
y |
|
inplace |
logical indicating whether |
This function can be used to remove pairs for which there is ambiguity. For example, when a record from x is linked to multiple records from y and we know that there are no duplicate records in y (records that belong to the same object), then we know that at least one of the two links is incorrect, but we cannot decide which of the two. In that case we may want to decide to link neither of the two records. Running select_unique with m == 1 will deselect both pairs.
In case one wants to select one of the records randomly: select_greedy will select the pair with the highest weight and, in case of an equal weight, the first. Adding a random component to the weights will ensure a random selection.
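The idea of dropping ambiguous links can be sketched in base R as below. The data are invented, and `max_per_x` and `max_per_y` are hypothetical stand-ins for the limits on the number of links per record; the sketch does not claim to reproduce how the package maps its `n` and `m` arguments onto them. The last lines illustrate the random tie-breaking mentioned above.

```r
# Hypothetical preselected pairs: record 1 of x is linked to two records of y
pairs <- data.frame(
  .x     = c(1, 1, 2, 3),
  .y     = c(1, 2, 3, 3),
  select = c(TRUE, TRUE, TRUE, FALSE)
)
max_per_x <- 1  # assumed limit on selected links per x-record
max_per_y <- 1  # assumed limit on selected links per y-record

# Number of selected links each record takes part in
links_x <- ave(as.integer(pairs$select), pairs$.x, FUN = sum)
links_y <- ave(as.integer(pairs$select), pairs$.y, FUN = sum)

# Keep only preselected pairs whose records stay within both limits
pairs$select_unique <- pairs$select &
  links_x <= max_per_x & links_y <= max_per_y

# Random tie-breaking for a greedy selection: add a tiny random component
# to the scores so that ties are no longer resolved by row order
set.seed(1)
score <- c(0.8, 0.8, 0.9, 0.2)
score_jittered <- score + runif(length(score), 0, 1e-6)
```

Both links of x-record 1 are deselected because there is no basis for preferring one over the other; the pair (2, 3) is kept since the other pair involving y-record 3 was not preselected.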
Returns the pairs with the variable given by variable added. This is a logical variable indicating which pairs are selected as matches.
    data("linkexample1", "linkexample2")
    pairs <- pair_blocking(linkexample1, linkexample2, "postcode")
    compare_pairs(pairs, on = c("lastname", "firstname", "address", "sex"),
      default_comparator = jaro_winkler(0.9), inplace = TRUE)
    score_simple(pairs, "score",
      on = c("lastname", "firstname", "address", "sex"),
      w1 = list(lastname = 2), inplace = TRUE)
    select_threshold(pairs, variable = "select", score = "score",
      threshold = 4.0, inplace = TRUE)
    select_unique(pairs, variable = "select_unique", preselect = "select")
Summarise the results from problink_em
    ## S3 method for class 'problink_em'
    summary(object, ...)
object |
the |
... |
ignored |
Returns the original object with a data.frame with the patterns and corresponding m-, u-probabilities and weights added.
Create a table of comparison patterns
    ## S3 method for class 'cluster_pairs'
    tabulate_patterns(pairs, on, comparators, complete = TRUE, ...)

    tabulate_patterns(pairs, on, comparators, complete = TRUE, ...)

    ## S3 method for class 'pairs'
    tabulate_patterns(pairs, on, comparators, complete = TRUE, ...)
pairs |
a |
on |
variables from |
comparators |
a list with comparison functions for each of the
columns. When missing or |
complete |
add patterns that do not occur in the dataset to the result
(with |
... |
passed on to other methods. |
Since comparison vectors can contain continuous numbers (usually between 0 and 1), this could result in a very large number of possible comparison vectors. Therefore, the comparison vectors are passed on to the comparators in order to threshold them. This usually results in values 0 or 1. Missing values are usually coded as 0. However, this all depends on the comparison functions used. For more information see the documentation on the comparison functions.
Returns a data.frame with all unique comparison patterns that exist in pairs, with a column n added with the number of times each pattern occurs.
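What such a table of patterns looks like can be shown with base R alone. The sketch below counts the patterns in a small, invented set of already thresholded comparison vectors and then adds the patterns that do not occur, under the assumption that these get a count of zero; it is not the package's implementation.

```r
# Hypothetical thresholded comparison vectors for five pairs
cmp <- data.frame(
  lastname  = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  firstname = c(TRUE, FALSE, FALSE, TRUE, FALSE)
)

# Count how often each observed pattern occurs
cmp$n <- 1L
patterns <- aggregate(n ~ lastname + firstname, data = cmp, FUN = sum)

# Add patterns that do not occur in the data, with a count of zero
all_patterns <- expand.grid(lastname = c(FALSE, TRUE),
  firstname = c(FALSE, TRUE))
patterns <- merge(all_patterns, patterns, all.x = TRUE)
patterns$n[is.na(patterns$n)] <- 0L
patterns
```

With two binary variables there are four possible patterns; three occur in the toy data and the fourth (lastname differs, firstname agrees) is added with n equal to 0.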
    data("linkexample1", "linkexample2")
    pairs <- pair_blocking(linkexample1, linkexample2, "postcode")
    pairs <- compare_pairs(pairs, c("lastname", "firstname", "address", "sex"))
    tabulate_patterns(pairs)
Contains spelling variations found in various files of a set of town/village names. Names were selected that contain 'rdam' or 'rdm'. The correct/official names are also given. This data set can be used as an example data set for deduplication.
Data frames with 584 records and two columns.
name the name of the town/village as found in the files
official_name the official/correct name