reclin2 can use a cluster created by parallel or snow for record linkage. This has two advantages. First, record linkage can be computationally intensive, as in principle all records from both datasets have to be compared to each other. Splitting the computation over multiple cores or CPUs can give a substantial speed benefit, and the problem is easy to parallelize. Second, when using a snow cluster, the computation can be distributed over multiple machines, allowing reclin2 to use the memory of all of these machines. Besides being computationally intensive, record linkage can also be memory intensive, as all pairs are stored in memory.
Parallelization over k cluster nodes is realised by randomly splitting the first dataset x into k equally sized parts and distributing these parts over the nodes. The second dataset y is copied to each of the nodes. Therefore, it is beneficial for memory consumption if the first dataset is the larger of the two. On each node the local part of x is compared to the local copy of y and a local set of pairs is generated. For most operations there exist methods for cluster_pairs. These usually consist of running the operation for regular pairs on each of the nodes.
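The splitting scheme described above can be sketched with base R and the parallel package alone. This is only an illustration of the idea, not how reclin2 implements it: the toy data frames, the group variable and the use of merge to build blocked pairs are all assumptions made for this sketch.

```r
library(parallel)

# Toy data; the columns are made up for this sketch
x <- data.frame(id = 1:6, postcode = c("A", "A", "A", "A", "A", "B"))
y <- data.frame(id = 1:5, postcode = c("A", "A", "A", "B", "B"))

k <- 2
cl <- makeCluster(k)

# Randomly assign each record of x to one of k equally sized parts
group <- sample(rep(seq_len(k), length.out = nrow(x)))
x_parts <- split(x, group)

# Each node receives one part of x and a full copy of y,
# and builds its own local set of pairs (here: blocking on postcode)
n_pairs <- clusterApply(cl, x_parts, function(x_local, y) {
  pairs <- merge(x_local, y, by = "postcode", suffixes = c(".x", ".y"))
  nrow(pairs)
}, y = y)

stopCluster(cl)

# Together the local pair sets cover all pairs
total_pairs <- sum(unlist(n_pairs))
print(total_pairs)
```

With this toy data the total is 17 pairs, regardless of how the records of x happen to be split over the nodes.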
Below, an example is given using a small cluster. It is assumed that the reader has read the introduction vignette and knows the general procedure of record linkage.
The example from the introduction vignette is repeated here using a cluster.
We will work with a pair of data sets with artificial data. They are tiny, but that allows us to see what happens. In this example we will perform ‘classic’ probabilistic record linkage.
> data("linkexample1", "linkexample2")
> print(linkexample1)
id lastname firstname address sex postcode
1 1 Smith Anna 12 Mainstr F 1234 AB
2 2 Smith George 12 Mainstr M 1234 AB
3 3 Johnson Anna 61 Mainstr F 1234 AB
4 4 Johnson Charles 61 Mainstr M 1234 AB
5 5 Johnson Charly 61 Mainstr M 1234 AB
6 6 Schwartz Ben 1 Eaststr M 6789 XY
> print(linkexample2)
id lastname firstname address sex postcode
1 2 Smith Gearge 12 Mainstreet <NA> 1234 AB
2 3 Jonson A. 61 Mainstreet F 1234 AB
3 4 Johnson Charles 61 Mainstr F 1234 AB
4 6 Schwartz Ben 1 Main M 6789 XY
5 7 Schwartz Anna 1 Eaststr F 6789 XY
We first have to start a cluster. Pairs can then be generated using any of the cluster_pair_* functions.
> library(parallel)
> cl <- makeCluster(2)
> pairs <- cluster_pair_blocking(cl, linkexample1, linkexample2, "postcode")
> print(pairs)
Cluster 'default' with size: 2
First data set: 6 records
Second data set: 5 records
Total number of pairs: 17 pairs
Blocking on: 'postcode'
Showing a random selection of pairs:
.x .y
<int> <int>
1: 5 3
2: 3 3
3: 5 1
4: 1 2
5: 1 1
6: 6 5
7: 4 2
8: 2 3
9: 4 1
10: 6 4
The print function collects a few (at most 6) pairs from each of the nodes and shows those. Other cluster_pair_* functions are cluster_pair and cluster_pair_minsim.
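As a rough sketch of the other two functions: cluster_pair generates all combinations of records, and cluster_pair_minsim keeps only pairs that reach a minimum similarity. The on and minsim argument names used for cluster_pair_minsim are an assumption here; check ?cluster_pair_minsim before relying on them.

```r
library(reclin2)
library(parallel)

data("linkexample1", "linkexample2")
cl <- makeCluster(2)

# All combinations of records from both data sets, without blocking
all_pairs <- cluster_pair(cl, linkexample1, linkexample2)
# Collect right away, in case the next call reuses the same default name on the nodes
n_all <- nrow(cluster_collect(all_pairs))

# Only pairs that agree on at least one of the two variables
# (argument names `on` and `minsim` are assumed, see the text above)
sim_pairs <- cluster_pair_minsim(cl, linkexample1, linkexample2,
  on = c("postcode", "address"), minsim = 1)
n_sim <- nrow(cluster_collect(sim_pairs))

stopCluster(cl)
print(c(all = n_all, minsim = n_sim))
```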
The cluster_pair_* functions return an object of type cluster_pairs. Most other methods work the same as for regular pairs. For example, to compare the pairs on variables:
> compare_pairs(pairs, on = c("lastname", "firstname", "address", "sex"),
+ default_comparator = cmp_jarowinkler(0.9), inplace = TRUE)
> print(pairs)
Cluster 'default' with size: 2
First data set: 6 records
Second data set: 5 records
Total number of pairs: 17 pairs
Blocking on: 'postcode'
Showing a random selection of pairs:
.x .y lastname firstname address sex
<int> <int> <num> <num> <num> <num>
1: 1 3 0.447619 0.4642857 0.9333333 1
2: 3 3 1.000000 0.4642857 1.0000000 1
3: 1 1 1.000000 0.4722222 0.9230769 NA
4: 3 2 0.952381 0.5833333 0.9230769 1
5: 5 3 1.000000 0.8492063 1.0000000 0
6: 4 2 0.952381 0.0000000 0.9230769 0
7: 4 1 0.447619 0.6428571 0.8641026 NA
8: 6 4 1.000000 1.0000000 0.6111111 1
9: 2 2 0.000000 0.0000000 0.8641026 0
10: 4 3 1.000000 1.0000000 1.0000000 0
The code above was copy-pasted from the introduction. There, the argument inplace = TRUE was used, which adds the new variables to the existing pairs. One difference between regular pairs and cluster_pairs is that most methods for cluster_pairs modify the existing pairs in place by default. Therefore, inplace is ignored here and we should use:
> compare_pairs(pairs, on = c("lastname", "firstname", "address", "sex"),
+ default_comparator = cmp_jarowinkler(0.9))
> print(pairs)
Cluster 'default' with size: 2
First data set: 6 records
Second data set: 5 records
Total number of pairs: 17 pairs
Blocking on: 'postcode'
Showing a random selection of pairs:
.x .y lastname firstname address sex
<int> <int> <num> <num> <num> <num>
1: 5 1 0.447619 0.5555556 0.8641026 NA
2: 3 2 0.952381 0.5833333 0.9230769 1
3: 1 3 0.447619 0.4642857 0.9333333 1
4: 5 2 0.952381 0.0000000 0.9230769 0
5: 1 1 1.000000 0.4722222 0.9230769 NA
6: 6 4 1.000000 1.0000000 0.6111111 1
7: 2 3 0.447619 0.5396825 0.9333333 0
8: 4 2 0.952381 0.0000000 0.9230769 0
9: 4 3 1.000000 1.0000000 1.0000000 0
10: 2 1 1.000000 0.8888889 0.9230769 NA
Most methods for cluster_pairs do have a new_name argument that will generate a new set of pairs on the cluster nodes. For example, the following code will generate a new set of pairs and will not modify the existing pairs:
> pairs2 <- compare_pairs(pairs, on =
+ c("lastname", "firstname", "address", "sex"), new_name = "pairs2")
> print(pairs2)
Cluster 'pairs2' with size: 2
First data set: 6 records
Second data set: 5 records
Total number of pairs: 17 pairs
Blocking on: 'postcode'
Showing a random selection of pairs:
.x .y lastname firstname address sex
<int> <int> <lgcl> <lgcl> <lgcl> <lgcl>
1: 1 1 TRUE FALSE FALSE NA
2: 1 2 FALSE FALSE FALSE TRUE
3: 5 2 FALSE FALSE FALSE FALSE
4: 5 1 FALSE FALSE FALSE NA
5: 3 3 TRUE FALSE TRUE TRUE
6: 4 2 FALSE FALSE FALSE FALSE
7: 4 1 FALSE FALSE FALSE NA
8: 2 2 FALSE FALSE FALSE FALSE
9: 2 1 TRUE FALSE FALSE NA
10: 4 3 TRUE TRUE TRUE FALSE
> print(pairs)
Cluster 'default' with size: 2
First data set: 6 records
Second data set: 5 records
Total number of pairs: 17 pairs
Blocking on: 'postcode'
Showing a random selection of pairs:
.x .y lastname firstname address sex
<int> <int> <num> <num> <num> <num>
1: 3 3 1.000000 0.4642857 1.0000000 1
2: 3 1 0.447619 0.4722222 0.8641026 NA
3: 1 1 1.000000 0.4722222 0.9230769 NA
4: 5 3 1.000000 0.8492063 1.0000000 0
5: 1 3 0.447619 0.4642857 0.9333333 1
6: 2 3 0.447619 0.5396825 0.9333333 0
7: 2 1 1.000000 0.8888889 0.9230769 NA
8: 2 2 0.000000 0.0000000 0.8641026 0
9: 4 1 0.447619 0.6428571 0.8641026 NA
10: 6 5 1.000000 0.5277778 1.0000000 0
The function compare_vars offers more flexibility than compare_pairs. It can, for example, compare multiple variables at the same time (e.g. compare birth day and month allowing for swaps) or generate multiple results from comparing on one variable. This method also works on cluster_pairs.
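As a small sketch of the extra flexibility, the following compares the first names twice and stores the results under two different names. The column names firstname_exact and firstname_jw are made up for this illustration.

```r
library(reclin2)
library(parallel)

data("linkexample1", "linkexample2")
cl <- makeCluster(2)
pairs <- cluster_pair_blocking(cl, linkexample1, linkexample2, "postcode")

# Exact comparison of firstname, stored under a new name
compare_vars(pairs, "firstname_exact", on_x = "firstname")

# Jaro-Winkler similarity of the same variable, stored under another name
compare_vars(pairs, "firstname_jw", on_x = "firstname",
  comparator = cmp_jarowinkler(0.9))

# Copy all pairs to the local session to inspect the result
local_pairs <- cluster_collect(pairs)
stopCluster(cl)
print(names(local_pairs))
```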
The next step in the process is to determine which pairs of records belong to the same entity and which do not. As in the introduction vignette we will use the classic method. Again, we hardly need to change the code from the introduction:
> m <- problink_em(~ lastname + firstname + address + sex, data = pairs)
> print(m)
M- and u-probabilities estimated by the EM-algorithm:
Variable M-probability U-probability
lastname 0.9990000 0.001152679
firstname 0.1999999 0.000100000
address 0.8999206 0.285831118
sex 0.3002011 0.285427112
Matching probability: 0.5885595.
> pairs <- predict(m, pairs = pairs, add = TRUE)
> print(pairs)
Cluster 'default' with size: 2
First data set: 6 records
Second data set: 5 records
Total number of pairs: 17 pairs
Blocking on: 'postcode'
Showing a random selection of pairs:
.x .y lastname firstname address sex weights
<int> <int> <num> <num> <num> <num> <num>
1: 3 1 0.447619 0.4722222 0.8641026 NA 0.6017106
2: 3 2 0.952381 0.5833333 0.9230769 1 4.0674910
3: 5 3 1.000000 0.8492063 1.0000000 0 8.5458257
4: 1 2 0.000000 0.5833333 0.8641026 1 -5.9463949
5: 5 2 0.952381 0.0000000 0.9230769 0 3.6961688
6: 2 3 0.447619 0.5396825 0.9333333 0 0.7937508
7: 4 2 0.952381 0.0000000 0.9230769 0 3.6961688
8: 6 5 1.000000 0.5277778 1.0000000 0 7.9139248
9: 2 1 1.000000 0.8888889 0.9230769 NA 8.6064218
10: 4 3 1.000000 1.0000000 1.0000000 0 15.4915816
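For a pair where every comparison is exactly 0 or 1, the weight can be reproduced by hand as a sum of log-likelihood ratios, which is the usual form in classic probabilistic linkage. The sketch below does this for the pair with .x = 4 and .y = 3, using the probabilities as printed above. It does not cover how reclin2 treats similarity scores between 0 and 1 or missing values.

```r
# M- and U-probabilities as printed by print(m)
mprob <- c(lastname = 0.9990000, firstname = 0.1999999,
  address = 0.8999206, sex = 0.3002011)
uprob <- c(lastname = 0.001152679, firstname = 0.000100000,
  address = 0.285831118, sex = 0.285427112)

# Comparison outcomes for the pair .x = 4, .y = 3 (1 = agree, 0 = disagree)
agree <- c(lastname = 1, firstname = 1, address = 1, sex = 0)

# Per variable: log(m / u) on agreement, log((1 - m) / (1 - u)) on disagreement
w <- ifelse(agree == 1, log(mprob / uprob), log((1 - mprob) / (1 - uprob)))
weight <- sum(w)
print(weight)
```

The result is approximately 15.49, in line with the weight shown for that pair in the output.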
We can then select the pairs with a weight above a threshold.
> pairs <- select_threshold(pairs, "threshold", score = "weights", threshold = 8)
> print(pairs)
Cluster 'default' with size: 2
First data set: 6 records
Second data set: 5 records
Total number of pairs: 17 pairs
Blocking on: 'postcode'
Showing a random selection of pairs:
.x .y lastname firstname address sex weights threshold
<int> <int> <num> <num> <num> <num> <num> <lgcl>
1: 3 3 1.000000 0.4642857 1.0000000 1 7.9350221 FALSE
2: 1 1 1.000000 0.4722222 0.9230769 NA 7.7103862 FALSE
3: 3 2 0.952381 0.5833333 0.9230769 1 4.0674910 FALSE
4: 5 1 0.447619 0.5555556 0.8641026 NA 0.6717426 FALSE
5: 3 1 0.447619 0.4722222 0.8641026 NA 0.6017106 FALSE
6: 4 3 1.000000 1.0000000 1.0000000 0 15.4915816 TRUE
7: 6 4 1.000000 1.0000000 0.6111111 1 14.6796595 TRUE
8: 2 1 1.000000 0.8888889 0.9230769 NA 8.6064218 TRUE
9: 4 2 0.952381 0.0000000 0.9230769 0 3.6961688 FALSE
10: 2 3 0.447619 0.5396825 0.9333333 0 0.7937508 FALSE
This is roughly where we have to stop working with cluster_pairs. The remaining subset of selected pairs should now be small enough that we can comfortably work locally. The most computationally intensive steps have been done. When we are not sure exactly what the threshold should be, we can also work with a more conservative threshold. That should still give us enough of a reduction in pairs that we can work locally. Using cluster_collect we can copy the selected pairs (or all pairs) locally:
> pairs <- select_threshold(pairs, "threshold", score = "weights", threshold = 0)
> local_pairs <- cluster_collect(pairs, "threshold")
> print(local_pairs)
First data set: 6 records
Second data set: 5 records
Total number of pairs: 15 pairs
Blocking on: 'postcode'
.x .y lastname firstname address sex weights threshold
<int> <int> <num> <num> <num> <num> <num> <lgcl>
1: 1 1 1.000000 0.4722222 0.9230769 NA 7.7103862 TRUE
2: 1 3 0.447619 0.4642857 0.9333333 1 0.8042090 TRUE
3: 3 1 0.447619 0.4722222 0.8641026 NA 0.6017106 TRUE
4: 3 2 0.952381 0.5833333 0.9230769 1 4.0674910 TRUE
5: 3 3 1.000000 0.4642857 1.0000000 1 7.9350221 TRUE
6: 5 1 0.447619 0.5555556 0.8641026 NA 0.6717426 TRUE
7: 5 2 0.952381 0.0000000 0.9230769 0 3.6961688 TRUE
8: 5 3 1.000000 0.8492063 1.0000000 0 8.5458257 TRUE
9: 2 1 1.000000 0.8888889 0.9230769 NA 8.6064218 TRUE
10: 2 3 0.447619 0.5396825 0.9333333 0 0.7937508 TRUE
11: 4 1 0.447619 0.6428571 0.8641026 NA 0.7713174 TRUE
12: 4 2 0.952381 0.0000000 0.9230769 0 3.6961688 TRUE
13: 4 3 1.000000 1.0000000 1.0000000 0 15.4915816 TRUE
14: 6 4 1.000000 1.0000000 0.6111111 1 14.6796595 TRUE
15: 6 5 1.000000 0.5277778 1.0000000 0 7.9139248 TRUE
local_pairs is a regular pairs object (and therefore a data.table) which can be operated upon like any other pairs object. cluster_collect also has the option clear which, when TRUE, will delete the pairs on the cluster nodes. After this we can use the code from the introduction vignette:
> local_pairs <- compare_vars(local_pairs, "truth", on_x = "id", on_y = "id")
> local_pairs <- select_n_to_m(local_pairs, "weights", variable = "ntom", threshold = 0)
> table(local_pairs$truth, local_pairs$ntom)
FALSE TRUE
FALSE 11 0
TRUE 0 4
> linked_data_set <- link(local_pairs, selection = "ntom")
> print(linked_data_set)
Total number of pairs: 4 pairs
Key: <.y>
.y .x id.x lastname.x firstname.x address.x sex.x postcode.x .id
<int> <int> <int> <fctr> <fctr> <fctr> <fctr> <fctr> <int>
1: 1 2 2 Smith George 12 Mainstr M 1234 AB 2
2: 2 3 3 Johnson Anna 61 Mainstr F 1234 AB 3
3: 3 4 4 Johnson Charles 61 Mainstr M 1234 AB 4
4: 4 6 6 Schwartz Ben 1 Eaststr M 6789 XY 6
id.y lastname.y firstname.y address.y sex.y postcode.y
<int> <fctr> <fctr> <fctr> <fctr> <fctr>
1: 2 Smith Gearge 12 Mainstreet <NA> 1234 AB
2: 3 Jonson A. 61 Mainstreet F 1234 AB
3: 4 Johnson Charles 61 Mainstr F 1234 AB
4: 6 Schwartz Ben 1 Main M 6789 XY
The cluster_pairs object is a list with two elements:
cluster: a copy of the parallel or snow cluster.
name: the name of the environment on the cluster nodes in which the pairs are stored.
On the cluster nodes there exists an environment (reclin2:::reclin_env). For each set of pairs an environment is created in that environment containing the pairs. To demonstrate, let us get the first pair on each of the nodes:
> clusterCall(pairs$cluster, function(name) {
+ pairs <- reclin2:::reclin_env[[name]]$pairs
+ head(pairs, 1)
+ }, name = pairs$name)
[[1]]
First data set: 3 records
Second data set: 5 records
Total number of pairs: 1 pairs
Blocking on: 'postcode'
.x .y lastname firstname address sex weights threshold
<int> <int> <num> <num> <num> <num> <num> <lgcl>
1: 1 1 1 0.4722222 0.9230769 NA 7.710386 TRUE
[[2]]
First data set: 3 records
Second data set: 5 records
Total number of pairs: 1 pairs
Blocking on: 'postcode'
.x .y lastname firstname address sex weights threshold
<int> <int> <num> <num> <num> <num> <num> <lgcl>
1: 1 1 1 0.8888889 0.9230769 NA 8.606422 TRUE
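The same structure can be inspected directly. The sketch below looks at the two list elements and counts the pairs stored on each node, accessing reclin2:::reclin_env in the same way as the example above. As this relies on package internals, treat it as illustrative only.

```r
library(reclin2)
library(parallel)

data("linkexample1", "linkexample2")
cl <- makeCluster(2)
pairs <- cluster_pair_blocking(cl, linkexample1, linkexample2, "postcode")

# The two elements of the cluster_pairs object
element_names <- names(pairs)
env_name <- pairs$name
n_nodes <- length(pairs$cluster)

# Number of pairs stored on each node under that name
n_per_node <- clusterCall(pairs$cluster, function(name) {
  nrow(reclin2:::reclin_env[[name]]$pairs)
}, name = pairs$name)

stopCluster(cl)
print(element_names)
print(unlist(n_per_node))
```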
cluster_pairs
Regular pairs are also a data.table. Therefore, it is easy to manually create columns, select or aggregate. For cluster_pairs this is more difficult, as the pairs are distributed over the cluster nodes. To help with this, reclin2 has two helper functions: cluster_call and cluster_modify_pairs.
You can pass cluster_call the cluster_pairs object and a function. This function will be called on each cluster node and will be passed the pairs object, the local x and y (in that order). This can be used to modify the pairs, or to calculate statistics from the pairs. The result of the function calls is returned by cluster_call. Therefore, if the sole goal is to modify the pairs, make sure to return NULL (or at least something small). Below we use cluster_call to make a random stratified sample of pairs:
> compare_vars(pairs, "id")
> cluster_call(pairs, function(pairs, ...) {
+ sel1 <- sample(which(pairs$id), 2)
+ sel2 <- sample(which(!pairs$id), 2)
+ pairs[, sample := FALSE]
+ pairs[c(sel1, sel2), sample := TRUE]
+ NULL
+ })
> sample <- cluster_collect(pairs, "sample")
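cluster_call can also be used only to calculate statistics. The following sketch counts the pairs and the local records on each node. It assumes that cluster_call returns the per-node results as a list, one element per node.

```r
library(reclin2)
library(parallel)

data("linkexample1", "linkexample2")
cl <- makeCluster(2)
pairs <- cluster_pair_blocking(cl, linkexample1, linkexample2, "postcode")

# The function receives the local pairs, the local part of x and the copy of y
stats <- cluster_call(pairs, function(pairs, x, y) {
  c(n_pairs = nrow(pairs), n_x = nrow(x), n_y = nrow(y))
})

# Add up the per-node counts; y is counted once per node as it is copied
totals <- Reduce(`+`, stats)

stopCluster(cl)
print(totals)
```

Only three numbers per node travel back to the main session, which keeps the returned result small.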
cluster_modify_pairs is very similar to cluster_call but is mainly meant for modifying the pairs object, although in the previous example we also used cluster_call for that. When the function passed to cluster_modify_pairs returns a data.table, this data.table will overwrite the pairs object. cluster_modify_pairs also accepts a new_name argument. When set, a new pairs object will be created.
Let’s use the sample from above to estimate a model and then use cluster_modify_pairs to add the predictions to the pairs:
> mglm <- glm(id ~ lastname + firstname, data = sample)
> cluster_modify_pairs(pairs, function(pairs, model, ...) {
+ pairs$pmodel <- predict(model, newdata = pairs, type = "response")
+ pairs
+ }, model = mglm)
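The new_name argument can be sketched in the same way. The example below adds a column to a new set of pairs and leaves the original set untouched. The column name name_score is made up, and it is an assumption that cluster_modify_pairs returns a cluster_pairs object referring to the new set, by analogy with the compare_pairs example earlier.

```r
library(reclin2)
library(parallel)

data("linkexample1", "linkexample2")
cl <- makeCluster(2)
pairs <- cluster_pair_blocking(cl, linkexample1, linkexample2, "postcode")
compare_pairs(pairs, on = c("lastname", "firstname"),
  default_comparator = cmp_jarowinkler(0.9))

# Return a modified copy of the pairs; with new_name set it is stored as a new set
pairs3 <- cluster_modify_pairs(pairs, function(pairs, ...) {
  pairs$name_score <- pairs$lastname + pairs$firstname
  pairs
}, new_name = "pairs3")

# Compare the columns of the new and the original set
cols_new <- names(cluster_collect(pairs3))
cols_old <- names(cluster_collect(pairs))

stopCluster(cl)
print(setdiff(cols_new, cols_old))
```

Assigning with pairs$name_score <- ... rather than := is deliberate here: it works on a copy, so the original pairs on the node are not changed by reference.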
And stop the cluster.
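The call itself is not shown here. For a cluster created with makeCluster from parallel, as in this example, it would normally be stopCluster. A self-contained sketch:

```r
library(parallel)

cl <- makeCluster(2)
# ... record linkage on the cluster ...

# Shut down the worker processes; pairs still stored on the nodes are lost
stopCluster(cl)
```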