NEWS
reclin2 0.5.0 (2024-02-09)
BUG FIXES
- Severe bug: 'greedy' and 'select_greedy' can give incorrect results. There
was a bug in 'greedy' (also affecting 'select_greedy') where keeping track
of the number of times records from the second data set are linked had an
error. This has been fixed.
reclin2 0.4.0 (2024-02-01)
NEW FEATURES
- Additional arguments to 'compare_vars' are passed on to the comparison
function.
- Added 'inplace' argument to 'predict'.
- The 'greedy' function is now exported.
- The 'identical' function has been removed. It was already deprecated. The
identical 'cmp_identical' can be used instead.
BUG FIXES
- 'identical' called itself resulting in a stack overflow. This has been
fixed.
- In extreme cases where the one of the m-probabilities converged to zero
‘problink_em' gave an error. The algorithm will now converge to a small
value and give a warning (as this usually means that the algorithm didn’t
converge to a valid solution.
reclin2 0.3.2
NEW FEATURES
- The 'select_unique' function has been added. This function filters pairs and
removes pairs that have been matched more than once. For pairs that are
matched more than once (with a high enough link quality) it is not possible
to decide which link is the correct one. In some linkage scenarios (with a
focus on reducing false matches) it is then preferable to remove both.
- The 'score_simple' function has been added which calculates a weighted sum
of the comparison vectors. Different weights are allowed for agreement,
non-agreement and missing values for each of the variables in the comparison
vector.
- The 'merge_pairs' function can combine different sets of pairs for the same
two data sets. For example, combine two sets of pairs generated with
different blocking variables.
- The functions 'identical', 'jaro_winkler', 'lcs' and 'jaccard' are
deprecated and will be removed in future versions of the package. Instead
use the functions 'cmp_identical', 'cmp_jarowinkler', 'cmp_lcs' and
'cmp_jaccard'. The function 'identical' caused a conflict with a function in
base R. Also this, hopefully, stresses that these functions return a
function.
- 'select_greedy' has 'n' and 'm' arguments that allow the user to specify
what type of linkage is allows (1-to-1, n-to-1, 1-to-m, n-to-m).
- 'select_greedy' has an 'include_ties' option that will include multiple pairs
for a record when these pairs have an equal weight/score.
- 'pair_minsim' and 'cluster_pair_minsim' now have an 'on_blocking' option
that functions similar to the 'on' argument of 'pair_blocking': pairs are
only generated when the agree exactly on the blocking keys.
- When selecting pairs using a threshold. Pairs are now selected when their
score is above or *equal to* the threshold. This has an effect for all of
the 'select_' functions.
BUG FIXES
- 'select_greedy' gave an error when the set of pairs is empty.
- The 'by_x' and 'by_y' arguments were not handled correctly. Fixed.
- The result from 'cluster_collect' can now be used without introducing
a warning from 'data.table'.
reclin2 0.2.0 (2022-09-24)
NEW FEATURES
- 'merge_pairs' combines two sets of pairs into one (like 'rbind'). Especially
useful in combination with 'pair_blocking' to generate pairs that are blocked
on multiple keys (e.g. they have match on either 'postcode' or 'lastname').
BUG FIXES
- 'pairs_minsim' gave an error when the number of records squared is larger
the maximum integer value.
- 'pairs_minsim' did not generate pairs when one of the keys contained
missing values.