Package: reclin2 0.5.0

Jan van der Laan

reclin2: Record Linkage Toolkit

Functions to assist in performing probabilistic record linkage and deduplication: generating pairs, comparing records, an EM algorithm for estimating m- and u-probabilities (I. Fellegi & A. Sunter (1969) <doi:10.1080/01621459.1969.10501049>, T.N. Herzog, F.J. Scheuren, & W.E. Winkler (2007), "Data Quality and Record Linkage Techniques", ISBN:978-0-387-69502-0), and forcing one-to-one matching. Can also be used for pre- and post-processing for machine learning methods for record linkage. Focus is on memory, CPU performance and flexibility.
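
The steps named in the description (generating pairs, comparing records, EM estimation, one-to-one matching) could be chained roughly as in the sketch below. It uses only exported functions and the bundled example datasets listed on this page. The column names (`postcode`, `lastname`, `firstname`, `address`, `sex`), the argument names, and the threshold of 0 are assumptions recalled from the package documentation, so check them against the help pages before relying on them.

```r
library(reclin2)

# Bundled example datasets (column names below are assumed, not listed on this page)
data("linkexample1", "linkexample2")

# 1. Generate candidate pairs, blocking on an assumed 'postcode' column
pairs <- pair_blocking(linkexample1, linkexample2, on = "postcode")

# 2. Compare the records of each pair on a set of shared variables
pairs <- compare_pairs(pairs, on = c("lastname", "firstname", "address", "sex"))

# 3. Estimate m- and u-probabilities with the EM algorithm
m <- problink_em(~ lastname + firstname + address + sex, data = pairs)

# 4. Score the pairs; add = TRUE is assumed to append a 'weights' column
pairs <- predict(m, pairs = pairs, add = TRUE)

# 5. Force one-to-one matching among pairs scoring above the (illustrative) threshold
pairs <- select_n_to_m(pairs, variable = "ntom", score = "weights", threshold = 0)

# 6. Build the linked data set from the selected pairs
linked <- link(pairs, selection = "ntom")
print(linked)
```

Blocking keeps the number of candidate pairs small; `pair()` would generate all possible pairs instead, at a higher memory cost.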

Authors: Jan van der Laan [aut, cre]

reclin2_0.5.0.tar.gz
reclin2_0.5.0.zip (r-4.5), reclin2_0.5.0.zip (r-4.4), reclin2_0.5.0.zip (r-4.3)
reclin2_0.5.0.tgz (r-4.4-x86_64), reclin2_0.5.0.tgz (r-4.4-arm64), reclin2_0.5.0.tgz (r-4.3-x86_64), reclin2_0.5.0.tgz (r-4.3-arm64)
reclin2_0.5.0.tar.gz (r-4.5-noble), reclin2_0.5.0.tar.gz (r-4.4-noble)
reclin2_0.5.0.tgz (r-4.4-emscripten), reclin2_0.5.0.tgz (r-4.3-emscripten)
reclin2.pdf | reclin2.html
reclin2/json (API)
NEWS

# Install 'reclin2' in R:
install.packages('reclin2', repos = c('https://djvanderlaan.r-universe.dev', 'https://cloud.r-project.org'))

Peer review:

Bug tracker: https://github.com/djvanderlaan/reclin2/issues

Uses libs:
  • c++ - GNU Standard C++ Library v3
Datasets:
  • linkexample1 - Tiny example dataset for probabilistic linkage
  • linkexample2 - Tiny example dataset for probabilistic linkage
  • town_names - Spelling variations of a set of town names

On CRAN:

7.61 score, 39 stars, 1 package, 87 scripts, 507 downloads, 33 exports, 4 dependencies

Last updated 9 months ago from: 1739f2db25. Checks: OK: 7, NOTE: 2. Indexed: yes.

Target               Result  Date
Doc / Vignettes      OK      Nov 05 2024
R-4.5-win-x86_64     NOTE    Nov 05 2024
R-4.5-linux-x86_64   NOTE    Nov 05 2024
R-4.4-win-x86_64     OK      Nov 05 2024
R-4.4-mac-x86_64     OK      Nov 05 2024
R-4.4-mac-aarch64    OK      Nov 05 2024
R-4.3-win-x86_64     OK      Nov 05 2024
R-4.3-mac-x86_64     OK      Nov 05 2024
R-4.3-mac-aarch64    OK      Nov 05 2024

Exports: add_from_x, add_from_y, cluster_call, cluster_collect, cluster_modify_pairs, cluster_pair, cluster_pair_blocking, cluster_pair_minsim, cmp_identical, cmp_jaccard, cmp_jarowinkler, cmp_lcs, compare_pairs, compare_vars, deduplicate_equivalence, get_inspect_pairs, greedy, jaccard, jaro_winkler, lcs, link, match_n_to_m, merge_pairs, pair, pair_blocking, pair_minsim, problink_em, score_simple, select_greedy, select_n_to_m, select_threshold, select_unique, tabulate_patterns

Dependencies: data.table, lpSolve, Rcpp, stringdist

Deduplication using reclin2

Rendered from deduplication.md using simplermarkdown::mdweave_to_html on Nov 05 2024.

Last update: 2023-07-06
Started: 2021-11-08

Introduction to reclin2

Rendered from introduction.md using simplermarkdown::mdweave_to_html on Nov 05 2024.

Last update: 2023-08-25
Started: 2021-12-19

Record linkage using machine learning

Rendered from record_linkage_using_machine_learning.md using simplermarkdown::mdweave_to_html on Nov 05 2024.

Last update: 2023-07-06
Started: 2021-11-09

Using a cluster for record linkage

Rendered from using_a_cluster_for_record_linkage.md using simplermarkdown::mdweave_to_html on Nov 05 2024.

Last update: 2023-07-06
Started: 2022-01-05

Readme and manuals

Help Manual

Help page: Topics
Add a variable from one of the data sets to pairs: add_from_x, add_from_y
Call a function on each of the worker nodes and pass it the pairs: cluster_call
Collect pairs from cluster nodes: cluster_collect
Call a function on each of the worker nodes to modify the pairs on the node: cluster_modify_pairs
Generate all possible pairs using multiple processes: cluster_pair
Generate pairs using simple blocking using multiple processes: cluster_pair_blocking
Generate pairs with a minimal similarity using multiple processes: cluster_pair_minsim
Comparison functions: cmp_identical, cmp_jaccard, cmp_jarowinkler, cmp_lcs, jaccard, jaro_winkler, lcs
Compare pairs on a set of variables common in both data sets: compare_pairs, compare_pairs.cluster_pairs, compare_pairs.pairs
Compare pairs on given variables: compare_vars, compare_vars.cluster_pairs, compare_vars.pairs
Deduplication using equivalence groups: deduplicate_equivalence
Get a subset of pairs to inspect: get_inspect_pairs
Greedy one-to-one matching of pairs: greedy
Use the selected pairs to generate a linked data set: link
Tiny example dataset for probabilistic linkage: linkexample1, linkexample2
Force n to m matching on a set of pairs: match_n_to_m
Merge two sets of pairs into one: merge_pairs, merge_pairs.cluster_pairs, merge_pairs.pairs, rbind.cluster_pairs, rbind.pairs
Generate all possible pairs: pair
Generate pairs using simple blocking: pair_blocking
Generate pairs with a minimal similarity: pair_minsim
Calculate weights and probabilities for pairs: predict.problink_em
Calculate EM-estimates of m- and u-probabilities: problink_em
Score pairs based on a number of comparison vectors: score_simple, score_simple.cluster_pairs, score_simple.pairs
Select matching pairs enforcing one-to-one linkage: select_greedy, select_greedy.cluster_pairs, select_greedy.pairs, select_n_to_m, select_n_to_m.cluster_pairs, select_n_to_m.pairs
Select matching pairs with a score above or equal to a threshold: select_threshold, select_threshold.cluster_pairs, select_threshold.pairs
Deselect pairs that are linked to multiple records: select_unique, select_unique.cluster_pairs, select_unique.pairs
Summarise the results from 'problink_em': summary.problink_em
Create a table of comparison patterns: tabulate_patterns, tabulate_patterns.cluster_pairs, tabulate_patterns.pairs
Spelling variations of a set of town names: town_names