dblink: Distributed End-to-End Bayesian Entity Resolution

dblink is a Spark package for performing unsupervised entity resolution (ER) on structured data. It is based on a Bayesian model called blink (Steorts, 2015), with extensions proposed by Marchant et al. (2019). Unlike many ER algorithms, dblink approximates the full posterior distribution over clusterings of records (into entities). This makes it possible to propagate uncertainty to post-ER analysis, and provides a framework for answering probabilistic queries about entity membership.

dblink approximates the posterior using Markov chain Monte Carlo. It writes samples (of clustering configurations) to disk in Parquet format. Diagnostic summary statistics are also written to disk in CSV format; these are useful for assessing convergence of the Markov chain.
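As a rough illustration of how the CSV diagnostics might be used, the sketch below compares the mean of one numeric column over the first and second halves of the chain. The helper name half_means is hypothetical and not part of dblink, and the file and column names are assumptions: check the header of the diagnostics file in your own output directory. A large gap between the two means suggests the chain has not settled yet. This is only a crude check, not a formal convergence test.

```shell
# Hypothetical helper (not part of dblink): compare the mean of one numeric
# column over the first and second halves of a diagnostics CSV.
# Usage: half_means FILE COLUMN_NAME
# Assumes a header row and plain comma-separated fields (no quoted commas).
half_means() {
  awk -F',' -v col="$2" '
    NR == 1 {                       # header row: locate the requested column
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
      next
    }
    { v[++n] = $idx }               # store the value from each sample row
    END {
      if (n < 2) exit 1             # also reached when the column is missing
      h = int(n / 2)
      for (i = 1; i <= h; i++) a += v[i]
      for (i = h + 1; i <= n; i++) b += v[i]
      printf "first_half_mean=%.4f second_half_mean=%.4f\n", a / h, b / (n - h)
    }
  ' "$1"
}
```

Call it as half_means FILE COLUMN, substituting the path of the diagnostics CSV and one of its column names.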

Documentation

The step-by-step guide includes information about building dblink from source and running it locally on a test data set. Further details about the configuration options for dblink are provided here.

Example: RLdata

Two synthetic data sets, RLdata500 and RLdata10000, are included in the examples directory as CSV files. These data sets were extracted from the RecordLinkage R package and have been used as benchmark data sets in the entity resolution literature. Both contain 10 percent duplicates and are non-trivial to link due to added distortion. Because unique IDs are provided in the files, standard entity resolution metrics can be computed. Config files for these data sets are included in the examples directory: see RLdata500.conf and RLdata10000.conf.

To run these examples locally (in Spark pseudocluster mode), ensure you've built or obtained the JAR according to the instructions above, then change into the source code directory and run the following command:

$SPARK_HOME/bin/spark-submit \
  --master "local[*]" \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --conf "spark.driver.extraClassPath=./target/scala-2.11/dblink-assembly-0.2.0.jar" \
  ./target/scala-2.11/dblink-assembly-0.2.0.jar \
  ./examples/RLdata500.conf

(To run with RLdata10000 instead, replace RLdata500.conf with RLdata10000.conf.) Note that the config file specifies that output will be saved in the ./examples/RLdata500_results/ (or ./examples/RLdata10000_results/) directory.
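If you switch between the two data sets often, the command above can be wrapped in a small shell function. The following is a minimal sketch, not something shipped with dblink: the function name run_dblink_example and the DRY_RUN variable are made up for illustration, while the paths and spark-submit options are copied from the command above.

```shell
# Hypothetical helper (not part of dblink): run an RLdata example by name.
# Set DRY_RUN=1 to print the spark-submit command instead of executing it.
# The printed form is for inspection only; shell quoting is not preserved.
run_dblink_example() {
  dataset="${1:-RLdata500}"
  jar="./target/scala-2.11/dblink-assembly-0.2.0.jar"
  conf="./examples/${dataset}.conf"

  # Only the two bundled example data sets are accepted.
  case "$dataset" in
    RLdata500|RLdata10000) ;;
    *) echo "unknown data set: $dataset" >&2; return 2 ;;
  esac

  # Build the argument list once so the dry run and the real run match.
  set -- "${SPARK_HOME}/bin/spark-submit" \
    --master "local[*]" \
    --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
    --conf "spark.driver.extraClassPath=${jar}" \
    "$jar" "$conf"

  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
    return 0
  fi
  if [ ! -f "$jar" ]; then
    echo "JAR not found at $jar; build or obtain it first" >&2
    return 1
  fi
  "$@"
}
```

From the source code directory, run_dblink_example RLdata10000 would then submit the larger example, and DRY_RUN=1 run_dblink_example RLdata500 shows the command without starting Spark.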

How to: Add dblink as a project dependency

Note: This won't work yet; the project is still waiting to be accepted.

Maven:

<dependency>
  <groupId>com.github.cleanzr</groupId>
  <artifactId>dblink</artifactId>
  <version>0.2.0</version>
</dependency>

sbt:

libraryDependencies += "com.github.cleanzr" % "dblink" % "0.2.0"

How to: Build a fat JAR

You can build a fat JAR using sbt by running the following command from within the project directory:

$ sbt assembly

This should output a JAR file at ./target/scala-2.11/dblink-assembly-0.2.0.jar relative to the project directory. Note that the JAR file does not bundle Spark or Hadoop, but it does include all other dependencies.

Contact

If you encounter problems, please open an issue on GitHub. You can also contact the main developer by email: <GitHub username> <at> gmail.com.

License

GPL-3

Citing the package

Marchant, N. G., Steorts, R. C., Kaplan, A., Rubinstein, B. I. P., Elazar, D. N. (2019). dblink: Distributed End-to-End Bayesian Entity Resolution. arXiv:1909.06039. URL: https://arxiv.org/abs/1909.06039