
olehmberg / winter

License: Apache-2.0
WInte.r is a Java framework for end-to-end data integration. The WInte.r framework implements well-known methods for data pre-processing, schema matching, identity resolution, data fusion, and result evaluation.

Programming Languages

java
68154 projects - #9 most used programming language

Projects that are alternatives of or similar to winter

SchemaMapper
A .NET class library that allows you to import data from different sources into a unified destination
Stars: ✭ 41 (-59.41%)
Mutual labels:  tabular-data, data-integration, schema-matching
CommonCoreOntologies
The Common Core Ontology Repository holds the current released version of the Common Core Ontology suite.
Stars: ✭ 109 (+7.92%)
Mutual labels:  data-integration
Csvreader
csvreader library / gem - read tabular data in the comma-separated values (csv) format the right way (uses best practices out-of-the-box with zero-configuration)
Stars: ✭ 169 (+67.33%)
Mutual labels:  tabular-data
rosette-elasticsearch-plugin
Document Enrichment plugin for Elasticsearch
Stars: ✭ 25 (-75.25%)
Mutual labels:  identity-resolution
Mirador
Tool for visual exploration of complex data.
Stars: ✭ 186 (+84.16%)
Mutual labels:  tabular-data
nomenklatura
Framework and command-line tools for integrating FollowTheMoney data streams from multiple sources
Stars: ✭ 158 (+56.44%)
Mutual labels:  data-integration
Tui.grid
🍞🔡 The Powerful Component to Display and Edit Data. Experience the Ultimate Data Transformer!
Stars: ✭ 1,859 (+1740.59%)
Mutual labels:  tabular-data
cgpm
Library of composable generative population models which serve as the modeling and inference backend of BayesDB.
Stars: ✭ 24 (-76.24%)
Mutual labels:  tabular-data
datapackage-m
Power Query M functions for working with Tabular Data Packages (Frictionless Data) in Power BI and Excel
Stars: ✭ 26 (-74.26%)
Mutual labels:  tabular-data
zingg
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
Stars: ✭ 655 (+548.51%)
Mutual labels:  identity-resolution
Miller
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
Stars: ✭ 4,633 (+4487.13%)
Mutual labels:  tabular-data
Tad
A desktop application for viewing and analyzing tabular data
Stars: ✭ 2,275 (+2152.48%)
Mutual labels:  tabular-data
AI4Water
framework for developing machine (and deep) learning models for structured data
Stars: ✭ 35 (-65.35%)
Mutual labels:  tabular-data
Tgan
Generative adversarial training for generating synthetic tabular data.
Stars: ✭ 173 (+71.29%)
Mutual labels:  tabular-data
valentine
A tool facilitating matching for any dataset discovery method. Also, an extensible experiment suite for state-of-the-art schema matching methods.
Stars: ✭ 43 (-57.43%)
Mutual labels:  schema-matching
Tableprint
Pretty console printing 📋 of tabular data in python 🐍
Stars: ✭ 153 (+51.49%)
Mutual labels:  tabular-data
Npm Pdfreader
🚜 Read text and parse tables from PDF files.
Stars: ✭ 225 (+122.77%)
Mutual labels:  tabular-data
R-Learning-Journey
Some of the projects I made when starting to learn R for Data Science at the university
Stars: ✭ 19 (-81.19%)
Mutual labels:  data-integration
Machine-Learning-Roadmap
A roadmap for getting started with Machine Learning
Stars: ✭ 79 (-21.78%)
Mutual labels:  tabular-data
cosmosR
COSMOS (Causal Oriented Search of Multi-Omic Space) is a method that integrates phosphoproteomics, transcriptomics, and metabolomics data sets.
Stars: ✭ 30 (-70.3%)
Mutual labels:  data-integration

Web Data INTEgRation Framework (WInte.r)

The WInte.r framework [5] provides methods for end-to-end data integration. The framework implements well-known methods for data pre-processing, schema matching, identity resolution, data fusion, and result evaluation. The methods are designed to be easily customizable by exchanging pre-defined building blocks, such as blockers, matching rules, similarity functions, and conflict resolution functions. In addition, these pre-defined building blocks can be used as a foundation for implementing advanced integration methods.

Contents

Quick Start: The section below provides an overview of the functionality of the WInte.r framework. To get acquainted with the framework, you can also read the WInte.r Tutorial or look at the code examples in our Wiki.

Using WInte.r

You can include the WInte.r framework via the following Maven dependency:

<dependency>
    <groupId>de.uni-mannheim.informatik.dws.winter</groupId>
    <artifactId>winter-framework</artifactId>
    <version>1.4.1</version>
</dependency>

Functionality

The WInte.r framework covers all central steps of the data integration process, including data loading, pre-processing, schema matching, identity resolution, as well as data fusion. This section gives an overview of the functionality and the alternative algorithms that are provided for each of these steps.

Data Integration Process Example

Data Loading: WInte.r provides readers for standard data formats such as CSV, XML and JSON. In addition, WInte.r offers a specialized JSON format for representing tabular data from the Web together with meta-information about the origin and context of the data, as used by the Web Data Commons (WDC) Web Tables Corpora.

Pre-processing: During pre-processing you prepare your data for the methods that you are going to apply later on in the integration process. WInte.r WebTables provides you with specialized pre-processing methods for tabular data, such as:

  • Data type detection
  • Unit of measurement normalization
  • Header detection
  • Subject column detection (also known as entity name column detection)
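
As an illustration of the first item, data type detection can be sketched as a majority vote over the cell values of a column. The sketch below is plain Java and does not use the WInte.r API; the class `ColumnTypeSketch`, its enum, and its method names are hypothetical, and only numbers and ISO-8601 dates are recognized.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of the WInte.r API. */
final class ColumnTypeSketch {

    enum ColumnType { NUMERIC, DATE, STRING }

    /** Guesses the type of a single non-empty cell value. */
    static ColumnType guessValueType(String value) {
        String v = value.trim();
        try {
            Double.parseDouble(v);
            return ColumnType.NUMERIC;
        } catch (NumberFormatException e) {
            // not a number, try the next type
        }
        try {
            LocalDate.parse(v); // accepts ISO-8601 dates only, e.g. 2017-10-21
            return ColumnType.DATE;
        } catch (DateTimeParseException e) {
            return ColumnType.STRING;
        }
    }

    /** Returns the most frequent value type among the non-empty cells of a column. */
    static ColumnType detectColumnType(List<String> cells) {
        Map<ColumnType, Integer> votes = new EnumMap<>(ColumnType.class);
        for (String cell : cells) {
            if (cell == null || cell.trim().isEmpty()) {
                continue; // empty cells do not vote
            }
            votes.merge(guessValueType(cell), 1, Integer::sum);
        }
        ColumnType best = ColumnType.STRING; // default for columns without values
        int bestCount = 0;
        for (Map.Entry<ColumnType, Integer> e : votes.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(detectColumnType(List.of("1994", "2001", "n/a")));
        System.out.println(detectColumnType(List.of("2017-10-21", "2016-04-11")));
    }
}
```

Voting per column rather than per cell keeps a few dirty values such as "n/a" from changing the detected type.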

Schema Matching: Schema matching methods find attributes in two schemata that have the same meaning. WInte.r provides three pre-implemented schema matching algorithms which either rely on attribute labels or data values, or exploit an existing mapping of records (duplicate-based schema matching) in order to find attribute correspondences.

  • Label-based schema matching
  • Instance-based schema matching
  • Duplicate-based schema matching
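
To make the label-based variant concrete, the following sketch compares attribute labels with a normalized Levenshtein similarity and keeps the best match above a threshold. It is a minimal, self-contained illustration, not the WInte.r implementation; `LabelMatcherSketch`, its methods, and the threshold of 0.5 are assumptions chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of the WInte.r API. */
final class LabelMatcherSketch {

    /** Classic dynamic-programming Levenshtein edit distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]; 1 means identical labels, ignoring case and outer whitespace. */
    static double labelSimilarity(String a, String b) {
        String x = a.trim().toLowerCase(Locale.ROOT);
        String y = b.trim().toLowerCase(Locale.ROOT);
        int maxLen = Math.max(x.length(), y.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) levenshtein(x, y) / maxLen;
    }

    /** Maps each attribute of schemaA to its most similar attribute in schemaB, if the similarity reaches the threshold. */
    static Map<String, String> match(List<String> schemaA, List<String> schemaB, double threshold) {
        Map<String, String> correspondences = new LinkedHashMap<>();
        for (String a : schemaA) {
            String best = null;
            double bestSim = threshold;
            for (String b : schemaB) {
                double sim = labelSimilarity(a, b);
                if (sim >= bestSim) {
                    best = b;
                    bestSim = sim;
                }
            }
            if (best != null) {
                correspondences.put(a, best);
            }
        }
        return correspondences;
    }

    public static void main(String[] args) {
        System.out.println(match(List.of("Title", "Director", "Year"),
                List.of("title", "director_name", "release date"), 0.5));
    }
}
```

In this toy input, "Year" and "release date" are not matched because their labels share almost no characters. Cases like this are where instance-based or duplicate-based matching, which look at values instead of labels, become useful.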

Identity Resolution: Identity resolution methods (also known as data matching or record linkage methods) identify records that describe the same real-world entity. The pre-implemented identity resolution methods can be applied to a single dataset for duplicate detection or to multiple datasets in order to find record-level correspondences. Besides manually defining identity resolution methods, WInte.r also allows you to learn matching rules from known correspondences. Identity resolution methods rely on blocking (also called indexing) in order to reduce the number of record comparisons. WInte.r provides the following pre-implemented blocking and identity resolution methods:

  • Blocking by single/multiple blocking key(s)
  • Sorted-Neighbourhood Method
  • Token-based identity resolution
  • Rule-based identity resolution
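
The interplay of blocking and a rule-based matcher can be sketched as follows: records are grouped by a blocking key, and only records that share a key are compared with a weighted similarity rule. This is a simplified illustration in plain Java, not the WInte.r API; the `Movie` class, the blocking key (first letter of the title), the weights, and the threshold are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of the WInte.r API. */
final class BlockingSketch {

    static final class Movie {
        final String id;
        final String title;
        final int year;

        Movie(String id, String title, int year) {
            this.id = id;
            this.title = title;
            this.year = year;
        }
    }

    /** Blocking key: the first character of the lower-cased, trimmed title. */
    static String blockingKey(Movie m) {
        String t = m.title.trim().toLowerCase(Locale.ROOT);
        return t.isEmpty() ? "" : t.substring(0, 1);
    }

    static Set<String> tokens(String s) {
        return new HashSet<>(Arrays.asList(s.trim().toLowerCase(Locale.ROOT).split("\\s+")));
    }

    /** Jaccard similarity of the whitespace-separated title tokens. */
    static double jaccard(String a, String b) {
        Set<String> x = tokens(a);
        Set<String> y = tokens(b);
        Set<String> intersection = new HashSet<>(x);
        intersection.retainAll(y);
        Set<String> union = new HashSet<>(x);
        union.addAll(y);
        return (double) intersection.size() / union.size();
    }

    /** Matching rule: weighted sum of title similarity and exact year agreement. */
    static double similarity(Movie a, Movie b) {
        double yearSim = a.year == b.year ? 1.0 : 0.0;
        return 0.7 * jaccard(a.title, b.title) + 0.3 * yearSim;
    }

    /** Compares only records that share a blocking key and returns the matching id pairs. */
    static List<String[]> match(List<Movie> left, List<Movie> right, double threshold) {
        Map<String, List<Movie>> blocks = new HashMap<>();
        for (Movie r : right) {
            blocks.computeIfAbsent(blockingKey(r), k -> new ArrayList<>()).add(r);
        }
        List<String[]> result = new ArrayList<>();
        for (Movie l : left) {
            for (Movie r : blocks.getOrDefault(blockingKey(l), List.of())) {
                if (similarity(l, r) >= threshold) {
                    result.add(new String[] { l.id, r.id });
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Movie> left = List.of(new Movie("a1", "Pulp Fiction", 1994), new Movie("a2", "Heat", 1995));
        List<Movie> right = List.of(new Movie("b1", "pulp fiction", 1994),
                new Movie("b2", "Heat", 1986), new Movie("b3", "Fargo", 1996));
        for (String[] pair : match(left, right, 0.8)) {
            System.out.println(pair[0] + " <-> " + pair[1]);
        }
    }
}
```

With this toy data, "Fargo" is never compared to any record on the left because it falls into a different block, and the two "Heat" records are compared but rejected by the rule because their years differ.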

Data Fusion: Data fusion methods combine data from multiple sources into a single, consolidated dataset. For this, they rely on the schema- and record-level correspondences that were discovered in the previous steps of the integration process. However, different sources may provide conflicting data values. WInte.r allows you to resolve such data conflicts (decide which value to include in the final dataset) by applying different conflict resolution functions.

  • 11 pre-defined conflict resolution functions for strings, numbers, and lists of values, as well as data-type-independent functions.
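
To illustrate what a conflict resolution function does, the sketch below shows three common strategies: voting (independent of the data type), longest string, and average. These are generic examples written in plain Java with hypothetical names (`FusionSketch`, `vote`, `longest`, `average`); they are not taken from the WInte.r code base and make no claim about which functions the framework ships.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of the WInte.r API. */
final class FusionSketch {

    /** Voting: the value provided by the most sources wins; ties go to the value seen first. */
    static <T> T vote(List<T> values) {
        Map<T, Integer> counts = new LinkedHashMap<>();
        for (T v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        T best = null;
        int bestCount = 0;
        for (Map.Entry<T, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best; // null if no source provided a value
    }

    /** Longest string: prefers the most detailed value; ties go to the value seen first. */
    static String longest(List<String> values) {
        String best = null;
        for (String v : values) {
            if (best == null || v.length() > best.length()) {
                best = v;
            }
        }
        return best;
    }

    /** Average: fuses conflicting numbers into their arithmetic mean (NaN for no values). */
    static double average(List<Double> values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return values.isEmpty() ? Double.NaN : sum / values.size();
    }

    public static void main(String[] args) {
        // Toy values standing in for the same attribute reported by different sources.
        System.out.println(vote(List.of("Tarantino", "Q. Tarantino", "Tarantino")));
        System.out.println(longest(List.of("Pulp Fiction", "Pulp Fiction (1994)")));
        System.out.println(average(List.of(120.0, 124.0)));
    }
}
```

Which strategy fits depends on the attribute: voting suits categorical values, longest string suits names and titles where sources differ in detail, and averaging suits noisy numeric measurements.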

Use cases

WInte.r can be used out-of-the-box to integrate data from multiple data sources. The framework can also be used as a foundation for implementing more advanced, use-case-specific integration methods. In the following, we provide an example use case from each category.

Integration of Multiple Data Sources: Building a Movie Dataset

The WInte.r framework is used to integrate data from multiple sources within the Web Data Integration course offered by Professor Bizer at the University of Mannheim. The basic case study in this course is the integration of movie data from multiple Web data sources. In addition, student teams use the WInte.r framework to integrate data about different topics as part of the projects that they conduct during the course.

Integration of Large Numbers of Data Sources: Augmenting the DBpedia Knowledge base with Web Table Data

Many web sites provide data in the form of HTML tables. Millions of such data tables have been extracted from the CommonCrawl web corpus by the Web Data Commons project [3]. Data from these tables can be used to fill missing values in large cross-domain knowledge bases such as DBpedia [2]. An example of how pre-defined building blocks from the WInte.r framework are combined into an advanced, use-case-specific integration method is the T2K Match algorithm [1]. The algorithm is optimized to match millions of Web tables against a central knowledge base describing millions of instances belonging to hundreds of different classes (such as people or locations) [2]. The full source code of the algorithm, which includes advanced matching methods that combine schema matching and identity resolution, is available in the WInte.r T2K Match project.

Pre-processing for large-scale Matching: Stitching Web Tables for Improving Matching Quality

Tables on web pages ("web tables") cover a diversity of topics and can be a source of information for different tasks such as knowledge base augmentation or the ad-hoc extension of datasets. However, to use this information, the tables must first be integrated, either with each other or into existing data sources. The challenges that matching methods for this purpose have to overcome are the high heterogeneity and the small size of the tables. To counter these problems, web tables from the same web site can be stitched before running any of the existing matching systems. This means that web tables are combined based on a schema mapping, which results in fewer and larger stitched tables [4]. The source code of the stitching method is available in the Web Tables Stitching project.
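
The basic idea of stitching can be sketched for the simplest case, in which the "schema mapping" is just header equality after normalization: tables from the same web site with the same header are combined into one larger table. This is a deliberately reduced illustration and not the method from [4], which derives the mapping with schema matching; `StitchingSketch`, `Table`, and the toy rows are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper for illustration only; not the Web Tables Stitching implementation. */
final class StitchingSketch {

    static final class Table {
        final List<String> header;
        final List<List<String>> rows;

        Table(List<String> header, List<List<String>> rows) {
            this.header = header;
            this.rows = rows;
        }
    }

    /** Normalizes a header so that case and surrounding whitespace do not matter. */
    static List<String> normalizeHeader(List<String> header) {
        return header.stream()
                .map(h -> h.trim().toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    /** Combines all tables with the same normalized header (same columns, same order) into one table. */
    static List<Table> stitch(List<Table> tables) {
        Map<List<String>, Table> stitched = new LinkedHashMap<>();
        for (Table t : tables) {
            Table target = stitched.computeIfAbsent(
                    normalizeHeader(t.header), h -> new Table(h, new ArrayList<>()));
            target.rows.addAll(t.rows);
        }
        return new ArrayList<>(stitched.values());
    }

    public static void main(String[] args) {
        // Toy tables standing in for small web tables from one web site.
        Table t1 = new Table(List.of("Name", "Country"), List.of(List.of("city A", "country X")));
        Table t2 = new Table(List.of(" name", "country "),
                List.of(List.of("city B", "country X"), List.of("city C", "country Y")));
        Table t3 = new Table(List.of("Title", "Year"), List.of(List.of("movie A", "1994")));
        for (Table t : stitch(List.of(t1, t2, t3))) {
            System.out.println(t.header + " -> " + t.rows.size() + " rows");
        }
    }
}
```

Here three small tables become two larger ones, which gives a downstream matcher more rows per table to work with.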

Data Search for Data Mining (DS4DM)

Analysts increasingly have the problem that they know that some data which they need for a project is available somewhere on the Web or in the corporate intranet, but they are unable to find the data. The goal of the 'Data Search for Data Mining' (DS4DM) project is to extend the data mining platform RapidMiner with data search and data integration functionalities which enable analysts to find relevant data in potentially very large data corpora, and to semi-automatically integrate the discovered data with existing local data.

Contact

If you have any questions, please refer to the WInte.r Tutorial, Wiki, and the JavaDoc first. For further information contact alex [dot] brinkmann [at] informatik [dot] uni-mannheim [dot] de

License

The WInte.r framework can be used under the Apache 2.0 License.

If you use the WInte.r framework in any publication, please cite [5].

Acknowledgements

WInte.r is developed at the Data and Web Science Group at the University of Mannheim.

References

[1] Ritze, D., Lehmberg, O., & Bizer, C. (2015, July). Matching HTML tables to DBpedia. In Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics (p. 10). ACM.

[2] Ritze, D., Lehmberg, O., Oulabi, Y., & Bizer, C. (2016, April). Profiling the potential of web tables for augmenting cross-domain knowledge bases. In Proceedings of the 25th International Conference on World Wide Web (pp. 251-261). International World Wide Web Conferences Steering Committee.

[3] Lehmberg, O., Ritze, D., Meusel, R., & Bizer, C. (2016, April). A large public corpus of web tables containing time and context metadata. In Proceedings of the 25th International Conference Companion on World Wide Web (pp. 75-76). International World Wide Web Conferences Steering Committee.

[4] Lehmberg, O., & Bizer, C. (2017). Stitching web tables for improving matching quality. Proceedings of the VLDB Endowment, 10(11), 1502-1513.

[5] Lehmberg, O., Brinkmann, A., & Bizer, C. WInte.r - A Web Data Integration Framework. ISWC 2017.
