Data indexing, data search and data integration

One of the ambitions of the DS4SM project is to design an automatic data integration process that can cope with the unmanageably large number of data sources that have become available in recent years, both on the web and within a company’s intranet. A lot of data is already available in tabular format, but the sheer number of data sources makes it increasingly difficult for the user to gain a comprehensive overview of the data available. What is the table about? What are its attributes? Are there similar tables available? Can I expand my table with additional attributes?

We aim to assist users in finding appropriate records for their task. The project will improve on the state of the art by developing a data search method that does not require the attributes being sought to be known in advance, but instead makes it possible to search for attributes based on their correlation with existing local attributes. This method will be integrated into RapidMiner to support the iterative extension of data tables.

The important steps towards this goal are the design and implementation of methods for

data indexing
data search and
the backend services for data integration.

Our work on data indexing and search will be based on the prototype developed at the University of Mannheim [Lehmberg 2014]. We investigate how to exploit correspondences at the schema and instance level to correctly identify the range of possible meanings of a request and to deliver large numbers of relevant, integrable tables.

The first important step is data indexing. To index records correctly, it is important to identify reliable pseudo-key attributes (subject attributes) and to recognize complex table header structures. Furthermore, it is important to normalize data points (e.g. units of measurement and timestamps) so that fewer conflicts have to be resolved later, in the data fusion phase.
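The two indexing concerns above can be illustrated with a minimal Python sketch. It is not the project's implementation: the conversion table, the list of date formats and the uniqueness heuristic for guessing the subject attribute are all assumptions made for illustration, and the function names are hypothetical.

```python
import re
from datetime import datetime

# Hypothetical conversion factors to a base unit (metres); illustration only.
TO_METRES = {"m": 1.0, "km": 1000.0, "cm": 0.01}

# Assumed set of date layouts that may occur in source tables.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")


def normalize_length(raw):
    """Parse a cell such as '1.5 km' and return the value in metres, or None."""
    match = re.fullmatch(r"\s*([0-9]+(?:\.[0-9]+)?)\s*([a-zA-Z]+)\s*", raw)
    if not match or match.group(2).lower() not in TO_METRES:
        return None
    return float(match.group(1)) * TO_METRES[match.group(2).lower()]


def normalize_date(raw):
    """Return the date in ISO 8601 form, trying each assumed layout in turn."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def guess_subject_column(table):
    """Pick the column with the highest share of distinct, non-numeric values.

    `table` maps column names to lists of cell strings. This is a simple
    stand-in heuristic for pseudo-key detection, not the project's method.
    """
    def score(values):
        textual = [v for v in values if v and not re.fullmatch(r"[-+]?[0-9.,]+", v)]
        if not textual:
            return 0.0
        return len(set(textual)) / len(values)

    return max(table, key=lambda name: score(table[name]))
```

Normalizing at indexing time means that values such as "1.5 km" and "1500 m" already agree when tables are later fused, so fewer apparent conflicts have to be resolved.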

The Data Search process will support:

Keyword search
Entity search (with input provided as a list of entities plus the desired attributes to describe them)
Table extension search (with input provided as an entire table plus the desired attributes for the extension of the table itself)
Correlation-based search (with input provided as an entire table plus a specification of the target attributes; novel extension attributes are suggested based on correlation measures)
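As a rough illustration of the last search mode, the following Python sketch ranks candidate extension attributes by the absolute Pearson correlation with a local target attribute. The choice of Pearson correlation, the data layout (dictionaries keyed by entity) and the `min_overlap` threshold are assumptions for this example; the project may use other correlation measures.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(var_x * var_y)


def suggest_attributes(target, candidates, min_overlap=3):
    """Rank candidate attributes by |correlation| with the target attribute.

    `target` maps entity -> numeric value of the local target attribute.
    `candidates` maps attribute name -> {entity -> numeric value}, as might
    be collected from retrieved tables. Attributes that share fewer than
    `min_overlap` entities with the local table are skipped.
    """
    ranked = []
    for name, values in candidates.items():
        shared = sorted(set(target) & set(values))
        if len(shared) < min_overlap:
            continue
        r = pearson([target[e] for e in shared], [values[e] for e in shared])
        ranked.append((name, abs(r)))
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

With such a ranking, the user does not need to know the name of a useful attribute in advance; the highest-ranked candidates can be proposed as extension columns.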

In December we started working on the design of the general architecture and on an initial implementation of a backend service, purely for the purpose of use case exploration. The first simple use case consists of a small table for which the type of the described entities is known, the subject column is known and the extension attribute is specified by the user. Relevant tables are fetched by looking for tables that describe relevant entities and contain the extension attribute (whose name exactly matches the one specified by the user). In future iterations we will relax all the exact-matching constraints, and we will explore how to support the user not only in looking for known attributes, but also by suggesting relevant attributes based on correlations in the retrieved tables.
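The first use case can be sketched in a few lines of Python. This is an illustrative approximation only, not the backend service itself: the in-memory corpus layout and the function name `extend_table` are hypothetical, and conflicts between sources are ignored by simply keeping the first value found.

```python
def extend_table(rows, subject, attribute, corpus):
    """Add `attribute` to each local row using exact matches in `corpus`.

    `rows` is the local table as a list of dicts, `subject` names its
    subject column, and `attribute` is the extension attribute requested
    by the user. `corpus` is a list of tables, each given as
    {"subject": <subject column name>, "rows": [<dict>, ...]}.
    """
    wanted = {row[subject] for row in rows}
    found = {}
    for table in corpus:
        for candidate in table["rows"]:
            entity = candidate.get(table["subject"])
            # Exact matching on both the entity label and the attribute name.
            if entity in wanted and attribute in candidate:
                # Keep the first value seen; no data fusion is attempted here.
                found.setdefault(entity, candidate[attribute])
    # Entities without a match receive None for the new attribute.
    return [{**row, attribute: found.get(row[subject])} for row in rows]
```

Relaxing the exact-matching constraints would mean replacing the two equality checks with similarity measures on entity labels and attribute names.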

[Lehmberg 2014] Oliver Lehmberg, Dominique Ritze, Petar Ristoski, Kai Eckert, Heiko Paulheim and Christian Bizer: Extending Tables with Data from over a Million Websites. Semantic Web Challenge 2014, Winner of the Big Data Track, October 2014.