Unconstrained and Correlation-based Search Methods released

New, innovative data searches released for DS4DM Backend

Posted by Benedikt Kleppmann (University of Mannheim) on May 22, 2018

The backend components within the DS4DM architecture are responsible for the management of data repositories, as well as the execution of search queries against these repositories. The new release of the DS4DM Backend API offers two new, innovative data search methods:

  • Unconstrained Data Search

    The existing DS4DM data search methods expected the user to know in advance which column she wanted to add to a table. The search function would then extend her table with exactly this one column. It was therefore not possible to use the available data in an exploratory fashion and extend a table with all attributes that can be filled with data from the repository. The Unconstrained data search enables exactly this: it extends a table with every attribute that can be populated with enough data values for its attribute density to exceed a given threshold. With a single data search operation, a table might thus be extended with dozens of new attributes.

    The new attributes can afterwards be used in mining processes and can lead to new, surprising insights. For example, given a table that contains information about lakes, the Unconstrained data search method would add attributes such as “country in which the lake is located”, “surface area in km2”, “length of the lake” or “maximal depth” and populate these attributes as far as possible using the data in the repository. Further application examples of the Unconstrained data search are provided on the DS4DM-Backend-Components website.

  • Correlation-based Data Search

    The Correlation-based data search extends a table with all attributes that correlate with one specific attribute of the original table. Whereas the Unconstrained data search adds every attribute that can be populated, the Correlation-based data search keeps only the subset of these attributes that correlate with the chosen attribute.

    The Correlation-based data search could, for instance, be used to extend a table with the two columns “name of lake” and “surface area in km2”. We specify that the new attributes should correlate with “surface area in km2”. In this case, the Correlation-based search would extend the table with additional attributes such as “length of the lake” and “maximal depth”, because these attributes correlate with “surface area in km2” (correlation > 0.4).
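The two ideas above can be illustrated with a small, self-contained Python sketch. It is not the DS4DM Backend API: the function names, the dictionary-based repository, and the toy lake values are all hypothetical. It also assumes that attribute density means the share of rows that receive a value, that correlation is the Pearson coefficient on numeric attributes, and that the absolute value of the coefficient is compared with the threshold; the actual backend may define these differently.

```python
from math import sqrt


def unconstrained_search(table, repository, key, min_density):
    """Add every repository attribute whose density exceeds min_density.

    Density is taken here as the fraction of rows that receive a value
    (an assumption made for this sketch).
    """
    existing = {column for row in table for column in row}
    candidates = sorted({a for attrs in repository.values() for a in attrs} - existing)
    extended = [dict(row) for row in table]
    added = []
    for attribute in candidates:
        values = [repository.get(row[key], {}).get(attribute) for row in table]
        filled = sum(value is not None for value in values)
        if table and filled / len(table) > min_density:
            added.append(attribute)
            for row, value in zip(extended, values):
                row[attribute] = value
    return extended, added


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 when it is undefined."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlation_search(table, repository, key, target, min_density, min_correlation):
    """Keep only the added numeric attributes that correlate with `target`."""
    extended, added = unconstrained_search(table, repository, key, min_density)
    correlated = []
    for attribute in added:
        pairs = [
            (row[target], row[attribute])
            for row in extended
            if isinstance(row.get(target), (int, float))
            and isinstance(row[attribute], (int, float))
        ]
        if len(pairs) >= 2:
            xs, ys = zip(*pairs)
            # Assumption: the absolute correlation is compared with the threshold.
            if abs(pearson(xs, ys)) > min_correlation:
                correlated.append(attribute)
    dropped = set(added) - set(correlated)
    result = [{k: v for k, v in row.items() if k not in dropped} for row in extended]
    return result, correlated


# Made-up toy data, for illustration only.
lakes = [
    {"name": "Lake A", "surface_area_km2": 10.0},
    {"name": "Lake B", "surface_area_km2": 50.0},
    {"name": "Lake C", "surface_area_km2": 200.0},
    {"name": "Lake D", "surface_area_km2": 500.0},
]
repo = {
    "Lake A": {"country": "X", "length_km": 5.0, "max_depth_m": 20.0, "elevation_m": 300.0},
    "Lake B": {"country": "Y", "length_km": 12.0, "max_depth_m": 45.0, "elevation_m": 100.0},
    "Lake C": {"country": "X", "length_km": 30.0, "max_depth_m": 90.0, "elevation_m": 310.0},
    "Lake D": {"length_km": 60.0, "max_depth_m": 150.0, "elevation_m": 200.0, "fish_species": 12},
}

extended_table, added_attributes = unconstrained_search(lakes, repo, "name", 0.5)
correlated_table, correlated_attributes = correlation_search(
    lakes, repo, "name", "surface_area_km2", 0.5, 0.4
)
print(added_attributes)
print(correlated_attributes)
```

In this toy run, the sparsely filled "fish_species" attribute falls below the density threshold and is never added, while the correlation-based variant additionally drops "country" (non-numeric) and "elevation_m" (only weakly correlated with surface area). Reusing the unconstrained step inside the correlation-based one mirrors the description above, in which the correlated attributes are a subset of all attributes that can be populated.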

The new data search methods no longer require the user to know in advance what she is looking for; that is, she no longer needs a hypothesis to direct her search. Instead, the new search methods allow the user to develop hypotheses by exploring the richness of the available data.

A more detailed description of the new data search methods as well as an evaluation of the methods can be found on the DS4DM-Backend-Components website.