The Web as a new data source for RapidMiner

Search and retrieve data tables from Google!

Posted by Edwin Yaqub (RapidMiner) on November 3, 2017

Updated versions of the Data Search for Data Mining and Web Tables Extraction extensions have been released on the RapidMiner Marketplace. The Data Search extension provides a new operator, “Google Table Search”, which performs keyword-matching search on two Google indexes: the “Web Tables” index, which covers over a hundred million public HTML data tables, and the “Fusion Tables” index, which covers over a million HTML data tables that users of the Google Fusion Tables service have made public.

The operator produces a list of website URLs that contain HTML data tables. This list can be passed to the new “Read HTML Tables” operator, which iterates over these websites, extracts their HTML data tables, and makes them available as ExampleSets, the default data table model in RapidMiner. Among other uses, the discovered data can be used to create or extend a tabular corpus for the Data Search Backend. For details, please read the blog post published on the RapidMiner Community portal on November 3, 2017.
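
To give a feel for what the extraction step involves, here is a minimal Python sketch of the general idea: go through a set of pages and pull every HTML table out as rows of cell text. This is not the extension's implementation or API. The names `TableExtractor`, `extract_tables`, and `tables_from_pages` are made up for illustration, the pages are passed in as already-downloaded HTML strings rather than fetched from URLs, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects each <table> as a list of rows; each row is a list of cell strings.

    Hypothetical helper for illustration only; nested tables are not supported.
    """

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables
        self._rows = None   # rows of the table currently being read
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def _flush_cell(self):
        # Store the open cell (if any) in the open row (if any).
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._flush_cell()  # tolerate a missing </td> before the next cell
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag == "tr" and self._row is not None:
            self._flush_cell()
            self._rows.append(self._row)
            self._row = None
        elif tag == "table" and self._rows is not None:
            if self._row is not None:  # tolerate a missing </tr>
                self._flush_cell()
                self._rows.append(self._row)
                self._row = None
            self.tables.append(self._rows)
            self._rows = None


def extract_tables(html):
    """Return all tables found in one HTML document."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


def tables_from_pages(pages):
    """pages maps URL -> HTML text; returns a list of (url, table) pairs."""
    results = []
    for url, html in pages.items():
        for table in extract_tables(html):
            results.append((url, table))
    return results
```

Each extracted table is a plain list of rows, which is roughly the shape one would need before turning it into a typed data table such as an ExampleSet. Pages without any table simply contribute nothing to the result.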