Distributed In-Database Machine Learning with Apache MADlib (incubating) – Roman Shaposhnik
Data science is moving with gusto into the enterprise, where data often resides in relational databases and SQL is the main workload. So how can an enterprise add a data science dimension to its business without a major IT re-architecture?
Apache MADlib (incubating) is an innovative SQL-based open source library for scalable in-database analytics. It provides parallel implementations of mathematical, statistical, and machine learning methods. Bringing machine learning computations to the data, rather than moving the data out to a separate system, makes for excellent scale-out performance on massively parallel processing (MPP) platforms like Greenplum Database and Apache HAWQ (incubating).
In this talk, we will describe the origin of MADlib, review the architecture and common usage patterns, and look ahead to some interesting plans around performance acceleration.