Apache Ignite – JCache and Beyond – Dmitriy Setrakyan
This presentation provides an overview of the Apache Ignite project, including a detailed look at its distributed in-memory Data Grid, Compute Grid, Streaming, in-memory SQL, and many other components. We will also go into detail on how existing in-memory caching products and data grids can be used to share memory across Apache Spark jobs and applications, and present a hands-on demo showing the performance benefits of querying shared memory using SQL.
Data volumes of all kinds have increased dramatically in recent years, and new storage devices (NVMe SSDs, flash SSDs, etc.) can be utilized to improve data access performance. HDFS provides mechanisms such as HDFS Cache, Heterogeneous Storage Management (HSM), and Erasure Coding (EC) to support this, but defining and adjusting different storage strategies for different data in a dynamic environment remains a big challenge.
To overcome this challenge and improve the storage efficiency of HDFS, we will introduce a comprehensive solution, Smart Storage Management (SSM), in Apache Hadoop. SSM collects HDFS operation data and system state information from the cluster, extracts data access patterns from the collected metrics, and, based on these patterns, automatically makes sophisticated use of these mechanisms to optimize HDFS storage efficiency.
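The core idea is mapping observed access patterns to HDFS storage policies. As a hypothetical sketch (not SSM's actual rule engine), the kind of decision SSM automates might look like this, where `ALL_SSD`, `HOT`, and `COLD` are real HDFS storage policy names and the thresholds are invented for illustration:

```python
def choose_storage_policy(access_count_7d: int) -> str:
    """Map a file's 7-day access count to an HDFS storage policy name.

    Thresholds are hypothetical; SSM derives such rules from collected
    cluster metrics rather than hard-coding them.
    """
    if access_count_7d >= 100:
        return "ALL_SSD"   # hot data: keep replicas on SSD
    if access_count_7d >= 2:
        return "HOT"       # warm data: default disk placement
    return "COLD"          # rarely accessed: candidate for archive storage

print(choose_storage_policy(150))  # ALL_SSD
print(choose_storage_policy(0))    # COLD
```

In practice such a policy would then be applied to a path, e.g. via `hdfs storagepolicies -setStoragePolicy`.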
Big Data Machine Learning with Apache PredictionIO – Simon Chan
Apache PredictionIO (incubating) provides a full stack machine learning environment on top of Apache Spark, making it easy for developers to iterate on production-deployable machine learning engines. Apache PredictionIO is designed for data scientists and developers to build predictive web services for real-world applications in a fraction of the time normally required.
In this talk, the speaker will introduce the latest developments of PredictionIO, and show how to use it to build and deploy predictive engines in real production environments. Using PredictionIO’s DASE design pattern, Simon will illustrate how developers can build machine learning applications with the separation of concerns (SoC) in mind. The speaker will also go over the future roadmap of Apache PredictionIO and some of its recent development.
Get in Control of Your Workflows with Apache Airflow – Christian Trebing
Whenever you work with data, sooner or later you stumble across the definition of your workflows. At what point should you process your customer’s data? What subsequent steps are necessary? And what went wrong with your data processing last Saturday night?
At Blue Yonder we use Apache Airflow to solve these problems. With Airflow, we define workflows as directed acyclic graphs and get a shiny UI for free. Airflow ships with task operators that cover common tasks out of the box, and it can be extended with new functionality by developing plugins in Python; for more specific cases, you can develop new operators in your plugin.
This talk will explain the concepts behind Airflow, demonstrating how to define your own workflows and how to extend the functionality. You'll also get to hear about our experiences using this tool in real-world scenarios.
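A workflow expressed as a directed acyclic graph is just tasks plus dependency edges, from which a scheduler derives a valid execution order. As a minimal illustration of that concept in plain Python (standard library only, not Airflow's API; the task names are invented):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "load": {"validate"},
    "report": {"load"},
}

# A valid execution order respects every dependency edge.
order = list(TopologicalSorter(workflow).static_order())
print(order)  # ['extract', 'transform', 'validate', 'load', 'report']
```

Airflow's own DAG and operator classes wrap the same idea with scheduling, retries, and the UI on top.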
Apache Hadoop is used to run jobs that execute tasks over multiple machines with complex dependencies between tasks. At scale, there can be tens to thousands of tasks running over hundreds to thousands of machines, which increases the challenge of making sense of their performance. Pipelines of such jobs that logically run a business workflow add another level of complexity. No wonder the question of why Hadoop jobs run slower than expected remains a perennial source of grief for developers. In this talk, we will draw on our experience in debugging and analyzing Hadoop jobs to describe some methodical approaches to this problem, and present current and new tracing and tooling ideas that can help semi-automate parts of it.
In my group at Microsoft, we have worked with the United Nations, Guide Dogs for the Blind in the UK, several automotive companies, and Ströer on a number of projects involving high-scale geospatial data.
In this talk, I’ll share some of the best practices and patterns that have come out of those experiences: best practices for storing and indexing geospatial data at scale, incremental ingestion and slice processing of the data, and efficiently building and presenting progressive levels of detail.
The audience will walk away with an understanding of how to efficiently summarize data over a geographic area, general methods for doing ingestion with Apache Kafka (or other event ingestion systems), incremental updates to large-scale datasets with Apache Spark, and best practices for visualizing this data on the frontend.
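Summarizing data over a geographic area typically starts with bucketing points into a spatial index such as map tiles. A minimal sketch of the standard Web Mercator ("slippy map") tile formula, shown as an illustration rather than the speaker's implementation:

```python
import math

def lonlat_to_tile(lon: float, lat: float, zoom: int) -> tuple[int, int]:
    """Return the Web Mercator (slippy map) tile index containing a point."""
    n = 2 ** zoom                      # tiles per axis at this zoom level
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.log(math.tan(lat_rad) + 1.0 / math.cos(lat_rad))
             / math.pi) / 2.0 * n)
    return x, y

# At zoom 0 the whole world is one tile; at zoom 1 the origin falls in (1, 1).
print(lonlat_to_tile(0.0, 0.0, 0))  # (0, 0)
print(lonlat_to_tile(0.0, 0.0, 1))  # (1, 1)
```

Aggregating points per tile key at several zoom levels is one common way to build the progressive levels of detail mentioned above.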
Geospatial Track: Crowd Learning for Indoor Navigation – Thomas Burgess
indoo.rs enables location-based services for indoor applications. With indoo.rs, developers can add new features to their products, including triggering events by location, tracking assets, and showing the closest routes to other places. For this, we use WiFi/beacon radio infrastructure, mobile devices, and our cloud, which together produce lots of geospatial time series data. The real-time indoor navigation fuses independent movement from custom 9D sensor fusion with position estimates obtained by comparing current signal readings to a reference map. This talk will discuss how we create and maintain these maps in our big data machine learning system, which leverages crowd data through Kafka and Spark to run SLAM and context-aware algorithms that create high-quality trajectories. In addition to their use in reference maps, these trajectories provide an additional input for our interactive analytics.
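The position-estimation step, comparing current signal readings to a reference map, can be sketched as nearest-neighbor fingerprint matching. This is a toy illustration with invented RSSI values, not indoo.rs's actual algorithm:

```python
import math

# Hypothetical reference map: known position -> RSSI (dBm) per access point.
reference_map = {
    (0.0, 0.0): {"ap1": -40, "ap2": -70},
    (5.0, 0.0): {"ap1": -70, "ap2": -40},
    (2.5, 4.0): {"ap1": -55, "ap2": -55},
}

def estimate_position(reading: dict[str, float]) -> tuple[float, float]:
    """Return the reference position whose fingerprint is closest in RSSI space."""
    def distance(fingerprint: dict[str, float]) -> float:
        # Euclidean distance over access points; missing readings default
        # to a weak -100 dBm.
        return math.sqrt(sum((reading.get(ap, -100.0) - rssi) ** 2
                             for ap, rssi in fingerprint.items()))
    return min(reference_map, key=lambda pos: distance(reference_map[pos]))

print(estimate_position({"ap1": -42, "ap2": -68}))  # (0.0, 0.0)
```

A production system would fuse such estimates with the inertial (sensor-fusion) track rather than use them in isolation.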
Geospatial Track: Geospatial Big Data: Software Architectures and the Role of APIs in Standardized Environments – Ingo Simonis, Open Geospatial Consortium (OGC)
A number of technologies have evolved around big data, in particular products from the Apache community such as Hadoop, Storm, Spark, Hive, and Cassandra. The geospatial community has developed a range of standards to handle geospatial data efficiently. Most of these standards are produced by the Open Geospatial Consortium (OGC) and implemented in the form of domain-agnostic data models and Web services. With the emerging demand for streamlined APIs, new questions arise: how can access to Big Data in the geospatial community be handled most efficiently, and how do existing standards serve these new demands and the implementation realities of distributed Big Data repositories operated, for example, by the various space agencies? This presentation aims to stimulate the discussion of geospatial Big Data handling in standardized environments and explore the role of products from the Apache community.