Apache Big Data Seville 2016 – Avro: Travel Across (r)evolution – Arek Osinski & Darek Eliasz

Avro: Travel Across (r)evolution – Arek Osinski & Darek Eliasz

These days we are generating enormous amounts of data. The biggest challenge lies in transforming raw data into knowledge. We would like to take you on a short journey and show our approach for moving from the unstructured world of microservices to a world with Avro schemas inside our data pipelines.

Avro is a well-known format for storing and processing information of any kind. What are the key features of this format? What are the common problems? Where can you run into pitfalls? How does this influence our Big Data ecosystem?

The whole story is illustrated with examples from a real-life implementation.
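As a rough illustration (not taken from the talk), the sketch below shows the schema-evolution mechanism such pipelines lean on: a record serialized with an old writer schema is read back with a newer reader schema that adds a field with a default value. The record and field names here are invented for the example.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

import java.io.ByteArrayOutputStream;

public class AvroEvolutionSketch {
    // Writer schema: what the producing service serialized (hypothetical example).
    private static final String WRITER = "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"string\"}]}";

    // Reader schema: a newer consumer that added an optional field with a default.
    private static final String READER = "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"string\"},"
        + "{\"name\":\"source\",\"type\":\"string\",\"default\":\"unknown\"}]}";

    public static void main(String[] args) throws Exception {
        Schema writerSchema = new Schema.Parser().parse(WRITER);
        Schema readerSchema = new Schema.Parser().parse(READER);

        // Serialize a record with the old (writer) schema.
        GenericRecord event = new GenericData.Record(writerSchema);
        event.put("id", "42");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writerSchema).write(event, encoder);
        encoder.flush();

        // Deserialize with the new (reader) schema; the missing field falls back to its default.
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord decoded =
            new GenericDatumReader<GenericRecord>(writerSchema, readerSchema).read(null, decoder);
        System.out.println(decoded); // prints something like {"id": "42", "source": "unknown"}
    }
}
```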

More information about this talk

Apache Big Data Seville 2016 – Your Datascience Journey with Apache Zeppelin – Moon soo Lee, Anthony Corbacho & Jongyoul Lee

Your Datascience Journey with Apache Zeppelin – Moon soo Lee, Anthony Corbacho & Jongyoul Lee

Take a journey with us to see how Apache Zeppelin started, how it helps your data science lifecycle, and how it became a popular TLP project. We’ll also see how the community’s focus has shifted, from the basic notebook feature and Spark integration to advanced features like multi-tenancy. Moon soo Lee will explain the value of Apache Zeppelin with demos of some key use-case scenarios. We’ll also look at the ecosystem around it – how various projects and companies are using Apache Zeppelin in their products and services in many different ways.

Finally, we’ll discuss Apache Zeppelin’s future roadmap and some of the challenges the community faces.

More information about this talk

Apache Big Data Seville 2016 – Scalable Private Information Retrieval: Introducing Apache Pirk (incubating) – Ellison Anne Williams

Scalable Private Information Retrieval: Introducing Apache Pirk (incubating) – Ellison Anne Williams

Querying information over TBs of data where no one can see what you query or the responses obtained? It sounds like science fiction, but it is actually the science of Private Information Retrieval (PIR). This talk will introduce Apache Pirk – a new incubating Apache project designed to provide a framework for scalable, distributed PIR. We will discuss the motivation for Apache Pirk, its distributed implementations in platforms such as Spark and Storm, its current algorithms, the power of homomorphic encryption, and take a look at the path forward.
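The abstract highlights homomorphic encryption, the primitive that PIR schemes like Pirk’s build on. As a toy illustration only – not Pirk’s implementation, with tiny keys and deliberately insecure parameters – here is a minimal Paillier-style sketch in which multiplying two ciphertexts decrypts to the sum of the plaintexts:

```java
import java.math.BigInteger;
import java.security.SecureRandom;

// Toy Paillier cryptosystem: E(m1) * E(m2) mod n^2 decrypts to m1 + m2.
// Tiny key size and simplified parameters; for illustration only, not secure.
public class PaillierToy {
    final BigInteger n, nSquared, g, lambda, mu;
    final SecureRandom rnd = new SecureRandom();

    PaillierToy(int bits) {
        BigInteger p = BigInteger.probablePrime(bits, rnd);
        BigInteger q = BigInteger.probablePrime(bits, rnd);
        n = p.multiply(q);
        nSquared = n.multiply(n);
        g = n.add(BigInteger.ONE); // simplified generator g = n + 1
        BigInteger p1 = p.subtract(BigInteger.ONE), q1 = q.subtract(BigInteger.ONE);
        lambda = p1.multiply(q1).divide(p1.gcd(q1)); // lcm(p - 1, q - 1)
        mu = l(g.modPow(lambda, nSquared)).modInverse(n);
    }

    BigInteger l(BigInteger x) { return x.subtract(BigInteger.ONE).divide(n); }

    BigInteger encrypt(BigInteger m) {
        // random blinding factor r in [1, n); c = g^m * r^n mod n^2
        BigInteger r = new BigInteger(n.bitLength() - 1, rnd).add(BigInteger.ONE);
        return g.modPow(m, nSquared).multiply(r.modPow(n, nSquared)).mod(nSquared);
    }

    BigInteger decrypt(BigInteger c) {
        return l(c.modPow(lambda, nSquared)).multiply(mu).mod(n);
    }

    public static void main(String[] args) {
        PaillierToy ph = new PaillierToy(64);
        BigInteger c1 = ph.encrypt(BigInteger.valueOf(17));
        BigInteger c2 = ph.encrypt(BigInteger.valueOf(25));
        // Multiplying ciphertexts adds the underlying plaintexts.
        BigInteger sum = ph.decrypt(c1.multiply(c2).mod(ph.nSquared));
        System.out.println(sum); // 42
    }
}
```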

More information about this talk

Apache Big Data Seville 2016 – Building Streaming Applications with Apache Apex – Thomas Weise & Chinmay Kolhatkar

Building Streaming Applications with Apache Apex – Thomas Weise & Chinmay Kolhatkar

Stream processing applications built on Apache Apex run on Hadoop clusters and typically power analytics use cases where availability, flexible scaling, high throughput, low latency and correctness are essential. These applications consume data from a variety of sources, including streaming sources like Apache Kafka, Kinesis or JMS, file based sources or databases. Processing results often need to be stored in external systems (sinks) for downstream consumers (pub-sub messaging, real-time visualization, Hive and other SQL databases etc.). Apex has the Malhar library with a wide range of connectors and other operators that are readily available to build applications. We will cover key characteristics like partitioning and processing guarantees, generic building blocks for new operators (write-ahead-log, incremental state saving, windowing etc.) and APIs for application specification.
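As a hedged sketch of what the application-specification API looks like – operator names and logic are invented here, and a real pipeline would plug in Malhar connectors rather than a synthetic generator – a minimal two-operator Apex DAG could be assembled roughly like this:

```java
import com.datatorrent.api.DAG;
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.InputOperator;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;
import com.datatorrent.common.util.BaseOperator;
import org.apache.hadoop.conf.Configuration;

@ApplicationAnnotation(name = "LineCountDemo")
public class LineCountApp implements StreamingApplication {

  /** Emits one synthetic line per call; a real application would use a Malhar connector instead. */
  public static class LineGenerator extends BaseOperator implements InputOperator {
    public final transient DefaultOutputPort<String> out = new DefaultOutputPort<>();
    private long i;

    @Override
    public void emitTuples() {
      out.emit("line-" + (i++));
    }
  }

  /** Counts tuples per streaming window and logs the count when the window closes. */
  public static class WindowCounter extends BaseOperator {
    public final transient DefaultInputPort<String> in = new DefaultInputPort<String>() {
      @Override
      public void process(String tuple) {
        count++;
      }
    };
    private long count;

    @Override
    public void beginWindow(long windowId) {
      count = 0;
    }

    @Override
    public void endWindow() {
      System.out.println("tuples in window: " + count);
    }
  }

  @Override
  public void populateDAG(DAG dag, Configuration conf) {
    // Wire the generator to the counter; Apex handles partitioning and recovery.
    LineGenerator gen = dag.addOperator("lines", new LineGenerator());
    WindowCounter counter = dag.addOperator("counter", new WindowCounter());
    dag.addStream("rawLines", gen.out, counter.in);
  }
}
```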

More information about this talk

Apache Big Data Seville 2016 – Apache HBase: Overview and Use Cases – Apekshit Sharma

Apache HBase: Overview and Use Cases – Apekshit Sharma

NoSQL databases are critical in building Big Data applications. Apache HBase, one of the most popular NoSQL databases, is used by Facebook, Apple, eBay and hundreds of other enterprises to store, analyze and profit from their petabyte-scale volumes of data. This talk will discuss:

– the motivation behind NoSQL databases

– the basic architecture of a popular NoSQL system, Apache HBase

– some commonly seen big data usage patterns in industry, and when and how to use Apache HBase (or another better-suited NoSQL database); a minimal client sketch follows below.
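For readers new to HBase, this is a hedged sketch of basic client usage against a hypothetical pre-existing table; the table, column family, qualifier and row-key names are invented for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        // Assumes an 'events' table with a column family 'd' already exists.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("events"))) {

            // Write one cell: row key -> family:qualifier = value.
            Put put = new Put(Bytes.toBytes("user42#2016-11-16"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("clicks"), Bytes.toBytes("17"));
            table.put(put);

            // Read it back by row key.
            Result result = table.get(new Get(Bytes.toBytes("user42#2016-11-16")));
            byte[] clicks = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("clicks"));
            System.out.println(Bytes.toString(clicks));
        }
    }
}
```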

More information about this talk

ApacheCon Seville 2016 – Inner Sourcing Transcends Tech: Applying Open Source Principles in Marketing and Communications – Joanna Madej

Inner Sourcing Transcends Tech: Applying Open Source Principles in Marketing and Communications – Joanna Madej

As every company increasingly becomes a software company, the software development “way of doing things” begins bleeding into other departments and disciplines within the business. Particularly in organizations where software is the product, many of the principles behind open source as a development methodology can apply to managing and expanding interdisciplinary projects in marketing and communications. In these companies, open source development practices can set a tone for the culture of the organization as a whole, creating a shift from an open source *software* company to an open source company more generally.

More information about this talk

ApacheCon Seville 2016 – Inner Sourcing 101 – Jim Jagielski

Inner Sourcing 101 – Jim Jagielski

Inner Sourcing is taking the lessons learned from how successful Open Source projects are run and managed and leveraging those techniques and principles in today’s Enterprise IT development. Jim will provide key insights on how you can benefit from Inner Sourcing based on his experience helping dozens of companies make that transition.

More information about this talk

Apache Big Data Seville 2016 – Real Time Aggregation with Kafka, Spark Streaming and ElasticSearch, Scalable Beyond Million RPS – Dibyendu Bhattacharya

Real Time Aggregation with Kafka, Spark Streaming and ElasticSearch, Scalable Beyond Million RPS – Dibyendu Bhattacharya

While building a massively scalable real-time pipeline to collect transaction logs from network traffic, one of the major challenges was performing aggregation on streaming data on the fly. This was needed to compute multiple metrics across various dimensions, which help our customers see near real-time views of application delivery and performance. In this talk, learn how we designed our real-time pipeline for multi-stage aggregation powered by Kafka, Spark Streaming and ElasticSearch. At InstartLogic we used a custom Spark receiver for Kafka for the first-stage aggregation. The second stage performs Spark Streaming-driven aggregation within a given batch window. The final stage involves custom ElasticSearch plugins to aggregate across batches. I will cover this multi-stage aggregation, including optimisations across all stages, and how it scales beyond a million RPS.
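The Kafka receiver and the ElasticSearch plugins in this pipeline are custom, but the middle stage – per-batch aggregation of (dimension, metric) pairs in Spark Streaming – can be sketched roughly as follows. The socket source and the line layout are stand-ins invented for the example; the real pipeline consumed transaction logs from Kafka.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class BatchWindowAggregation {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("batch-window-aggregation").setMaster("local[2]");
        // 10-second batch window; each batch is aggregated independently.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Stand-in source for the sketch; the production pipeline used a custom Kafka receiver.
        JavaDStream<String> logLines = jssc.socketTextStream("localhost", 9999);

        // Lines shaped like "customerId,urlPath,responseTimeMs" (invented layout).
        JavaPairDStream<String, Long> keyed = logLines.mapToPair(line -> {
            String[] f = line.split(",");
            return new Tuple2<String, Long>(f[0] + "|" + f[1], Long.parseLong(f[2]));
        });

        // Second-stage aggregation: totals per (customer, URL) dimension within the batch window.
        JavaPairDStream<String, Long> perDimensionTotals = keyed.reduceByKey((a, b) -> a + b);

        // Downstream, these aggregates would be pushed to ElasticSearch for cross-batch roll-up.
        perDimensionTotals.print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```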

More information about this talk

Apache Big Data Seville 2016 – Mining and Identifying Security Threat Using Spark SQL, HBase and Solr – Manidipa Mitra

Mining and Identifying Security Threat Using Spark SQL, HBase and Solr – Manidipa Mitra

This presentation will discuss how to design a highly effective, scalable and performant distributed system to detect identity theft and fraud by mining billions of shareholding records for a leading financial organization. It will also discuss how terabytes of data can be migrated from Oracle to Hadoop, stored in Parquet format, processed in a distributed computing framework with Spark DataFrames, and pushed to different serving layers (HBase, Impala, Solr, HDFS) depending on the query/access pattern. The design will also shed light on how frequent transactions were handled and how data was pre-processed at the end of the day to meet a seconds-level response-time SLA, creating thousands of reports by mining millions of records in minutes.
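A hedged sketch of the middle of such a pipeline – reading migrated Parquet data into a Spark DataFrame, aggregating per account with a toy heuristic, and writing a Parquet report for the serving layers to pick up – with all paths, column names and thresholds invented for illustration:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.count;
import static org.apache.spark.sql.functions.countDistinct;

public class HoldingsAggregation {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("holdings-aggregation")
            .master("local[*]") // local run for the sketch; omit when submitting to a cluster
            .getOrCreate();

        // Shareholding transactions previously migrated from Oracle into Parquet on HDFS.
        Dataset<Row> transactions = spark.read().parquet("hdfs:///warehouse/holdings/transactions");

        // Flag accounts transacting under suspiciously many distinct identities (toy heuristic).
        Dataset<Row> suspicious = transactions
            .groupBy(col("accountId"))
            .agg(countDistinct(col("holderName")).alias("identities"),
                 count(col("accountId")).alias("txCount"))
            .filter(col("identities").gt(3));

        // Persist the daily report; serving layers (HBase, Solr, Impala) would be loaded from here.
        suspicious.write().mode(SaveMode.Overwrite).parquet("hdfs:///warehouse/holdings/suspicious_daily");

        spark.stop();
    }
}
```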

More information about this talk

Apache Big Data Seville 2016 – On-Premise, UI-Driven Hadoop/Spark/Flink/Kafka/Zeppelin-as-a-service – Jim Dowling

On-Premise, UI-Driven Hadoop/Spark/Flink/Kafka/Zeppelin-as-a-service – Jim Dowling

Since April 2016, SICS Swedish ICT has provided Hadoop/Spark/Flink/Kafka/Zeppelin-as-a-service to researchers in Sweden. We have developed a UI-driven multi-tenant platform (Apache v2 licensed) in which researchers securely develop and run their applications. Applications can be either deployed as jobs (batch or streaming) or written and run directly from Notebooks in Apache Zeppelin. All applications are run on YARN within a security framework built on project-based multi-tenancy. A project is simply a grouping of users and datasets. Datasets are first-class entities that can be securely shared between projects. Our platform also introduces a necessary condition for elasticity: pricing. Application execution time in YARN is metered and charged to projects, that also have HDFS quotas for disk usage. We also support project-specific Kafka topics that can also be securely shared.

More information about this talk

An "unofficial" podcast from the world of Apache!