Using Apache Beam to get data into your data lake? In an agile company you don’t want to re-compile your ingestion pipeline every time a sprint finishes. In this talk we go over all the mechanisms and building blocks you need to make dynamic pipelines really work.
We’ll see why schemas are so important, look at how to get these schemas into our pipelines, and discuss methods to protect ourselves from data corruption and incompatible schema evolution.
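The kind of compatibility check such methods rely on can be sketched in a few lines of plain Python. This is purely illustrative — the `is_backward_compatible` helper and the dict-based schema encoding are assumptions for demonstration, not Beam’s actual schema API:

```python
# Illustrative sketch: a schema as a mapping of field name -> type,
# plus the set of fields the new schema requires. Not Beam's API.

def is_backward_compatible(old, new, new_required):
    """A new schema can read old data if every old field it keeps
    has the same type, and every field it requires already existed."""
    for field, ftype in new.items():
        if field in old and old[field] != ftype:
            return False          # type changed: incompatible
    for field in new_required:
        if field not in old:
            return False          # new required field: old rows lack it
    return True

old_schema = {"user_id": int, "email": str}

# Adding an optional field is a compatible evolution...
assert is_backward_compatible(old_schema,
                              {"user_id": int, "email": str, "age": int},
                              new_required={"user_id"})

# ...changing an existing field's type is not.
assert not is_backward_compatible(old_schema,
                                  {"user_id": str, "email": str},
                                  new_required={"user_id"})
```

Running a gate like this against a schema registry before deploying a pipeline is one way to catch incompatible evolution before it corrupts data at rest.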
New features like schema-aware PCollections get a thorough deep dive, and finally we go over real-world examples and position Apache Beam in the new PLT (Push Load Transform) world.
Access to real-time data is increasingly important for many organizations. At Lyft, we process millions of events per second in real-time to compute prices, balance marketplace dynamics, and detect fraud, among many other use cases. To do so, we run dozens of Apache Flink and Apache Beam pipelines. Flink provides a powerful framework that makes it easy for non-experts to write correct, high-scale streaming jobs, while Beam extends that power to our large base of Python programmers.
Historically, we have run Flink clusters on bare, custom-managed EC2 instances. In order to achieve greater elasticity and reliability, we decided to rebuild our streaming platform on top of Kubernetes. In this session, I’ll cover how we designed and built an open-source Kubernetes operator for Flink and Beam, some of the unique challenges of running a complex, stateful application on Kubernetes, and some of the lessons we learned along the way.
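The operator pattern at the heart of this approach lets a team declare a streaming job as a Kubernetes custom resource and have a controller reconcile the cluster toward it. A hypothetical sketch of such a resource — all names and fields here are illustrative of the pattern, not the actual CRD schema of Lyft’s operator:

```yaml
# Hypothetical custom resource for a Flink job; field names are
# illustrative, not any operator's exact schema.
apiVersion: flink.example.com/v1
kind: FlinkApplication
metadata:
  name: pricing-pipeline
spec:
  image: registry.example.com/pricing-pipeline:1.4.2
  jarName: pricing-pipeline.jar
  entryClass: com.example.PricingJob
  parallelism: 16
  flinkConfig:
    taskmanager.numberOfTaskSlots: 4
    state.backend: rocksdb
```

The controller watches resources of this kind and creates or updates JobManager and TaskManager deployments to match the spec — which is what makes upgrades, rescaling, and recovery declarative rather than hand-managed.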
Open source draws its strength from the communities that use and build it. It’s their diversity of perspective, skills, and accountability that makes individual projects stronger and builds a richer, more solid ecosystem. But while open source is used by the entire world, that broad user community is not yet reflected in the contributor base. In fact, diversity in open source is significantly worse than in proprietary software. While we claim that contribution is open to all, clearly not everybody feels empowered or welcome to contribute to open source projects. And that’s a problem. How can we fix this? How can we enable all developers everywhere to contribute to open source and take advantage of the opportunities it presents? In Q1 2019, Google’s Open Source Strategy team commissioned a user research study to better understand why users do (or do not) contribute back to a project, and how documentation can help remove roadblocks to that project’s adoption and usage – and encourage and enable community contributions. In this talk, we’ll cover the methodology and participant profiles of the study, and describe the four essential user personas we identified – along with their critical user-to-contributor journeys. We’ll also distill the research to provide concrete recommendations and best practices for creating documentation that helps your community flourish and your project thrive.
When you are starting on your open source adventure, there are a lot of things to learn that have very little to do with coding and instead relate to interacting with people. Apache is, at its best, a group of people who are trying to share their experience and teach new projects and contributors how to successfully manage open source projects. However, like the blind people each describing a part of an elephant, each mentor brings their personal experience to the table, and thus can give good, yet conflicting advice to new projects. Still, that aggregate advice has helped many projects to become successful. Based on the author’s experience, this talk will take you through 10 common traps in running Apache projects, why they happen, and how to avoid or mitigate them.
There are lots of companies offering training for, and around, the Apache ecosystem, as well as many other topics, that all create their own training material. Keeping this material up to date, especially in the fast-moving open-source world, is not an easy task and takes a lot of time and effort. And all the time, someone not too far away is probably working on a very similar slide deck to explain what ZooKeeper does… It is this fundamental issue that the Apache Training project was created to address – centralising training resources and making them easier to access and (re-)use. We believe that training material, while important, is only part of what constitutes good training. The larger part is an experienced trainer who actually understands how the key aspects can be applied in real life, and also the consequences of not implementing things correctly. Great training tends to come from the experiences and stories that a trainer can share in addition to the actual material. Based on this underlying belief, and the very real pain of updating training material, the Apache Training project was born. In early 2019, it entered the Apache Incubator and is already building a community and gaining various contributions. In this talk, we want to share with you some of the events and the thought process that went into the creation of this project, as well as our main goals and principles. The first part of this talk will focus on how we arrived at the decision that open-source training material is a good thing to have, and a little bit of the early history of this project. In the second part, we will discuss what the project is actually about. We will demonstrate the version-control-friendly system of designing slides in source control that is at the core of this project. On top of these source files, we aim to create a metadata repository which will enable everybody to efficiently search available content and easily create tailored decks from existing material.
And last but very much not least, we will cover how you can contribute to and benefit from this project, as this is front and center an effort from the community for the community!
About 3 years ago, I had the idea of using open-source software to create the next generation of industrial IT solutions. At ApacheCon 2017 in Miami I introduced this idea to the public with my talk ‘Building SCADA systems with Apache Software’. In 2019, Apache PLC4X became an Apache TLP. In this talk I will not go into the technical details of the project itself, but cover all the steps we took on this journey from a community-building point of view – starting way before writing the first line of code. After these last two years, I would claim that community-building is by far the most challenging task when initiating a new project, but it also seems to be one we tend to treat as a second-class citizen – even in established projects. Hopefully I will be able to show you how rewarding community-building can be… After all, Apache is all about community over code, and it’s that way for a reason.
The commercial software industry is plagued by numerous significant problems: Cost and schedule overruns, poor software quality, unrealistic expectations, products that fail to meet customer needs, and more. These problems are historic, chronic, and pervasive; they’ve been with us for decades, and changes to methodology haven’t had much effect in remedying them. By contrast, the Apache Software Foundation has been around for 20 years and the Apache Way – the framework of values and governance for Apache Software Foundation projects – has been the guiding force for dozens of highly successful software projects that are used in countless environments everywhere, run much of the Internet, and are probably used by every technical person in one way or another every day. In fact it would be hard to find any other software organization, open or proprietary, with a better track record. What if the Apache Way were applied to all software projects? Could the Apache Way be the answer to solve these universal software engineering challenges, for everybody, once and for all?
Baidu is one of the biggest Internet companies; it was founded 20 years ago and now has more than 10,000 engineers. In recent years it began to embrace open source, including adopting InnerSource and contributing to the open source community. InnerSource is the use of the Apache Way inside a company. As the leader of this program, I will talk about how this happened, and the challenges we faced and overcame. We needed to set policies, define processes, and enable tooling, but the most important part was to cultivate an open source culture: educating engineers on what the Apache Way is and how to cooperate inside the company just as they would in an open source community. By now, many projects inside Baidu have adopted InnerSource, and some projects have matured and been open sourced to the external world, some even contributed to an open source foundation. Through these efforts, Baidu has contributed three projects to the Apache Software Foundation as incubator projects: ECharts, Doris, and brpc. But open source is a long journey, and we need to persist and keep going.
All podling releases need to be voted on by the Incubator PMC before being released to the world. I’ll go through what the Incubator PMC looks for in every release and what you can do to make it pass that IPMC vote and get your project one step closer to graduation. More importantly, I’ll cover where you can get help if you need it. In this talk, I’ll describe current Incubator and ASF policy, recent changes that you may not be aware of, and go into detail on the legal requirements of common open source licenses and the best way to assemble your NOTICE and LICENSE files. Where possible, I describe the reasons why things are done a certain way, which may not always be obvious from our documentation. I’ll show how I review a release and the simple tools I use. I’ll go through a worked example or two, including a fictional project called Apache Wombat, and cover common mistakes I’ve seen in releases.
Zhaopin.com is a Chinese online recruitment services provider. As a bilingual job board, Zhaopin.com has one of the largest selections of real-time job vacancies in China. All recruitment advertisements are provided by prominent Chinese and foreign companies and enterprises throughout China. Zhaopin.com provides professional HR services to over 2.2 million clients, and its average daily page views are over 68 million. Apache Pulsar was developed to address several shortcomings of existing messaging systems, offering a rich set of enterprise features, message durability, and lower message latency. Zhaopin.com had built its enterprise event bus on RabbitMQ and ran it for years. As the company grew, the amount of its data got larger and larger, and the usage scenarios for messaging systems also became more varied. The original RabbitMQ-based architecture could no longer keep up, so in 2018 Zhaopin.com chose Apache Pulsar to replace it. Since Apache Pulsar was deployed in August 2018, the data volume in Pulsar has increased from 30 million messages per day to more than 6 billion. Drawing on a wealth of practical experience, they have summarised best practices for using Pulsar as an enterprise event bus. Penghui and Jia detail how Apache Pulsar meets the requirements of Zhaopin.com, how Apache Pulsar is used at Zhaopin.com, and what the best practices are for using Pulsar as an enterprise event bus. Along the way, they will highlight the advantages of Apache Pulsar over the old system, and how the high durability, high throughput, and low latency of Apache Pulsar make it ideally suited for an enterprise event bus.