Uber – Your Realtime Data Pipeline is Arriving Now! – Ankur Bansal
Building data pipelines is pretty hard! Building a multi-datacenter, active-active, real-time data pipeline for multiple classes of data with different durability, latency and availability guarantees is much harder.
Real-time infrastructure powers critical pieces of Uber (think Surge). In this talk we will discuss our architecture, technical challenges and lessons learned, and how a blend of open source infrastructure (Apache Kafka and Samza) and in-house technologies has helped Uber scale.
Large Scale SolrCloud Cluster Management via APIs – Anshum Gupta
Apache Solr is widely used by organizations to power their search platforms and often supports multiple users. Many cluster management APIs were introduced over the last few releases, allowing users to manage operations ranging from replica placement to forcing leader elections via API calls. By the end of this talk, intermediate Solr users will understand what’s available and when they can avoid interfering directly with the system, leading to more stable clusters and a lower chance of nodes going down. Attendees will also be much better equipped to build their own SolrCloud cluster management tools. I will also talk about when not to use these APIs and what’s planned in the near future to handle specific operational use cases.
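As a rough illustration of the kind of API-driven cluster management the abstract describes, the sketch below builds Collections API URLs for adding a replica and forcing a leader election. The base URL, collection name, shard name and node name are assumptions made for the example; the code only constructs the URLs and does not contact a cluster.

```python
from urllib.parse import urlencode

# Assumed address of a local Solr node; adjust for a real cluster.
SOLR_BASE = "http://localhost:8983/solr"


def collections_api_url(action, **params):
    """Build a Collections API URL for the given action and parameters."""
    query = urlencode({"action": action, **params, "wt": "json"})
    return f"{SOLR_BASE}/admin/collections?{query}"


# Hypothetical collection, shard and node names, for illustration only.
add_replica_url = collections_api_url(
    "ADDREPLICA", collection="mail", shard="shard1", node="host2:8983_solr"
)
force_leader_url = collections_api_url(
    "FORCELEADER", collection="mail", shard="shard1"
)

print(add_replica_url)
print(force_leader_url)
```

A cluster management tool could send these URLs with any HTTP client and inspect the JSON response before deciding on the next step.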
Fast & Scalable Email System with Apache Solr – Strategies, Tradeoffs and Optimizations – Arnon Yogev
Email interaction has unique characteristics and differs from traditional web search (for example, users search their own private mailboxes and are often interested in recent emails rather than the archive).
Taking advantage of these characteristics, we were able to optimize our infrastructure’s indexing strategy and query handling and achieve a significant gain in scalability and performance.
Arnon will present the various tradeoffs that were explored, including multi-tiered indexes, sorted indexes, query optimizations and more.
Arnon will then present the benchmark results that stress the importance of correctly designing a Solr infrastructure and tailoring it to one’s specific use case.
Managing Deeply Nested Documents in Apache Solr – Anshum Gupta
Apache Solr recently started supporting deeply nested documents. Solr can now be used to perform search and faceting on documents such as nested email threads, comments and replies on social media, and enriched and annotated documents, without having to flatten them before ingestion.
Anshum Gupta will discuss pre-processing of data so that it can be indexed in Solr, making it possible to perform complex search and statistical aggregation on top of it. He will also cover query formation for sample use cases of nested data and the options and features that Solr provides for faceting or aggregation of such documents.
By the end of this talk, Solr users will have a better understanding of both how to use the features Solr provides to answer interesting questions about deeply nested documents and the workarounds for the missing pieces.
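To make the idea of indexing and querying nested data more concrete, here is a minimal sketch of an email thread expressed as a nested document, together with a block join parent query. The field names (`doc_type`, `subject`, `body`) and document IDs are hypothetical; `_childDocuments_` is the key Solr's JSON update format accepts for anonymous child documents. The talk's own approach may well differ.

```python
import json

# Hypothetical email thread; field names and IDs are illustrative only.
thread = {
    "id": "thread-1",
    "doc_type": "thread",
    "subject": "Cluster upgrade",
    "_childDocuments_": [
        {
            "id": "msg-1",
            "doc_type": "message",
            "body": "Planning the solr upgrade",
            "_childDocuments_": [
                {"id": "msg-2", "doc_type": "reply", "body": "Sounds good"},
            ],
        },
    ],
}


def nesting_depth(doc):
    """Return the number of levels in a nested document (1 for a flat one)."""
    children = doc.get("_childDocuments_", [])
    return 1 + max((nesting_depth(child) for child in children), default=0)


# Block join parent query: intended to return threads that contain
# a nested message or reply whose body matches "solr".
parent_query = '{!parent which="doc_type:thread"}body:solr'

# JSON payload in the shape expected by Solr's JSON update handler.
payload = json.dumps([thread])
depth = nesting_depth(thread)

print(depth)
print(parent_query)
```

The document is sent as one block, so no flattening step is needed before ingestion; the `which` filter in the query identifies the top-level documents of each block.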