Building a Search Engine for the Cuban Web – Jorge Betancourt Gonzalez
This talk will cover the transition of Solr from “just the inverted index for search” into the core technology of a web search engine for the Cuban Web. The main purpose is to show how some of the most common features of today's web search engines can be implemented with Apache Solr, which makes Solr the heart of our system. The talk will cover integration with several Apache projects and how these systems work together to build a full-featured web search engine, an image search engine, and a real-time news search engine with alert capabilities, all powered by features offered by Solr and those projects. It will also discuss the use of Solr itself to help monitor and run the different components of the system. In essence, it shows how to build a web search engine using the power of the Apache Software Foundation.
Lucene And Solr Document Classification – Alessandro Benedetti
This presentation will start by introducing how Apache Lucene can be used to classify documents using data structures that already exist in your index instead of having to generate and supply external training sets.
Building on the introduction, the focus will be on extensions of the Lucene Classification module that come in Lucene 6.0 and on the module's incorporation into Solr 6.1. These extensions allow you to classify at the document level, with individual field weighting, numeric field support, lat/lon fields, and more.
The Solr ClassificationUpdateProcessor will be explored: how it works and how to use it, including basic and advanced features such as multi-class support and classification context filtering.
The presentation will include practical examples and real-world use cases.
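As a rough illustration of classifying from documents that are already indexed, rather than from an external training set, the sketch below votes among the most similar labelled documents. It is plain Python, not the Lucene or Solr API; the field names `text` and `category`, the toy documents, and the cosine-similarity scoring are all assumptions made for the example.

```python
import math
from collections import Counter

# Toy stand-in for an index whose documents already carry a class label.
# The field names "text" and "category" are hypothetical.
INDEXED_DOCS = [
    {"text": "solr query parser facet search", "category": "search"},
    {"text": "lucene inverted index search ranking", "category": "search"},
    {"text": "workflow ingest metadata pipeline", "category": "etl"},
    {"text": "pipeline extract transform load files", "category": "etl"},
]


def term_vector(text):
    """Bag-of-words term frequencies for a piece of text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(text, docs=INDEXED_DOCS, k=3):
    """Return the category with the highest similarity-weighted vote
    among the k most similar indexed documents, or None if nothing matches."""
    query = term_vector(text)
    scored = sorted(
        ((cosine(query, term_vector(d["text"])), d["category"]) for d in docs),
        reverse=True,
    )
    votes = Counter()
    for score, category in scored[:k]:
        if score > 0:
            votes[category] += score
    return votes.most_common(1)[0][0] if votes else None
```

Calling `classify("search ranking with lucene")` on the toy data picks the `search` label because the closest indexed documents carry it. A real deployment would rely on the index's own similarity scoring rather than this hand-rolled cosine.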
ETL Pipelines with OODT, Solr and Stuff – Tom Barber
Discover a number of Apache projects you may not have heard of and how they can help you process both clinical and non-clinical data. Apache OODT, developed by NASA, allows users to ingest and store files and metadata along with process workflows. OODT, together with cTAKES, allows us to extract clinical information from files, process it, and give end users access to the extracted data.
We can then take these sources and manipulate them further, creating a highly flexible ETL pipeline that offers reliability and scalability. Backed by Apache Solr, users can then interrogate the data via web interfaces and initiate further post-processing and investigation.
Of course, you may not have a clinical use case, but the platforms can be repurposed and will allow you to go away and build your own scalable data pipeline for processing and ingestion.
Fast & Scalable Email System with Apache Solr – Strategies, Tradeoffs and Optimizations – Arnon Yogev
Email interaction has its own unique characteristics and differs from traditional web search: for example, users search their own private mailboxes and are often interested in recent emails rather than the archive.
Taking advantage of these characteristics, we were able to optimize our infrastructure in terms of indexing strategy and query optimization and achieve a significant gain in scalability and performance.
Arnon will present the various tradeoffs that were explored, including multi-tiered indexes, sorted indexes, query optimizations and more.
Arnon will then present the benchmark results that stress the importance of correctly designing a Solr infrastructure and tailoring it to one's specific use case.
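To make the idea of a multi-tiered index concrete, here is a minimal sketch in plain Python. It is not the speaker's design and uses no Solr API; the two in-memory tiers, the `subject` and `date` fields, and the fallback rule are assumptions chosen only to show why favouring recent mail can avoid touching the archive.

```python
def matches(email, term):
    """Hypothetical matcher: case-insensitive substring match on the subject."""
    return term.lower() in email["subject"].lower()


def search(term, recent_tier, archive_tier, limit=2):
    """Search the small recent tier first and consult the larger archive
    tier only when the recent tier cannot fill the requested page.
    Returns (hits, tiers_used), with hits sorted newest first."""
    hits = [e for e in recent_tier if matches(e, term)]
    tiers_used = ["recent"]
    if len(hits) < limit:
        hits += [e for e in archive_tier if matches(e, term)]
        tiers_used.append("archive")
    # ISO-8601 date strings sort chronologically as plain strings.
    hits.sort(key=lambda e: e["date"], reverse=True)
    return hits[:limit], tiers_used


# Toy mailbox split into two tiers (all data invented for the example).
RECENT = [
    {"id": 1, "date": "2016-05-20", "subject": "Solr upgrade plan"},
    {"id": 2, "date": "2016-05-18", "subject": "Re: Solr upgrade plan"},
    {"id": 3, "date": "2016-05-10", "subject": "Lunch"},
]
ARCHIVE = [
    {"id": 4, "date": "2014-01-02", "subject": "Old Solr notes"},
]
```

With this data, `search("solr", RECENT, ARCHIVE)` is answered from the recent tier alone, while `search("notes", RECENT, ARCHIVE)` falls through to the archive.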
Managing Deeply Nested Documents in Apache Solr – Anshum Gupta
Apache Solr has recently started supporting deeply nested documents. Solr can now be used to perform search and faceting on documents such as nested email threads, comments and replies on social media, and enriched and annotated documents, without having to flatten them before ingestion.
Anshum Gupta will discuss pre-processing of data so that it can be indexed in Solr, making it possible to perform complex search and statistical aggregation on top of it. He will also cover query formation for sample use cases of nested data, and the options and features that Solr provides for faceting or aggregating such documents.
By the end of this talk, Solr users will have a better understanding both of how to use the features Solr provides to answer interesting questions about deeply nested documents and of workarounds for the missing pieces.
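As a small illustration of what a deeply nested document can look like before ingestion, the sketch below builds an email-style thread in Python using the `_childDocuments_` key from Solr's JSON update format. The field names (`type`, `author`, `body`, `subject`), the IDs, and the example block-join query are assumptions for illustration; which nesting and query features are available depends on the Solr version in use.

```python
import json


def reply(doc_id, author, body, replies=()):
    """Build one reply document; nested replies become child documents."""
    doc = {"id": doc_id, "type": "reply", "author": author, "body": body}
    if replies:
        doc["_childDocuments_"] = list(replies)
    return doc


# A thread with two levels of nesting below the root (all values invented).
thread = {
    "id": "thread-1",
    "type": "thread",
    "subject": "Reindexing schedule",
    "_childDocuments_": [
        reply("r1", "ana", "Can we reindex on Friday?",
              replies=[reply("r1.1", "ben", "Friday works for me.")]),
        reply("r2", "carl", "Please avoid peak hours."),
    ],
}


def count_descendants(doc):
    """Count every nested document below the given one."""
    children = doc.get("_childDocuments_", [])
    return len(children) + sum(count_descendants(c) for c in children)


# JSON body that could be posted to an update handler (not sent here).
payload = json.dumps([thread], indent=2)

# Example block-join query string: threads containing any reply by "ben".
parent_query = '{!parent which="type:thread"}author:ben'
```

Here `count_descendants(thread)` walks the structure without flattening it, which mirrors the point of the talk: the hierarchy is kept intact up to indexing time.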