Apache Lucene/Solr/Search Track
Tuesday 16:15 UTC
Tales From The Trenches: Solr Operations
Mike Drob
There are many pitfalls that a team can fall into when designing and implementing a new Solr-based search application. We will draw on stories from the presenter's operational experience and distill the events into easy-to-understand patterns and anti-patterns. Topics covered will include query patterns, indexing patterns, and shard design.
An engineer with over a decade of distributed systems experience, Mike has spent most of his career helping enable others who are using big data platforms. He is a PMC member and committer on several Apache projects, and strongly believes that when people develop breadth in their expertise, it builds better software all around. When not working, he enjoys photography, dogs, and photography of dogs.
Tuesday 16:55 UTC
Improving Search Availability: Striving for more 9s
Shubhro Roy
Availability is a critical aspect of any distributed system, especially when your customers' mission-critical applications depend on it. But what does availability really mean for search, and how do we measure it? Once it is measured, how do we ensure that a multi-cluster deployment of Apache Solr with a terabyte-scale sharded inverted index hits the holy grail of four 9s (99.99%) of availability? How do we automatically detect failures in such systems, and what are our options to handle and recover from such failures without human intervention? In this talk we will discuss various architectural choices and deployment strategies we have adopted at Box to improve the availability of search while supporting high-throughput, near real-time indexing, low latency, and multi-tenancy. We will share our learnings from various issues we have faced running Solr at scale and how we have addressed them by building additional scaffolding or tweaking Solr itself. Come take a peek under the hood of Box Search.
Shubhro enjoys working with data at scale, be it indexing, mining, or analyzing it. Currently he is part of the Search team at Box, building infrastructure components that enable millions of users to find relevant content. Prior to Box, Shubhro worked on full-text database search at Oracle. He has been working on enterprise search and data discovery for the past 8 years, after graduating from Carnegie Mellon University with a Master's in Information Systems, specializing in Information Retrieval and Machine Learning.
Tuesday 17:35 UTC
Open Source Docs as Code
Cassandra Targett
This talk will review the Lucene community's experiences maintaining Solr documentation in the same way we maintain code. Prior to 2016, the Solr Reference Guide was only in Confluence (cwiki). Despite community agreement that docs are important, editing them was a separate process that was easy to put off. That put a burden on a few committers to update the content for each new release, and frequently each version's Guide was not complete until 2-3 months after a release was announced. In 2016 we decided to integrate the documentation with our source code. We converted Confluence pages to AsciiDoc files and started generating static HTML pages hosted on our main website. These changes allowed committers to update documentation as they changed the code. In an open source project where everyone is a volunteer and there are possibly only 1-2 people who understand any given feature, this has been an incredibly empowering change. Today committer maintenance of docs is high enough that the Guide requires very little effort to prepare for publication. All Release Managers can publish it as part of the release process, reducing the burden on the few who knew their way around the old system. This engagement means the Guide can evolve quickly as community needs change. In this talk I'll share how we made these choices, the content and build tools we use, and how other projects can make updating docs a natural part of the code change process.
Cassandra has 20 years of experience in search and knowledge management. She has been an Apache Lucene committer since 2013 and a member of the PMC since 2016. As Director of Engineering at Lucidworks, she manages the day-to-day work of the Solr development team.
Tuesday 18:15 UTC
An Anatomy of an Answer: Open NLP & Discourse Analysis - based Indexing
Boris Galitsky
Indexers usually index all text in documents. However, once we learn to "understand" the logic of a plain text, we will see how bad it is for search to index the whole thing. Discourse analysis helps to select the text fragments that should be matched with a potential query, and to throw away the rest. In this talk we will apply discourse linguistics to practical text search and discover that the majority of indexers, which index all text, perform very poorly for complex queries. Relying on standard relevance measures such as TF*IDF does not alleviate this problem. We will explore how discourse analysis helps search by identifying the text fragments that should be indexed and matched with potential queries, and those that would mislead the search and lower its precision. We will demonstrate how a discourse-analysis-based indexer can be built on top of the Apache OpenNLP project. The audience will learn how discourse analysis formalizes the logic of the text to be searched and represents it as a discourse tree, a structure that captures the domain-independent logical organization of text and is essential for finding relevant fragments. We will also discuss how to proceed from search engines like Solr to chatbots, where discourse analysis helps with dialogue management.
Boris Galitsky has been presenting talks on AI over the last two decades and at Apache conferences over the last few years. He has contributed linguistic and machine learning technologies to Silicon Valley startups over the last 25 years, as well as to eBay and Oracle, where he is currently an architect of the Digital Assistant project. An author of three computer science books, 150+ publications, and 20+ patents related to search, he is now working on a book, "AI for CRM", to be published by Springer in 2021. Boris is an Apache committer to OpenNLP, where he created the OpenNLP.Similarity component, which is a basis for search engine and chatbot development.
Wednesday 16:15 UTC
Concurrent Search In Lucene
Atri Sharma
Concurrent search is not a new feature in Lucene, but it has gone largely unexplored. This talk will cover the basics, the benefits, when to use it and when not to, and recent improvements in this area. Concurrent search can provide a massive gain for analytical queries, which are becoming more and more popular as data volumes grow. Single-threaded queries do not take full advantage of available CPU resources, something that concurrent querying fixes. This talk will walk the audience through the complete know-how, including integration with existing search platforms built using Lucene.
Database and search guy. Apache Lucene and Apache Solr committer. Major contributor to PostgreSQL.
Wednesday 16:55 UTC
Solr's new Plugin System
David Smiley
Solr 8.4 has a new plugin system that portends a future much improved from today: (a) load plugins at runtime, (b) find 3rd party plugins in a registry, (c) fetch, install, and even configure plugins from the command line (CLI), (d) a slimmer Solr distribution that is more secure. After an overview, you will see this system demonstrated, after which you should feel comfortable trying it out for yourself when you leave. Beyond the CLI demonstration, we'll look under the covers a bit to show some of how it works. We'll finish with a discussion of the gaps, and thus what the future hopefully holds as this new mechanism blossoms. You'll learn a bit about what it takes to "package" up your plugins too.
David Smiley is a prolific Apache Lucene/Solr committer/PMC member and ASF member. David has written books, delivered training, and spoken at meetups & conferences on this subject. Ultimately, his passion is hacking on Lucene & Solr. He works on search at Salesforce, which graciously supports these endeavors.
Wednesday 17:35 UTC
Monitoring Apache Solr Ecosystem on Kubernetes
Amrit Sarkar
Kubernetes is fast becoming the operating system for the Cloud and brings a ubiquity that has the potential for massive benefits for technology organizations. Applications and microservices are moved to orchestration tools like Kubernetes to leverage features like horizontal autoscaling, fault tolerance, CI/CD, and more. Apache Solr can be deployed on Kubernetes at large scale for a plethora of use cases. At such scale, effective metric dashboards, log analytics, monitoring, and alerting are required to make sure that abnormal behaviors are detected, errors are diagnosed, and the entire ecosystem can be fine-tuned to reach the best possible performance. In this talk, we discuss and compare various monitoring and analytics tools for the Solr ecosystem running on Kubernetes. These range from built-in features to third-party tools that provide powerful yet easy-to-use dynamic dashboards and OpenTracing support.
Amrit Sarkar is a Cloud Search Reliability Engineer at Lucidworks Inc, a California-based enterprise search technology company, with 4+ years of experience in the search domain, big data, e-commerce, and product. He works primarily on running search-based applications on Kubernetes, and on developing and improving core components of Apache Solr.
Wednesday 18:15 UTC
Towards an open source tool stack for e-commerce search
Eric Pugh, René Kriegler
Search teams in the e-commerce space want to own their search: they want to understand exactly how the retrieval works and optimise it according to their specific needs, both from the user and from the seller perspective. Implementing search using open source search engines, such as Solr and Elasticsearch, seems like a perfect match. Unfortunately, the open source solutions available today aren’t anywhere near reaching parity with a commercial solution out of the box, especially when it comes to optimizing search relevance and managing individual queries as a merchandiser. This leads to a very difficult buy vs build decision, especially for smaller teams that don’t have deep search expertise already and are faced with developing significant functionality for digital commerce from scratch. In this session we will introduce Chorus: an initiative to combine open source tools and libraries like Querqy (a powerful query rewriting library), SMUI (a search management UI to boost and bury products and categories), and Quepid, RRE, and Quaerite (search relevance assessment and tuning projects) into a single template to accelerate the development of your own e-commerce search, allowing you to shift from setting up basic search functionality to domain-specific optimizations much faster.
Eric Pugh:
Fascinated by the craft of software development, Eric Pugh has been involved in the open source world as a developer, committer, and user for the past fifteen years. He is a member of the Apache Software Foundation and continues to be very active in the Solr and Tika projects, and he avidly reads every commit to the Zeppelin project! In biotech, financial services, and defense IT, he has helped European and American companies develop coherent strategies for embracing open source software. Eric became involved in Solr when he submitted the patch SOLR-284 for extracting text from binary files (such as PDF and MS Office formats), which subsequently became the single most popular patch as measured by votes! He co-authored the book Apache Solr Enterprise Search Server, now in its third edition. Today he helps OSC’s clients build their own search teams and improve their search maturity, both by leading projects and by acting as a trusted advisor.
René Kriegler:
René has been working as a freelance search consultant for clients in Germany and abroad for more than ten years. Although he is interested in all aspects of search and NLP, key areas include search relevance and e-commerce search. His technological focus is on Solr/Elasticsearch/Lucene. René is founder and co-organiser of MICES (Mix-Camp E-Commerce Search - a Berlin Buzzwords partner event). He maintains Querqy - an open source library for query pre-processing.