Ranking the Web with Spark – Sylvain Zimmer
Common Search is building an open source search engine based on Common Crawl’s monthly dumps of several billion webpages. Ranking every URL on the Web in a transparent and reproducible way is core to the project.
In this presentation, Sylvain Zimmer will explain why Spark is a great match for the job, how the current ranking pipeline works, and what challenges it faces as it grows in scale and complexity to improve the quality of search results.
Specifically, we will dive into the new Spark 2.0 features that made it practical to compute PageRank from Python on every URL found in Common Crawl, and show how anyone can reproduce and tweak the results on their own cloud servers.