FeatherCast

The voice of The Apache Software Foundation

Apache Big Data Seville 2016 – Ranking the Web with Spark – Sylvain Zimmer

January 2, 2017
asfinfra


Common Search is building an open source search engine based on Common Crawl’s monthly dumps of several billion webpages. Ranking every URL on the Web in a transparent and reproducible way is core to the project.

In this presentation, Sylvain Zimmer will explain why Spark is a great match for the job, how the current ranking pipeline works, and what challenges it faces as it grows in scale and complexity to improve the quality of search results.

Specifically, we will dive into the new Spark 2.0 features that made it practical to compute PageRank from Python on every URL found in Common Crawl, and show how anyone can reproduce and tweak the results on their own cloud servers.
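
For readers new to PageRank, here is a minimal sketch of the idea in plain Python. It is not the Common Search pipeline and does not use Spark; the function name `pagerank`, the `toy_graph` data, the damping factor of 0.85, and the iteration count are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a small link graph.

    links: dict mapping each page to the list of pages it links to.
    All names and defaults here are illustrative assumptions.
    """
    # Collect every page, including ones that only appear as link targets.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution.
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(ranks[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_ranks = {page: base for page in pages}

        # Each page passes an equal share of its rank to every page it links to.
        for page, targets in links.items():
            if targets:
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share

        ranks = new_ranks

    return ranks


# Hypothetical four-page graph: "c" receives the most incoming links.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_graph)
```

On this toy graph, page "c" ends up with the highest rank and "d", which nothing links to, with the lowest. At the scale of Common Crawl the same iteration has to be distributed across a cluster, which is where Spark comes in.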

More information about this talk
