Apache Big Data Seville 2016 – Ranking the Web with Spark – Sylvain Zimmer

Common Search is building an open source search engine based on Common Crawl’s monthly dumps of several billion webpages. Ranking every URL on the Web in a transparent and reproducible way is core to the project.

In this presentation, Sylvain Zimmer will explain why Spark is a great match for the job, how the current ranking pipeline works, and what challenges it faces as it grows in scale and complexity to improve the quality of search results.

Specifically, we will dive into the new Spark 2.0 features that made it practical to compute PageRank from Python on every URL found in Common Crawl, and show how anyone can reproduce and tweak the results on their own cloud servers.
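
For readers unfamiliar with PageRank, here is a minimal sketch of the iteration in plain Python on a toy link graph. It is not the Common Search pipeline and does not use Spark; the function name `pagerank`, the damping factor of 0.85, the fixed iteration count and the sample graph are all assumptions chosen for illustration. A Spark version would presumably express the same per-iteration steps (sharing each page's rank among its outgoing links, then summing contributions per target) as distributed joins and aggregations.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to
           (a hypothetical in-memory stand-in for a crawled link graph).
    Returns a dict mapping each page to its rank; ranks sum to 1.
    """
    # Collect every page, including ones that only appear as link targets.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution.
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(ranks[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_ranks = {page: base for page in pages}

        # Each page shares its damped rank equally among its outgoing links.
        for source, targets in links.items():
            if not targets:
                continue
            share = damping * ranks[source] / len(targets)
            for target in targets:
                new_ranks[target] += share

        ranks = new_ranks

    return ranks


# Toy graph (made up for this example): a links to b and c, b to c, c to a.
example_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
example_ranks = pagerank(example_links)
for page in sorted(example_ranks, key=example_ranks.get, reverse=True):
    print(page, round(example_ranks[page], 4))
```

In this toy graph, page c ends up ranked highest because it receives links from both other pages, while b ranks lowest with a single incoming link that carries only half of a's rank.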

More information about this talk
