FeatherCast

The voice of The Apache Software Foundation

Apache Big Data Seville 2016 – Sparkler – Crawler on Apache Spark – Karanjeet Singh & Thamme Gowda

January 2, 2017
asfinfra

A web crawler is a bot that fetches resources from the web to build applications such as search engines and knowledge bases. In this presentation, Karanjeet Singh and Thamme Gowda describe a new crawler called Sparkler (a contraction of Spark-Crawler) that draws on recent advances in distributed computing and information retrieval by combining several Apache projects: Spark, Kafka, Lucene/Solr, Tika, and Felix. Sparkler is an extensible, highly scalable, high-performance web crawler that is an evolution of Apache Nutch and runs on an Apache Spark cluster. GitHub link: https://github.com/USCDataScience/sparkler

More information about this talk
