Apache Big Data Seville 2016 – Sparkler – Crawler on Apache Spark – Karanjeet Singh & Thamme Gowda

A web crawler is a bot that fetches resources from the web to build applications such as search engines and knowledge bases. In this presentation, Karanjeet Singh and Thamme Gowda will describe a new crawler called Sparkler (a contraction of Spark-Crawler), which draws on recent advances in distributed computing and information retrieval by combining several Apache projects: Spark, Kafka, Lucene/Solr, Tika, and Felix. Sparkler is an extensible, highly scalable, high-performance web crawler that evolved from Apache Nutch and runs on an Apache Spark cluster.

GitHub link: https://github.com/USCDataScience/sparkler
