Apache Big Data Seville 2016 – Crawling the Web for Common Crawl – Sebastian Nagel

Crawling the Web for Common Crawl – Sebastian Nagel

Common Crawl is a non-profit organization which regularly crawls a significant sample of the web and makes the data accessible free of charge to everyone interested in running machine-scale analysis on web data. The presentation will demonstrate how to use the Common Crawl data, covering data formats and tools as well as examples and derived datasets. The monthly crawls are run by Apache Nutch on Apache Hadoop. Sebastian will also share his experience from running a web-scale crawl on a small budget.

More information about this talk
