Tutorial

Wayfarer is a web crawling framework written in Ruby. It can retrieve pages over plain HTTP or by automating a web browser, and the two approaches are interchangeable. It is deployed with Redis and a message queue, but during development it can run entirely in memory, without Redis.

Getting started

In an empty directory, generate a new Gemfile and install Wayfarer:

bundle init
bundle add activejob wayfarer
bundle install

Jobs, tasks and batches

Wayfarer builds on Active Job, the message queue abstraction of Rails. You can use Wayfarer without Rails of course, as we do here.

A message queue supports two operations: appending messages to the end and consuming messages from the front. Wayfarer uses such a queue to process tasks, where a task is a pair of strings: a URL and a batch. Within a batch, Wayfarer processes each URL at most once (retries excepted).
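To make the queue semantics concrete, here is a minimal in-memory sketch in plain Ruby. It is not Wayfarer's implementation: Task, TinyQueue, and its methods are hypothetical names chosen for illustration, and a Set stands in for whatever state a real deployment keeps in Redis.

```ruby
require "set"

# Hypothetical stand-in for a task: a pair of URL and batch strings.
Task = Struct.new(:url, :batch)

# Illustrative FIFO queue that skips URLs already seen within a batch.
class TinyQueue
  def initialize
    @messages = []   # pending tasks, oldest first
    @seen = Set.new  # [batch, url] pairs that were already appended
  end

  # Appends a task to the end unless its URL was seen in this batch.
  # Returns true if the task was enqueued, false if it was a duplicate.
  def append(task)
    return false unless @seen.add?([task.batch, task.url])

    @messages.push(task)
    true
  end

  # Consumes the task at the front, or returns nil when the queue is empty.
  def consume
    @messages.shift
  end
end

queue = TinyQueue.new
queue.append(Task.new("https://example.com", "batch-a")) # enqueued
queue.append(Task.new("https://example.com", "batch-a")) # duplicate, skipped
queue.append(Task.new("https://example.com", "batch-b")) # other batch, enqueued
```

Keying the seen set on the batch and URL together is what lets the same URL be crawled again in a different batch.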

When a task is consumed, it is processed by a job, a Ruby class.

Let's give ourselves a dummy_job.rb that routes all URLs to its index instance method, where we print the current task:

dummy_job.rb

require "active_job"
require "wayfarer"

class DummyJob < ActiveJob::Base
  include Wayfarer::Base

  route.to :index

  def index
    puts task
  end
end

We can perform our job from the command line with the wayfarer perform subcommand. Between Active Job's log lines, the job prints its task:

bundle exec wayfarer perform -r dummy_job.rb DummyJob https://example.com

[ActiveJob] [DummyJob] [68853491-...] Performing DummyJob (Job ID: 68853491-...) from Async(default) with arguments: #<Wayfarer::Task url="https://example.com", batch="63d14035-...">
#<Wayfarer::Task url="https://example.com", batch="63d14035-...">
[ActiveJob] [DummyJob] [68853491-...] Performed DummyJob (Job ID: 68853491-...) from Async(default) in 507.65ms

Because we did not pass a batch, Wayfarer generated a UUID for it. We could also have enqueued the job programmatically by calling DummyJob.crawl("https://example.com") from a Ruby script.
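The batch default can be pictured with a few lines of plain Ruby. This is only a sketch of the idea: build_task is a hypothetical helper, not part of Wayfarer, and the "nightly-crawl" batch name is made up.

```ruby
require "securerandom"

# Hypothetical helper: pairs a URL with a batch, generating a UUID when
# no batch is given.
def build_task(url, batch = nil)
  { url: url, batch: batch || SecureRandom.uuid }
end

generated = build_task("https://example.com")
named = build_task("https://example.com", "nightly-crawl")

puts generated[:batch] # a random UUID, different on every run
puts named[:batch]     # nightly-crawl
```

Since each generated UUID is unique, two runs without an explicit batch never share deduplication state.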

Accessing page data

The page method returns an object representing the HTTP response or browser state. Wayfarer automatically parses HTML, XML, and JSON responses.

For HTML pages, page.doc returns a Nokogiri document, and page.meta returns a MetaInspector object for easy access to metadata.

Let's update dummy_job.rb to print the page title:

dummy_job.rb

require "active_job"
require "wayfarer"

class DummyJob < ActiveJob::Base
  include Wayfarer::Base

  route.to :index

  def index
    puts "Title: #{page.doc.title&.strip}"
    puts "H1: #{page.doc.at('h1')&.text&.strip}"
  end
end

Running wayfarer perform again now prints the title and the first h1 heading of the page you crawled.

Crawling and staging

To crawl more pages, we use the stage method. This adds new URLs to the current batch. Wayfarer handles deduplication, ensuring that we don't get stuck in loops processing the same URL multiple times within the same batch.

We can use page.meta.links.internal to easily find all links on the current page that point to the same domain.
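For intuition, this is roughly the kind of filtering an "internal links" helper performs, sketched with Ruby's standard URI library. The internal_links method is hypothetical, and comparing hosts is an assumption about what "same domain" means here; MetaInspector's actual rules may differ.

```ruby
require "uri"

# Hypothetical helper: resolves hrefs against the page URL and keeps
# those that share the page's host.
def internal_links(page_url, hrefs)
  base = URI.parse(page_url)

  hrefs.filter_map do |href|
    absolute = URI.join(base, href)
    absolute.to_s if absolute.host == base.host
  rescue URI::Error
    nil # skip hrefs that cannot be parsed
  end.uniq
end

links = internal_links(
  "https://example.com/docs/",
  ["/about", "guide.html", "https://other.example.org/", "mailto:hi@example.com"]
)
puts links
# https://example.com/about
# https://example.com/docs/guide.html
```

Relative hrefs are resolved against the page URL first, so they count as internal once they become absolute.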

Update dummy_job.rb to crawl recursively:

dummy_job.rb

require "active_job"
require "wayfarer"

class DummyJob < ActiveJob::Base
  include Wayfarer::Base

  route.to :index

  def index
    puts "Crawling #{page.url}"

    # Stage all internal links found on the page
    stage page.meta.links.internal
  end
end

Executing the crawl

The wayfarer perform command we used earlier only executes a single job. To run a full crawl where staged jobs are picked up and processed, we use wayfarer execute.

This command starts an in-process async executor that will continue running until all jobs in the batch are complete.

bundle exec wayfarer execute -r dummy_job.rb DummyJob https://example.com

You will see output indicating that multiple pages are being crawled as Wayfarer follows the internal links.
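The run-until-done behavior can be simulated without any network access. The sketch below is a sequential toy model, not Wayfarer's async executor: SITE is a made-up link graph, and crawl is a hypothetical function that stages each page's links and stops once the queue is empty.

```ruby
require "set"

# Hypothetical site: each URL maps to the internal links found on it.
SITE = {
  "https://example.com/"  => ["https://example.com/a", "https://example.com/b"],
  "https://example.com/a" => ["https://example.com/", "https://example.com/b"],
  "https://example.com/b" => ["https://example.com/a"]
}.freeze

# Processes queued URLs until none are left, staging each page's links
# and skipping URLs that were already staged.
def crawl(start_url, site)
  queue = [start_url]
  staged = Set[start_url]
  visited = []

  until queue.empty?
    url = queue.shift
    visited << url

    site.fetch(url, []).each do |link|
      queue << link if staged.add?(link)
    end
  end

  visited
end

puts crawl("https://example.com/", SITE)
# https://example.com/
# https://example.com/a
# https://example.com/b
```

Although the pages link to each other in a cycle, the loop terminates because a URL that was already staged is never queued again.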

Next steps

Now that you have a basic crawler running, you can explore more advanced features:

Routing: Learn how to route different URLs to different methods or handler classes.
Pages: Dive deeper into the Page object and custom response parsers.
Networking: Configure Wayfarer to use Headless Chrome (via Ferrum or Selenium) for pages that require JavaScript.
Configuration: Customize concurrency, user agents, and request headers.