Jobs

Jobs are Active Jobs that use a DSL to process tasks that they read from a message queue.

Instead of implementing Active Job's #perform method yourself, you declare routes to instance methods, much as a web application routes incoming requests. Only URLs that match a route are retrieved and processed; all other URLs are skipped and counted as successfully processed. The routed method (the action) has access to the retrieved page, the user agent that retrieved it, and the current task:

class DummyJob < ActiveJob::Base
  include Wayfarer::Base

  route.to :index

  def index
    task # => #<Wayfarer::Task>
    page # => #<Wayfarer::Page>
    user_agent # => Browser or HTTP client
  end
end
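
The dispatch rule described above can be pictured with a small, self-contained model. This is only a sketch and not Wayfarer's implementation: ToyRoutedJob, its ROUTES table and the host-based matcher are hypothetical names made up for illustration. It shows that a URL without a matching route is never fetched, while a matching URL is fetched and handed to the routed method.

```ruby
require "uri"

# Toy model of route dispatch; all names here are hypothetical.
class ToyRoutedJob
  # Route table: a predicate on the parsed URL paired with an action name.
  ROUTES = [
    [->(uri) { uri.host == "example.com" }, :index]
  ].freeze

  attr_reader :fetched, :handled

  def initialize
    @fetched = []
    @handled = []
  end

  def perform(url)
    _, action = ROUTES.find { |matcher, _| matcher.call(URI(url)) }
    return :skipped unless action # no route matched: nothing is fetched

    @fetched << url # stands in for the network request
    public_send(action, url)
    :processed
  end

  def index(url)
    @handled << url
  end
end

job = ToyRoutedJob.new
job.perform("https://example.com/a") # => :processed
job.perform("https://other.test/")   # => :skipped
job.fetched                          # => ["https://example.com/a"]
```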

You can start a crawl with ::crawl, which appends a task for the URL to the message queue. If you don't provide a batch, Wayfarer generates a UUID:

task = DummyJob.crawl("https://example.com")
# => #<Wayfarer::Task url="https://example.com", batch="498a13e0-...">

This is exactly the same as calling Active Job's #perform_later and passing a task directly:

task = Wayfarer::Task.new("https://example.com", SecureRandom.uuid)
DummyJob.perform_later(task)

Instead of a generated UUID, you can also set your own batch:

DummyJob.crawl("https://example.com", batch: "my-batch")
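
To make the relationship between ::crawl, #perform_later and the batch default concrete, here is a minimal stand-alone model. It is a sketch under assumptions: ToyTask and ToyCrawlJob are hypothetical stand-ins, and the in-memory array only imitates a message queue. The point is that crawl builds a task, falls back to a fresh UUID when no batch is given, and then enqueues the task the same way a direct perform_later call would.

```ruby
require "securerandom"

# Hypothetical stand-in for a task: just a URL and a batch name.
ToyTask = Struct.new(:url, :batch)

class ToyCrawlJob
  QUEUE = [] # imitates the message queue

  # Stand-in for Active Job's perform_later: append the task to the queue.
  def self.perform_later(task)
    QUEUE << task
    task
  end

  # Stand-in for ::crawl: the batch defaults to a freshly generated UUID.
  def self.crawl(url, batch: SecureRandom.uuid)
    perform_later(ToyTask.new(url, batch))
  end
end

generated = ToyCrawlJob.crawl("https://example.com")
named     = ToyCrawlJob.crawl("https://example.com", batch: "my-batch")

generated.batch # a random UUID, different on every call
named.batch     # => "my-batch"
```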

You can also use Wayfarer's CLI to enqueue a task:

wayfarer enqueue --batch my-batch DummyJob "https://example.com"

Following URLs

Jobs navigate crawls by staging URLs with stage(urls). When you stage a URL, it is appended verbatim to an internal set. Once the action returns, all URLs in the set are appended as tasks to the message queue.

class DummyJob < ActiveJob::Base
  include Wayfarer::Base

  route.to :index

  def index
    # Follow all outgoing links of the page
    stage page.meta.links.external
  end
end
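
The staging behaviour can be modelled in a few lines of plain Ruby. This is a sketch, not Wayfarer's code: ToyStagingJob and its array-backed QUEUE are hypothetical, and the flush-after-action order is taken from the description above. Because the container is a set, staging the same URL twice yields a single queued task.

```ruby
require "set"

# Toy model of staging; names are hypothetical.
class ToyStagingJob
  QUEUE = [] # imitates the message queue

  def initialize
    @staged = Set.new
  end

  # Accepts one URL or a list of URLs and adds them to the set unchanged.
  def stage(urls)
    @staged.merge(Array(urls))
  end

  # Runs the action first, then enqueues everything that was staged.
  def perform(url)
    index(url)
    @staged.each { |staged_url| QUEUE << staged_url }
    @staged.clear
  end

  def index(_url)
    stage "https://example.com/a"
    stage ["https://example.com/a", "https://example.com/b"]
  end
end

ToyStagingJob.new.perform("https://example.com")
ToyStagingJob::QUEUE # => ["https://example.com/a", "https://example.com/b"]
```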

Accessing the current task

If the task's URL matches a route, the URL is retrieved over the network and the routed method is called. Inside that method, the task is available as #task:

class DummyJob < ActiveJob::Base
  include Wayfarer::Base

  route.to :index

  def index
    task.url # => "https://example.com"
    task.batch # => "my-batch"
  end
end

Accessing the current page

You have access to the retrieved page:

class DummyJob < ActiveJob::Base
  include Wayfarer::Base

  route.to :index

  def index
    page.url         # => "https://example.com"
    page.body        # => "<html>..."
    page.status_code # => 200
    page.headers     # => { "Content-Type" => ... }
    page.doc         # Only present for certain Content-Types
  end
end

Routing URLs to methods and extracting `params`

Jobs have a routing DSL that allows you to map URLs to methods and extract URL data:

class DummyJob < ActiveJob::Base
  include Wayfarer::Base

  route do
    path "/users/:id/profile", to: :index
  end

  def index
    params[:id] # => "42"
  end
end

DummyJob.crawl("https://example.com/users/42/profile?foo=bar")
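
For readers curious how a pattern such as "/users/:id/profile" can yield params, the following sketch compiles the pattern into a Regexp with named captures and matches it against the URL's path only, so the query string is ignored. The helpers compile_path_pattern and extract_params are hypothetical and written for this illustration; Wayfarer's own matcher may work differently.

```ruby
require "uri"

# Hypothetical helper: turns "/users/:id/profile" into an anchored Regexp
# with one named capture group per ":name" segment.
def compile_path_pattern(pattern)
  parts = pattern.split("/", -1).map do |segment|
    if segment.start_with?(":")
      "(?<#{segment[1..]}>[^/]+)"
    else
      Regexp.escape(segment)
    end
  end
  source = parts.join("/")
  Regexp.new("\\A#{source}\\z")
end

# Returns a params hash when the URL's path matches, nil otherwise.
def extract_params(pattern, url)
  match = compile_path_pattern(pattern).match(URI(url).path)
  match && match.named_captures.transform_keys(&:to_sym)
end

extract_params("/users/:id/profile", "https://example.com/users/42/profile?foo=bar")
# => { id: "42" }
extract_params("/users/:id/profile", "https://example.com/about")
# => nil
```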

Controlling the user agent

You can control the browser or HTTP client that retrieved the page:

Wayfarer.config[:network][:agent] = :ferrum # Chrome DevTools Protocol

class DummyJob < ActiveJob::Base
  include Wayfarer::Base

  route.to :index

  def index
    user_agent.save_screenshot("capture.png")
  end
end

Restricting the processed Content-Types

By default, jobs process pages regardless of their Content-Type response header. To opt out of this default, declare an allow-list of Content-Types as strings or Regexps. Once at least one Content-Type is allowed, pages with any other Content-Type are not processed:

class DummyJob < ActiveJob::Base
  include Wayfarer::Base

  content_type "text/html", "application/json"
  content_type /xml/
end

HTTP parameters in Content-Types are ignored for comparison

Content-Types are compared regardless of their parameters. For example, text/html; charset=UTF-8 is considered the same as text/html.
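
The comparison rule can be sketched as follows. The helpers media_type and allowed? are hypothetical names for this illustration, and the exact matching semantics inside Wayfarer are assumed rather than confirmed: parameters after the first semicolon are dropped, strings must then match exactly, Regexps are tested against the remaining media type, and an empty allow-list lets everything through.

```ruby
# Hypothetical helper: drops parameters such as "; charset=UTF-8".
def media_type(content_type)
  content_type.to_s.split(";").first.to_s.strip
end

# An empty allow-list means every Content-Type is processed.
def allowed?(content_type, allow_list)
  return true if allow_list.empty?

  type = media_type(content_type)
  allow_list.any? do |entry|
    entry.is_a?(Regexp) ? entry.match?(type) : media_type(entry) == type
  end
end

allow_list = ["text/html", "application/json", /xml/]

allowed?("text/html; charset=UTF-8", allow_list) # => true
allowed?("application/atom+xml", allow_list)     # => true
allowed?("image/png", allow_list)                # => false
allowed?("image/png", [])                        # => true
```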

Handling errors

Only Active Job error handling is supported

Wayfarer exclusively supports Active Job's error handling. Message queue-specific error handling, for example with Sidekiq's sidekiq_options, is unsupported: because Wayfarer instruments Active Job, it causes batches to be garbage-collected too early.

Wayfarer relies on Active Job's error handling methods:

retry_on to retry jobs a number of times on certain errors:

class DummyJob < ActiveJob::Base
  include Wayfarer::Base

  retry_on MyError, attempts: 3 do |job, error|
    # This block runs once all 3 attempts have failed
    # (1 initial attempt + 2 retries)
  end
end

discard_on to throw away jobs on certain errors:

class DummyJob < ActiveJob::Base
  include Wayfarer::Base

  discard_on MyError do |job, error|
    # This block runs once and buries the job
  end
end
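
The attempt arithmetic of retry_on (attempts: 3 means three executions in total) can be checked with a small in-process simulation. This is a sketch only: run_with_attempts and ToyError are hypothetical, and real Active Job re-enqueues the job between attempts instead of looping in place. It shows the exhaustion callback firing exactly once, after the third failure.

```ruby
class ToyError < StandardError; end

# Hypothetical helper: runs the block up to `attempts` times in total and
# calls `on_exhausted` once if the last attempt also fails.
# Returns the number of executions.
def run_with_attempts(attempts:, on_exhausted:)
  executions = 0
  begin
    executions += 1
    yield executions
  rescue ToyError => error
    retry if executions < attempts
    on_exhausted.call(error)
  end
  executions
end

exhausted = []
count = run_with_attempts(attempts: 3, on_exhausted: ->(e) { exhausted << e.message }) do |n|
  raise ToyError, "attempt #{n} failed"
end

count     # => 3
exhausted # => ["attempt 3 failed"]
```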

Recreating user agents on certain errors

You can configure a list of exception classes upon which user agents get recreated (see User agent API):

Wayfarer.config[:network][:renew_on] = [MyIrrecoverableError]
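
As a rough mental model of renewal, the sketch below replaces an agent whenever a block raises one of the configured exception classes, then re-raises so that the surrounding error handling still sees the error. ToyAgentPool and ToyIrrecoverableError are hypothetical, and both the re-raise and the use of a plain Object as the agent are assumptions made for illustration, not documented Wayfarer behaviour.

```ruby
class ToyIrrecoverableError < StandardError; end

# Toy model of agent renewal; names and behaviour are assumptions.
class ToyAgentPool
  attr_reader :agent, :created

  def initialize(renew_on:)
    @renew_on = renew_on
    @created = 0
    renew
  end

  # Yields the current agent. If the block raises a listed exception,
  # the agent is replaced and the error is re-raised.
  def use
    yield agent
  rescue *@renew_on
    renew
    raise
  end

  private

  def renew
    @created += 1
    @agent = Object.new # stands in for a browser or HTTP client
  end
end

pool = ToyAgentPool.new(renew_on: [ToyIrrecoverableError])
begin
  pool.use { raise ToyIrrecoverableError }
rescue ToyIrrecoverableError
  # the error still propagates to the caller
end
pool.created # => 2
```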