Jobs
Jobs are Active Jobs that use a DSL to process tasks that they read from a message queue.
Instead of implementing Active Job's #perform method yourself, you declare
routes to instance methods, much like a web application routes incoming
requests. Only URLs that match a route are retrieved and processed. All other
URLs are considered successfully processed. The method a URL is routed to,
called the action, has access to the retrieved page, the user agent that
retrieved the page, and the current task:
class DummyJob < ActiveJob::Base
  include Wayfarer::Base
  route.to :index
  def index
    task       # => #<Wayfarer::Task>
    page       # => #<Wayfarer::Page>
    user_agent # => Browser or HTTP client
  end
end
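The rule that unmatched URLs still count as successfully processed can be pictured with a small stand-alone sketch. It does not use Wayfarer: SketchJob, ROUTES and the host-based matching are hypothetical names and assumptions chosen for illustration only.

```ruby
require "uri"

# Hypothetical stand-in for a job; not Wayfarer's implementation.
class SketchJob
  # Assumption: a route is a host pattern mapped to an action name.
  ROUTES = { /\Aexample\.com\z/ => :index }.freeze

  attr_reader :retrieved

  def initialize
    @retrieved = []
  end

  # Returns :processed when a route matched and the action ran,
  # :skipped when no route matched (which still counts as success).
  def perform(url)
    host = URI(url).host
    _, action = ROUTES.find { |pattern, _| pattern.match?(host) }
    return :skipped unless action

    @retrieved << url # a real job would fetch the page here
    public_send(action, url)
    :processed
  end

  def index(url)
    # act on the retrieved page here
  end
end
```

In this sketch a URL without a matching route never reaches the retrieval step, and nothing is raised for it.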
You start a crawl by calling ::crawl, which appends a task for the URL to the
message queue. If you don't provide a batch, Wayfarer generates a UUID:
task = DummyJob.crawl("https://example.com")
# => #<Wayfarer::Task url="https://example.com", batch="498a13e0-...">
This is exactly the same as calling Active Job's ::perform_later and passing a
task directly:
task = Wayfarer::Task.new("https://example.com", SecureRandom.uuid)
DummyJob.perform_later(task)
Instead of a generated UUID, you can also set your own batch:
DummyJob.crawl("https://example.com", batch: "my-batch")
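The batch defaulting described above can be sketched without Wayfarer. SketchTask and sketch_crawl are hypothetical stand-ins, not the library's API; the only behaviour carried over from the text is "use the given batch, otherwise generate a UUID".

```ruby
require "securerandom"

# Hypothetical stand-in for Wayfarer::Task.
SketchTask = Struct.new(:url, :batch)

# No batch given: fall back to a freshly generated UUID.
def sketch_crawl(url, batch: SecureRandom.uuid)
  SketchTask.new(url, batch)
end

generated = sketch_crawl("https://example.com")
named     = sketch_crawl("https://example.com", batch: "my-batch")

puts generated.batch # a random UUID, different on every run
puts named.batch     # my-batch
```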
You can also use Wayfarer's CLI to enqueue a task:
wayfarer enqueue --batch my-batch DummyJob "https://example.com"
Following URLs
Jobs navigate crawls by staging URLs with stage(urls). When you stage a URL,
it is appended verbatim to an internal set. Once the action returns, all URLs
in the set are appended as tasks to the message queue.
class DummyJob < ActiveJob::Base
  include Wayfarer::Base
  route.to :index
  def index
    # Follow all outgoing links of the page
    stage page.meta.links.external
  end
end
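The staging behaviour (collect URLs in a set, enqueue them only after the action has returned) might look roughly like the following. SketchStager, run_action and the array used as a queue are assumptions for illustration and do not reflect Wayfarer's internals.

```ruby
require "set"

# Hypothetical model of staging; a plain Array plays the message queue.
class SketchStager
  attr_reader :queue

  def initialize
    @staged = Set.new
    @queue = []
  end

  # Accepts a single URL or a list; the set drops exact duplicates.
  def stage(urls)
    @staged.merge(Array(urls))
  end

  # Runs the action, then appends everything that was staged to the queue.
  def run_action
    yield self
    @queue.concat(@staged.to_a)
    @staged.clear
  end
end
```

Because URLs are stored verbatim, two spellings of the same address would count as two entries in this sketch.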
Accessing the current task
If the task's URL matched a route, the URL is retrieved over the network,
and the method that was routed to is called. The task is available as #task:
class DummyJob < ActiveJob::Base
  include Wayfarer::Base
  route.to :index
  def index
    task.url   # => "https://example.com"
    task.batch # => "my-batch"
  end
end
Accessing the current page
You have access to the retrieved page:
class DummyJob < ActiveJob::Base
  include Wayfarer::Base
  route.to :index
  def index
    page.url         # => "https://example.com"
    page.body        # => "<html>..."
    page.status_code # => 200
    page.headers     # => { "Content-Type" => ... }
    page.doc         # Only present for certain Content-Types
  end
end
Routing URLs to methods and extracting params
Jobs have a routing DSL that allows you to map URLs to methods and extract URL data:
class DummyJob < ActiveJob::Base
  include Wayfarer::Base
  route do
    path "/users/:id/profile", to: :index
  end
  def index
    params[:id] # => "42"
  end
end
DummyJob.crawl("https://example.com/users/42/profile?foo=bar")
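To make the param extraction concrete, here is a minimal sketch of how a pattern such as "/users/:id/profile" could be matched against a URL's path. compile_path and extract_params are hypothetical helpers, not part of Wayfarer, and the sketch assumes the query string plays no role in matching.

```ruby
require "uri"

# Turns "/users/:id/profile" into a regexp with one named group per :segment.
def compile_path(pattern)
  source = pattern.split("/", -1).map do |segment|
    segment.start_with?(":") ? "(?<#{segment[1..]}>[^/]+)" : Regexp.escape(segment)
  end.join("/")
  Regexp.new("\\A#{source}\\z")
end

# Returns a Hash of params, or nil when the path does not match.
def extract_params(pattern, url)
  match = compile_path(pattern).match(URI(url).path)
  return nil unless match

  match.named_captures.transform_keys(&:to_sym)
end

p extract_params("/users/:id/profile", "https://example.com/users/42/profile?foo=bar")
# => {:id=>"42"} (newer Rubies print this as {id: "42"})
```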
Controlling the user agent
You can control the browser or HTTP client that retrieved the page:
Wayfarer.config[:network][:agent] = :ferrum # Chrome DevTools Protocol
class DummyJob < ActiveJob::Base
  include Wayfarer::Base
  route.to :index
  def index
    user_agent.save_screenshot("capture.png")
  end
end
Restricting the processed Content-Types
By default, jobs process pages regardless of their Content-Type response header. To opt out of this behaviour, allow specific Content-Types as strings or Regexps. Once at least one Content-Type is allowed, pages with any other Content-Type are not processed:
class DummyJob < ActiveJob::Base
  include Wayfarer::Base
  content_type "text/html", "application/json"
  content_type /xml/
end
HTTP parameters in Content-Types are ignored for comparison
Content-Types are compared regardless of their parameters. For example,
text/html; charset=UTF-8 is considered the same as text/html.
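The comparison rules can be illustrated with a short stand-alone sketch. media_type and content_type_allowed? are hypothetical helpers rather than Wayfarer methods, and the case-insensitive comparison is an assumption of this sketch.

```ruby
# Strips parameters such as "; charset=UTF-8" from a Content-Type header.
def media_type(header)
  header.to_s.split(";").first.to_s.strip.downcase
end

# An empty allow-list means every Content-Type is processed (the default).
def content_type_allowed?(header, allowed)
  return true if allowed.empty?

  type = media_type(header)
  allowed.any? do |rule|
    rule.is_a?(Regexp) ? rule.match?(type) : rule.downcase == type
  end
end

allowed = ["text/html", "application/json", /xml/]
puts content_type_allowed?("text/html; charset=UTF-8", allowed) # true
puts content_type_allowed?("image/png", allowed)                # false
```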
Handling errors
Only Active Job error handling is supported
Wayfarer exclusively supports Active Job's error handling. You cannot use
message queue-specific error handling; for example, error handling with
sidekiq_options is unsupported. Wayfarer instruments Active Job, so errors
handled outside of it cause batches to be garbage-collected too early.
Wayfarer relies on Active Job's error handling methods:
- retry_on to retry jobs a number of times on certain errors:
  class DummyJob < ActiveJob::Base
    include Wayfarer::Base
    retry_on MyError, attempts: 3 do |job, error|
      # This block runs once all 3 attempts have failed
      # (1 initial attempt + 2 retries)
    end
  end
- discard_on to throw away jobs on certain errors:
  class DummyJob < ActiveJob::Base
    include Wayfarer::Base
    discard_on MyError do |job, error|
      # This block runs once and buries the job
    end
  end
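The attempt counting for retries (3 attempts means 1 initial run plus 2 retries, then the block) can be sketched in plain Ruby. This is only a model: run_with_attempts and SketchError are hypothetical, and Active Job re-enqueues the job for each retry rather than looping in process.

```ruby
class SketchError < StandardError; end

# Runs the block up to `attempts` times; calls on_exhausted after the last failure.
def run_with_attempts(attempts:, on_exhausted:)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue SketchError => error
    retry if tries < attempts
    on_exhausted.call(error)
  end
end

runs = []
run_with_attempts(attempts: 3, on_exhausted: ->(e) { puts "gave up: #{e.message}" }) do |n|
  runs << n
  raise SketchError, "boom"
end
p runs # => [1, 2, 3]
```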
Recreating user agents on certain errors
You can configure a list of exception classes upon which user agents get recreated (see User agent API):
Wayfarer.config[:network][:renew_on] = [MyIrrecoverableError]
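As a rough mental model of what such a list does, the sketch below replaces an agent whenever a listed exception is raised while it is in use. SketchAgentHolder and SketchIrrecoverableError are hypothetical, and re-raising the error after renewal is an assumption of this sketch, not documented Wayfarer behaviour.

```ruby
class SketchIrrecoverableError < StandardError; end

# Hypothetical holder that recreates its agent on selected errors.
class SketchAgentHolder
  attr_reader :agent, :created

  def initialize(renew_on:, &factory)
    @renew_on = renew_on
    @factory = factory
    @created = 0
    renew
  end

  # Yields the agent; on a listed error, builds a fresh agent and re-raises.
  def use
    yield agent
  rescue *@renew_on
    renew
    raise
  end

  private

  def renew
    @created += 1
    @agent = @factory.call
  end
end
```

Errors that are not in the list leave the existing agent in place.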