User agents
User agents are used by jobs to retrieve the contents behind a URL into a page, for example a remotely controlled Firefox process or a Ruby HTTP client.
User agents are kept in a connection pool and all user agents in the pool share the same type and configuration. You can add your own custom user agents by implementing the user agent API.
Wayfarer comes with the following built-in user agents:
- :http (default)
- :ferrum to automate Google Chrome
- :selenium to automate a variety of browsers
- :capybara to use Capybara sessions
Configure the user agent with the global configuration option:
Wayfarer.config[:network][:agent] = :ferrum # or :selenium, :capybara, ...
You can access the user agent that was checked out from the pool with
#user_agent in action methods:
class DummyJob < ActiveJob::Base
  include Wayfarer::Base
  route.to :index
  def index
    user_agent # => #<Ferrum::Browser ...>
  end
end
You can also implement custom user agents to support your own HTTP client or browser automation service/protocol.
Ad-hoc HTTP requests
Regardless of the configured user agent, you can always make ad-hoc HTTP GET requests
that return pages with #fetch(url):
class DummyJob < ActiveJob::Base
  include Wayfarer::Base
  route.to :index
  def index
    page = fetch("https://example.com") # => #<Wayfarer::Page ...>
  end
end
#fetch respects Wayfarer.config.network.http_headers for all provided user agents.
HTTP request headers
You can set HTTP request headers for the built-in user agents other than Selenium:
Wayfarer.config[:network][:http_headers] = { "User-Agent" => "MyCrawler" }
Selenium does not support configuring HTTP request headers.
Connection pooling
Since user agents are expensive to create, especially in the case of browser processes, Wayfarer keeps user agents within a connection pool. When a job performs and needs to retrieve the page for its task URL, an agent is checked out from the pool, and checked back in when the routed action method returns.
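The check-out and check-in cycle described above can be sketched in plain Ruby. This is a minimal illustration and not Wayfarer's actual pool: `FakeAgent` and `TinyPool` are hypothetical names, and a standard library `Queue` stands in for the pool's bookkeeping.

```ruby
# Hypothetical stand-in for an expensive user agent such as a browser process.
FakeAgent = Struct.new(:id)

# Minimal fixed-size pool: agents are created once and handed out one at a time.
class TinyPool
  def initialize(size)
    @agents = Queue.new
    size.times { |i| @agents << FakeAgent.new(i) }
  end

  # Checks an agent out, yields it, and always checks it back in.
  def with
    agent = @agents.pop # blocks while every agent is checked out
    yield agent
  ensure
    @agents << agent if agent
  end

  # Number of agents currently checked in.
  def available
    @agents.size
  end
end

pool = TinyPool.new(2)
seen = Queue.new

# Four "jobs" on four threads share the two pooled agents.
threads = 4.times.map do
  Thread.new do
    pool.with do |agent|
      seen << agent.id
      sleep 0.01 # pretend to retrieve a page
    end
  end
end
threads.each(&:join)

used_ids = Array.new(seen.size) { seen.pop }
```

All four jobs complete although only two agents were ever created, and both agents are back in the pool afterwards. The `ensure` clause is the part that matters: an agent is returned even if the job raises.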
The pool size is constant and it should equal the number of threads the underlying message queue operates with. For example, if you use Sidekiq, you should set the pool size to the number of Sidekiq threads:
Wayfarer.config[:network][:pool][:size] = Sidekiq.options[:concurrency]
The connection pool size is 1 by default
Since there is no reliable way to detect the number of threads that the underlying message queue operates with, Wayfarer defaults to a pool size of 1, which creates a bottleneck in a concurrent environment.
Browser sessions are shared across jobs
The same browser session is used across jobs. This means that the browser is not closed between jobs, and that the browser's state carries over from job to job. If your jobs depend on a clean session, reset the browser's state as needed, for example in callbacks.
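To make the carry-over concrete, here is a small sketch that uses no Wayfarer or browser APIs. `FakeBrowser`, `reset!` and `run_job` are hypothetical names chosen for illustration; the `reset_first` flag plays the role that a callback would play in a real job.

```ruby
# Hypothetical stand-in for a pooled browser session; not a Wayfarer class.
class FakeBrowser
  attr_reader :cookies

  def initialize
    @cookies = {}
  end

  # Pretend that every visited URL leaves a session cookie behind.
  def visit(url)
    @cookies[url] = "session"
  end

  def reset!
    @cookies.clear
  end
end

browser = FakeBrowser.new # one session, reused by every job below

# Runs a "job" against the shared browser and returns the state it inherited.
run_job = lambda do |url, reset_first:|
  browser.reset! if reset_first
  leftover = browser.cookies.keys # state left behind by earlier jobs
  browser.visit(url)
  leftover
end

first  = run_job.call("https://example.com/a", reset_first: false)
second = run_job.call("https://example.com/b", reset_first: false)
third  = run_job.call("https://example.com/c", reset_first: true)
```

Here `first` is empty, `second` contains the URL visited by the first job, and `third` is empty again because the state was cleared before the job ran. With a real browser, the reset step would call whatever the driver offers for clearing cookies or storage.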
UserAgentTimeoutError: avoiding pool contention
If you encounter UserAgentTimeoutError exceptions, a job has waited too long
for a user agent to become available. By default, this timeout is 10
seconds. The error is a sign that the pool size is too small for the message
queue's concurrency.
#<Wayfarer::UserAgentTimeoutError: Waited 10 sec, 0/1 available>
You can configure the timeout, although you will likely want to increase the pool size instead:
Wayfarer.config[:network][:pool][:timeout] = 10 # seconds
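How such a timeout arises can be reproduced with a few lines of plain Ruby. The sketch below is an assumption-laden illustration, not Wayfarer's implementation: `TimedPool` and `CheckoutTimeout` are hypothetical names, and the timeout is shortened to 0.1 seconds so the example finishes quickly.

```ruby
# Hypothetical error class; stands in for the timeout error named in the text.
class CheckoutTimeout < StandardError; end

# Minimal pool whose checkout gives up after `timeout` seconds.
class TimedPool
  def initialize(size:, timeout:)
    @size = size
    @timeout = timeout
    @idle = Array.new(size) { |i| "agent-#{i}" }
    @lock = Mutex.new
    @returned = ConditionVariable.new
  end

  def checkout
    deadline = now + @timeout
    @lock.synchronize do
      while @idle.empty?
        remaining = deadline - now
        if remaining <= 0
          raise CheckoutTimeout, "Waited #{@timeout} sec, #{@idle.size}/#{@size} available"
        end
        @returned.wait(@lock, remaining) # sleep until a check-in or the deadline
      end
      @idle.pop
    end
  end

  def checkin(agent)
    @lock.synchronize do
      @idle.push(agent)
      @returned.signal
    end
  end

  private

  def now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

pool = TimedPool.new(size: 1, timeout: 0.1)
ready = Queue.new

# The first "job" holds the only agent for longer than the timeout.
slow_job = Thread.new do
  agent = pool.checkout
  ready << true
  sleep 0.3
  pool.checkin(agent)
end
ready.pop # wait until the slow job really holds the agent

# The second "job" runs concurrently and cannot get an agent in time.
outcome =
  begin
    pool.checkout
  rescue CheckoutTimeout => e
    e.message
  end

slow_job.join
```

With a pool of one and two concurrent jobs, `outcome` holds the timeout message. Raising the pool size to two lets the second checkout succeed immediately, which is why increasing the pool size is usually the better fix than a longer timeout.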