User agent API
Wayfarer retrieves the contents of URLs with user agents. It supports two types of user agents:
- stateless HTTP clients, which handle redirects explicitly
- stateful browsers, which carry state and follow redirects implicitly as they navigate to a URL
| | Stateless agents | Stateful agents |
|---|---|---|
| interactive | no | yes |
| redirect handling | explicit | implicit |
Because spawning browser processes or instantiating HTTP clients is expensive, Wayfarer keeps user agents in a pool and reuses them across jobs. Browser state therefore carries over between jobs, because each job checks out a user agent that a previous job has already used. When an exception is raised, you must handle it yourself.
Individual user agents are destroyed and recreated only on certain irrecoverable errors. For example, when a browser process crashes, it is replaced with a fresh browser process.
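The reuse described above can be pictured with a small, self-contained sketch. `AgentPool`, `FakeBrowser`, and `FakeBrowserStrategy` are hypothetical names invented for illustration and are not part of Wayfarer. The sketch only assumes a strategy object that responds to `create`, and it shows why state left behind by one job is visible to the next.

```ruby
# Hypothetical stand-in for a browser that accumulates state (not a Wayfarer class).
class FakeBrowser
  attr_reader :cookies

  def initialize
    @cookies = {}
  end
end

# Hypothetical strategy that counts how many agents it has had to create.
class FakeBrowserStrategy
  attr_reader :created

  def initialize
    @created = 0
  end

  def create
    @created += 1
    FakeBrowser.new
  end
end

# Illustrative pool: agents are created once up front and handed out repeatedly.
class AgentPool
  def initialize(strategy, size:)
    @agents = Queue.new
    size.times { @agents << strategy.create }
  end

  # Checks an agent out, yields it, and always checks it back in.
  def with
    agent = @agents.pop
    yield agent
  ensure
    @agents << agent if agent
  end
end

strategy = FakeBrowserStrategy.new
pool = AgentPool.new(strategy, size: 1)

# "Job 1" leaves a cookie behind; "job 2" checks out the same browser and sees it.
pool.with { |browser| browser.cookies["session"] = "abc" }
leftover = pool.with { |browser| browser.cookies["session"] }

puts leftover          # prints abc
puts strategy.created  # prints 1
```

If jobs must not observe each other's state, one option under these assumptions is to reset the agent (for example, clear its cookies) at the start of each job.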
Implementing the user agent interfaces
You implement both stateful and stateless agents by including the
Wayfarer::Networking::Strategy module and defining callback methods:
classDiagram
    class BaseAgent {
        <<Interface>>
        +#create()*
        +#destroy(instance)*
        +::renew_on()$
    }
    class StatefulAgent {
        <<Interface>>
        +#navigate(instance, url)*
        +#live(instance)*
    }
    class StatelessAgent {
        <<Interface>>
        +#fetch(instance, url)*
    }
    BaseAgent |>.. StatefulAgent : implements
    BaseAgent |>.. StatelessAgent : implements
Every user agent implementation must provide the #create instance callback
which returns an initialized user agent. Typically, the optional
#destroy(instance) instance callback is also implemented to free resources
of an existing user agent.
A class method, ::renew_on, can also be defined. It returns an array of
exception classes upon which an instance of the user agent is recreated
(destroy-and-create).
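The destroy-and-create behaviour can be sketched as follows. This is not Wayfarer's implementation: `AgentSlot`, `CrashyStrategy`, and `BrowserCrashed` are hypothetical names, and re-raising the error after the renewal is an assumption made for the sketch.

```ruby
# Hypothetical irrecoverable error.
class BrowserCrashed < StandardError; end

# Hypothetical strategy that records the callbacks it receives.
class CrashyStrategy
  attr_reader :log

  def self.renew_on
    [BrowserCrashed]
  end

  def initialize
    @log = []
    @serial = 0
  end

  def create
    @serial += 1
    @log << "create ##{@serial}"
    "agent ##{@serial}"
  end

  def destroy(agent)
    @log << "destroy #{agent}"
  end
end

# Illustrative holder for one agent that renews it on the listed exceptions.
class AgentSlot
  attr_reader :agent

  def initialize(strategy)
    @strategy = strategy
    @agent = strategy.create
  end

  def use
    yield @agent
  rescue *renewable_errors
    # destroy is optional, so only call it when the strategy defines it.
    @strategy.destroy(@agent) if @strategy.respond_to?(:destroy)
    @agent = @strategy.create
    raise # assumption: the caller still sees the original error
  end

  private

  # renew_on is optional too; without it, no exception triggers a renewal.
  def renewable_errors
    klass = @strategy.class
    klass.respond_to?(:renew_on) ? klass.renew_on : []
  end
end

strategy = CrashyStrategy.new
slot = AgentSlot.new(strategy)

begin
  slot.use { |_agent| raise BrowserCrashed, "renderer died" }
rescue BrowserCrashed
  # The caller handles the error; the slot already holds a fresh agent.
end

puts strategy.log.inspect
# prints ["create #1", "destroy agent #1", "create #2"]
```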
Stateless interface
In addition to the base interface, stateless user agents implement #fetch(instance, url)
which fetches pages or indicates redirects:
- #create() (required)
- #fetch(instance, url) (required): Called to retrieve a URL. Responses with a 3xx status code must indicate the redirect URL by returning redirect(url), since Wayfarer deals with redirects on your behalf to avoid redirect loops. All other status codes, including 4xx and 5xx, are considered successful and are indicated by calling success(url:, body:, status_code:, headers:).
- #destroy(instance) (optional)
- ::renew_on (optional)
The stateless interface indicates HTTP 3xx redirect responses explicitly. This is how Wayfarer provides redirect handling out of the box: it follows redirects itself, up to a configurable limit.
redirect(url) enqueues a task for the URL and stops processing the current task.
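A minimal sketch of the idea, assuming a simple in-memory task queue: every redirect ends the current task and enqueues a new one that remembers how many redirects led to it. `FollowTask`, `TooManyRedirects`, `follow`, and `max_redirects` are hypothetical names, and the `[:redirect, url]` and `[:success, page]` pairs stand in for whatever `redirect(url)` and `success(...)` actually return, which is not specified here.

```ruby
# A task remembers how many redirects were followed to reach its URL.
FollowTask = Struct.new(:url, :redirects)

class TooManyRedirects < StandardError; end

# Processes tasks until one succeeds. A redirect stops the current task and
# enqueues a new one, unless the limit has already been reached.
def follow(start_url, max_redirects:, &fetch)
  queue = [FollowTask.new(start_url, 0)]

  until queue.empty?
    task = queue.shift
    kind, payload = fetch.call(task.url)
    return payload if kind == :success

    raise TooManyRedirects, task.url if task.redirects >= max_redirects

    queue << FollowTask.new(payload, task.redirects + 1)
  end
end

# Fake site: /a redirects twice before succeeding, /x and /y form a loop.
REDIRECT_TABLE = { "/a" => "/b", "/b" => "/c", "/x" => "/y", "/y" => "/x" }.freeze

fetcher = lambda do |url|
  target = REDIRECT_TABLE[url]
  target ? [:redirect, target] : [:success, { url: url, status_code: 200 }]
end

page = follow("/a", max_redirects: 5, &fetcher)
puts page[:url] # prints /c
```

With the same fetcher, `follow("/x", max_redirects: 3, &fetcher)` raises `TooManyRedirects` instead of looping forever, which is the point of counting redirects per task.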
Pages with HTTP error status codes get routed
If an HTTP request to a URL results in an error status code (for example, 404), page retrieval is still considered successful. This allows job actions to record such responses.
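As an illustration of both rules, here is a hedged sketch of a stateless agent built on Ruby's standard `Net::HTTP`. `NetHttpAgent` and `SketchResults` are hypothetical names. The `redirect` and `success` methods below are stand-ins so that the sketch runs on its own; in a real agent they would come from including Wayfarer::Networking::Strategy, and their return values here are assumptions. Treating a 3xx response without a `Location` header as a plain success is a choice made for this sketch.

```ruby
require "net/http"
require "uri"

# Stand-ins for redirect(url) and success(...); return values are assumptions.
module SketchResults
  def redirect(url)
    [:redirect, url]
  end

  def success(url:, body:, status_code:, headers:)
    [:success, { url: url, body: body, status_code: status_code, headers: headers }]
  end
end

class NetHttpAgent
  include SketchResults

  # Net::HTTP's class-level helpers hold no state, so the class itself
  # serves as the "instance" that is handed back to fetch.
  def create
    Net::HTTP
  end

  def fetch(http, url)
    response = http.get_response(URI(url)) # does not follow redirects
    status = response.code.to_i
    location = response["location"]

    # 3xx with a Location header: report the redirect instead of following it.
    if (300..399).cover?(status) && location
      return redirect(URI.join(url, location).to_s)
    end

    # Everything else, including 4xx and 5xx, is reported as a page.
    success(url: url, body: response.body, status_code: status, headers: response.to_hash)
  end
end
```

Under these assumptions, a 404 response comes back through `success` with `status_code: 404`, so a job action can still record it.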
Stateful interface
In addition to the base interface, stateful user agents implement two additional instance callbacks:
- #create() (required)
- #navigate(instance, url) (required): Navigates the user agent to the given URL.
- #live(instance) (required): Turns the current user agent state into a page by calling success(url:, body:, status_code:, headers:).
- #destroy(instance) (optional)
- ::renew_on (optional)
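The division of labour between the two callbacks can be sketched with a fake browser. `RedirectingBrowser`, `SketchStatefulAgent`, and `retrieve` are hypothetical names, the `success` method is a stand-in for the real helper, and the call order (navigate first, then live) is an assumption drawn from the descriptions above.

```ruby
# Hypothetical browser that follows redirects on its own while navigating.
class RedirectingBrowser
  REDIRECTS = { "http://example.com/old" => "http://example.com/new" }.freeze

  attr_reader :url

  def goto(url)
    url = REDIRECTS[url] while REDIRECTS.key?(url)
    @url = url
  end

  def body
    "<h1>#{@url}</h1>"
  end
end

class SketchStatefulAgent
  def create
    RedirectingBrowser.new
  end

  # Only drives the browser; redirects happen implicitly inside goto.
  def navigate(browser, url)
    browser.goto(url)
  end

  # Snapshots whatever the browser currently shows.
  def live(browser)
    # status_code and headers are hard-coded because the fake browser has none.
    success(url: browser.url, body: browser.body, status_code: 200, headers: {})
  end

  private

  # Stand-in for the success helper; the real return value is an assumption.
  def success(**page)
    page
  end
end

# Assumed call order: navigate first, then turn the browser state into a page.
def retrieve(agent, instance, url)
  agent.navigate(instance, url)
  agent.live(instance)
end

agent = SketchStatefulAgent.new
browser = agent.create
page = retrieve(agent, browser, "http://example.com/old")
puts page[:url] # prints http://example.com/new
```

Because `live` reads the URL from the browser rather than from the task, the resulting page reflects where the browser ended up, not where it was sent.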
Example implementations
class StatelessAgent
  include Wayfarer::Networking::Strategy

  def self.renew_on # optional
    [MyIrrecoverableError]
  end

  def create # required
    MyClient.new
  end

  def fetch(client, url) # required
    response = client.get(url)
    return redirect(response.redirect_url) if response.redirect?

    success(url: url,
            body: response.body,
            status_code: response.status_code,
            headers: response.headers)
  end

  def destroy(client) # optional
    client.close
  end
end
class StatefulAgent
  include Wayfarer::Networking::Strategy

  def self.renew_on # optional
    [MyIrrecoverableError]
  end

  def create # required
    MyBrowser.new
  end

  def navigate(browser, url) # required
    browser.goto(url)
  end

  def live(browser) # required
    success(url: browser.url,
            body: browser.body,
            status_code: browser.status_code,
            headers: browser.headers)
  end

  def destroy(browser) # optional
    browser.quit
  end
end
Register and use the agent:
Wayfarer.config[:network][:agents][:my_agent] = MyAgent.new
Wayfarer.config[:network][:agent] = :my_agent