Pages
A page is the immutable state of the contents behind a URL at a point in time, retrieved by a user agent. In other words, a page is an HTTP response, or the state of a remotely controlled browser.
class DummyJob < ActiveJob::Base
  include Wayfarer::Base

  route.to :index

  def index
    page             # => #<Wayfarer::Page ...>
    page.url         # => "https://example.com"
    page.body        # => "<html>..."
    page.status_code # => 200
    page.headers     # => { "content-type" => ... }
    page.mime_type   # => #<MIME::Type: text/html>

    # The lazily parsed response body or `nil`, depending on the Content-Type
    page.doc         # => #<Nokogiri::HTML::Document ...>

    # See: https://github.com/metainspector/metainspector
    page.meta        # => #<MetaInspector::Document ...>

    # Examples:
    page.meta.links.internal
    page.meta.images.favicon
    page.meta.description
    page.meta.feeds
  end
end
HTTP headers are downcased and case-sensitive
HTTP header names are downcased and lookups are case-sensitive, so access
page.headers["content-type"] rather than page.headers["Content-Type"].
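The effect of downcasing can be sketched with a plain Ruby Hash standing in for page.headers. The helper name downcase_header_names and the sample headers are illustrative assumptions, not part of Wayfarer:

```ruby
# Hypothetical helper: downcases every header name, mimicking how
# page.headers is described above. Not Wayfarer's implementation.
def downcase_header_names(raw_headers)
  raw_headers.transform_keys { |name| name.to_s.downcase }
end

headers = downcase_header_names(
  "Content-Type" => "text/html; charset=utf-8",
  "X-Request-ID" => "abc123"
)

headers["content-type"] # => "text/html; charset=utf-8"
headers["Content-Type"] # => nil, because Hash lookups are case-sensitive
```

Because the lookup silently returns nil for a differently cased key, a mistyped header name tends to surface as a missing value rather than as an error.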
Response body parsing
Wayfarer parses the bodies of HTML, XML and JSON responses according to their MIME types:
application/html to Nokogiri::HTML::Document
text/xml or application/xml to Nokogiri::XML::Document
application/json to Hash
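For a JSON response, page.doc would therefore be a Hash. The following sketch uses Ruby's standard JSON library as a stand-in; the helper name parse_json_body is hypothetical, and whether Wayfarer keeps string keys (the JSON.parse default) is an assumption:

```ruby
require "json"

# Hypothetical helper: returns what page.doc might hold for a JSON body.
# Assumes default JSON.parse behaviour, i.e. string keys.
def parse_json_body(body)
  JSON.parse(body)
end

doc = parse_json_body('{"title": "Example", "tags": ["a", "b"]}')

doc.class    # => Hash
doc["title"] # => "Example"
doc["tags"]  # => ["a", "b"]
```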
Implementing a custom response body parser
You can register an object that implements a #parse method for any MIME type:
class MyJPEGParser
  def parse(body)
    # Read EXIF metadata here.
    # Return value is accessible as `page.doc`
  end
end
Wayfarer::Parsing.registry["image/jpeg"] = MyJPEGParser.new
#parse must be thread-safe!
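One way to satisfy the thread-safety requirement is to keep the parser free of mutable state. The sketch below is a hypothetical PNG parser (the class name and the choice of image/png are assumptions for illustration) that reads the image dimensions from the fixed-position IHDR chunk using only core Ruby:

```ruby
# Hypothetical parser for image/png: reads width and height from the
# IHDR chunk, which directly follows the 8-byte PNG signature.
class PNGDimensionsParser
  SIGNATURE = "\x89PNG\r\n\x1A\n".b.freeze

  # Uses only local variables and no instance state, so a single shared
  # instance can be called from many threads at once.
  def parse(body)
    bytes = body.b
    return nil unless bytes.bytesize >= 24 && bytes.start_with?(SIGNATURE)

    # Bytes 16..23 hold width and height as big-endian 32-bit integers.
    width, height = bytes.byteslice(16, 8).unpack("N2")
    { width: width, height: height }
  end
end

PNG_PARSER = PNGDimensionsParser.new.freeze

# Registration would mirror the JPEG example (requires Wayfarer):
# Wayfarer::Parsing.registry["image/png"] = PNG_PARSER
```

Freezing the instance is optional, but it makes accidental writes to instance variables raise immediately instead of introducing a race.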
Handling responses without a Content-Type
If a response has no Content-Type header, Wayfarer falls back to
application/octet-stream. A parser registered for
application/octet-stream will hence also handle all responses without
a Content-Type.
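The fallback can be pictured as a lookup that substitutes application/octet-stream when no Content-Type is present. This is a minimal sketch of that idea, not Wayfarer's actual code: SKETCH_REGISTRY, registry_key_for, parse_body and ByteCountParser are all hypothetical names.

```ruby
FALLBACK_MIME_TYPE = "application/octet-stream"

# Hypothetical helper: derives the registry key from downcased headers.
def registry_key_for(headers)
  content_type = headers["content-type"].to_s
  # Drop parameters such as "; charset=utf-8".
  mime_type = content_type.split(";").first.to_s.strip.downcase
  mime_type.empty? ? FALLBACK_MIME_TYPE : mime_type
end

# Stand-in parser: any object responding to #parse would do.
class ByteCountParser
  def parse(body)
    { bytes: body.bytesize }
  end
end

# Stand-in for the parser registry, with only the fallback registered.
SKETCH_REGISTRY = { FALLBACK_MIME_TYPE => ByteCountParser.new }.freeze

# Returns the parsed body, or nil when no parser is registered.
def parse_body(headers, body)
  parser = SKETCH_REGISTRY[registry_key_for(headers)]
  parser&.parse(body)
end

parse_body({}, "abc")                              # => { bytes: 3 }
parse_body({ "content-type" => "text/html" }, "x") # => nil
```

In this sketch a parser registered for the fallback type sees every response that lacks a Content-Type, so it should tolerate arbitrary bytes.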
Live pages
page initially returns a snapshot of the browser state
immediately after the user agent navigated to the URL. The browser state may
change significantly after the page was retrieved, for example due to your own
interaction, or client-side JavaScript altering the DOM or URL.
To get a page that reflects the current browser state, pass the :live
keyword argument:
class DummyJob < ActiveJob::Base
  include Wayfarer::Base
  route.to :index

  def index
    page # => #<Wayfarer::Page ...>

    # Fill in forms, click buttons, etc.

    # Replaces the current Page object with a newer one,
    # taking into account the DOM as currently rendered by the browser.
    # Effectful only when automating browsers, no-op when using plain
    # HTTP.
    page(live: true)

    page # => The live page returned above
  end
end
Stateless user agents ignore :live
The :live option is ignored by stateless user agents, such as the
default :http user agent. Instead, stateless user agents always
return the same page object.
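The difference between snapshot and live pages, and why stateless user agents can ignore :live, can be modelled with a few toy classes. Everything below (SketchPage, FakeBrowser, StatefulTask, StatelessTask) is a hypothetical stand-in written for illustration and does not reflect Wayfarer's internals:

```ruby
# Toy immutable page: a frozen snapshot of URL and body.
SketchPage = Struct.new(:url, :body)

# Toy browser whose state can change after navigation.
class FakeBrowser
  attr_accessor :url, :body

  def initialize(url, body)
    @url = url
    @body = body
  end

  # Captures the current state as an immutable page.
  def snapshot
    SketchPage.new(url.dup.freeze, body.dup.freeze).freeze
  end
end

# Models a job backed by a stateful (browser) user agent.
class StatefulTask
  def initialize(browser)
    @browser = browser
    @page = browser.snapshot # taken right after navigation
  end

  # Replaces the stored page with a fresh snapshot only when live: true.
  def page(live: false)
    @page = @browser.snapshot if live
    @page
  end
end

# Models a job backed by a stateless (plain HTTP) user agent.
class StatelessTask
  def initialize(page)
    @page = page
  end

  # There is no browser to re-read, so :live has no effect.
  def page(live: false)
    @page
  end
end

browser = FakeBrowser.new("https://example.com", "<p>before</p>")
task = StatefulTask.new(browser)
browser.body = "<p>after</p>"   # e.g. client-side JavaScript ran

task.page.body             # => "<p>before</p>"
task.page(live: true).body # => "<p>after</p>"
task.page.body             # => "<p>after</p>"
```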
Accessing page metadata with MetaInspector
page.meta returns a MetaInspector document that exposes the metadata of HTML pages. For example, to stage all links internal to the current hostname:
def index
  stage page.meta.links.internal
end
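To make "internal to the current hostname" concrete, the following sketch approximates the idea with Ruby's standard URI library. It is not how MetaInspector computes internal links; the helper name internal_links and the sample URLs are assumptions for illustration:

```ruby
require "uri"

# Hypothetical helper: resolves each href against the page URL and keeps
# only those whose host matches the page's host.
def internal_links(page_url, hrefs)
  host = URI.parse(page_url).host

  hrefs.filter_map do |href|
    absolute = URI.join(page_url, href)
    absolute.to_s if absolute.host == host
  rescue URI::InvalidURIError
    nil # skip hrefs that cannot be parsed
  end
end

internal_links(
  "https://example.com/a",
  ["/b", "https://example.com/c", "https://other.org/d"]
)
# => ["https://example.com/b", "https://example.com/c"]
```

Staging only same-host links is a common way to keep a crawl from wandering onto other sites.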