Pages

A page is the immutable state of the contents behind a URL at a point in time, retrieved by a user agent. In other words, a page is an HTTP response, or the state of a remotely controlled browser.

class DummyJob < ActiveJob::Base
  include Wayfarer::Base

  route.to :index

  def index
    page # => #<Wayfarer::Page ...>

    page.url         # => "https://example.com"
    page.body        # => "<html>..."
    page.status_code # => 200
    page.headers     # => { "content-type" => ... }
    page.mime_type   # => #<MIME::Type: text/html>

    # The lazily parsed response body or `nil`, depending on the Content-Type
    page.doc # => #<Nokogiri::HTML::Document ...>

    # See: https://github.com/metainspector/metainspector
    page.meta # => #<MetaInspector::Document ...>
    # Examples:
    page.meta.links.internal
    page.meta.images.favicon
    page.meta.description
    page.meta.feeds
  end
end

HTTP headers are downcased and case-sensitive

Header names are downcased and lookups are case-sensitive, so access page.headers["content-type"] rather than page.headers["Content-Type"].
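As a minimal sketch of why the lookup key matters, the following uses a plain Ruby Hash with downcased keys. The header names and values are made up for illustration, and this is not Wayfarer's own code:

```ruby
# Hypothetical raw headers, as a server might send them.
raw_headers = { "Content-Type" => "text/html", "X-Request-ID" => "abc123" }

# Downcase every key, mirroring how page.headers is keyed.
headers = raw_headers.transform_keys(&:downcase)

headers["content-type"]       # => "text/html"
headers["Content-Type"]       # => nil, because Hash lookups are case-sensitive
headers.fetch("x-request-id") # => "abc123"
```

Using #fetch instead of #[] raises a KeyError on a wrongly cased name, which can surface such mistakes early.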

Response body parsing

Wayfarer parses the bodies of HTML, XML and JSON responses according to their MIME types:

text/html to Nokogiri::HTML::Document
text/xml or application/xml to Nokogiri::XML::Document
application/json to Hash
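To illustrate what a Hash-shaped page.doc might look like for a JSON response, here is a sketch using only Ruby's standard JSON library. The body is made up, and it is an assumption that the resulting Hash has String keys, as JSON.parse produces by default:

```ruby
require "json"

# Hypothetical JSON response body.
body = '{"title": "Example", "tags": ["a", "b"]}'

# Assumption: the parsed document behaves like the result of
# JSON.parse with default options, i.e. a Hash with String keys.
doc = JSON.parse(body)

doc["title"]      # => "Example"
doc["tags"].first # => "a"
doc[:title]       # => nil, since the keys are Strings, not Symbols
```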

Implementing a custom response body parser

You can register an object that implements a #parse method for any MIME type:

class MyJPEGParser
  def parse(body)
    # Read EXIF metadata here.
    # Return value is accessible as `page.doc`
  end
end

Wayfarer::Parsing.registry["image/jpeg"] = MyJPEGParser.new

#parse must be thread-safe, because the single registered parser instance can be called from multiple threads at once!
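One simple way to keep a parser thread-safe is to avoid mutable instance state altogether. The sketch below uses a hypothetical ByteCountParser that only touches its argument and local variables, and calls one shared instance from several threads:

```ruby
# Hypothetical parser: keeps no mutable instance state, so a single
# instance can be shared across threads.
class ByteCountParser
  def parse(body)
    # Only the argument and local values are used here.
    { bytes: body.bytesize }
  end
end

parser = ByteCountParser.new

# Call the same instance from several threads and collect the results.
threads = 4.times.map { |i| Thread.new { parser.parse("x" * i) } }
results = threads.map(&:value)

results # => [{ bytes: 0 }, { bytes: 1 }, { bytes: 2 }, { bytes: 3 }]
```

If a parser does need shared state, such as a cache, guarding it with a Mutex is one option.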

Handling responses without a Content-Type

If a response has no Content-Type header, Wayfarer falls back to application/octet-stream. A parser registered for application/octet-stream therefore also handles every response that lacks a Content-Type.
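The fallback can be pictured as a lookup with a default MIME type. This is a sketch of the idea only: the registry here is a plain Hash, and RawBytesParser and parser_for are hypothetical names, not part of Wayfarer:

```ruby
# Hypothetical parser that exposes the raw bytes of a body.
class RawBytesParser
  def parse(body)
    body.bytes
  end
end

# Stand-in for a parser registry: MIME type => object with #parse.
registry = { "application/octet-stream" => RawBytesParser.new }

# Resolve a parser from downcased headers, falling back to
# application/octet-stream when Content-Type is absent.
def parser_for(registry, headers, fallback: "application/octet-stream")
  mime_type = headers.fetch("content-type", fallback)
  registry[mime_type]
end

parser = parser_for(registry, {}) # response without a Content-Type
parser.parse("hi")                # => [104, 105]
```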

Live pages

page initially returns a snapshot of the browser state immediately after the user agent navigated to the URL. The browser state may change significantly after the page was retrieved, for example due to your own interaction, or client-side JavaScript altering the DOM or URL.

To get a page that reflects the current browser state, pass live: true:

class DummyJob < ActiveJob::Base
  include Wayfarer::Base

  route.to :index

  def index
    page # => #<Wayfarer::Page ...>

    # Fill in forms, click buttons, etc.

    # Replaces the current Page object with a newer one,
    # taking into account the DOM as currently rendered by the browser.
    # Effectful only when automating browsers, no-op when using plain
    # HTTP.
    page(live: true)

    page # => The live page returned above
  end
end

Stateless user agents ignore :live

The :live option is ignored by stateless user agents, such as the default :http user agent. Instead, stateless user agents always return the same page object.

Accessing page metadata with MetaInspector

page.meta returns a MetaInspector document that exposes the metadata of HTML pages. For example, to stage all links internal to the current hostname:

def index
  stage page.meta.links.internal
end
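To make "internal to the current hostname" concrete, the sketch below filters a list of links by host using Ruby's standard URI library. The internal_links helper and the URLs are hypothetical, and MetaInspector's own definition of an internal link may differ in details:

```ruby
require "uri"

# Hypothetical helper: keep only links whose host matches the page's host.
def internal_links(page_url, links)
  host = URI(page_url).host
  links.select { |link| URI(link).host == host }
end

links = [
  "https://example.com/about",
  "https://example.com/blog/1",
  "https://other.example.org/"
]

internal = internal_links("https://example.com", links)
internal # => ["https://example.com/about", "https://example.com/blog/1"]
```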