Tasks

Tasks are the immutable units of work read from a message queue and processed by jobs. A task consists of two strings:

  • The URL to process
  • The batch the task belongs to

A job processing a task commonly appends more tasks to the queue in turn.
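
As a rough illustration of the shape described above, a task can be modelled as a frozen value object with two string fields. The `SketchTask` name is hypothetical and is not Wayfarer's own class; only the `url` and `batch` fields come from the description.

```ruby
# Hypothetical stand-in for a task: two strings, frozen after creation.
SketchTask = Struct.new(:url, :batch) do
  def initialize(url, batch)
    # Copy and freeze both strings so the task cannot change afterwards.
    super(url.dup.freeze, batch.dup.freeze)
    freeze
  end
end

task = SketchTask.new("https://example.com/a?b=1", "batch-1")
task.url     # the URL exactly as given
task.frozen? # true; reassigning a field raises FrozenError
```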

Task URLs are not normalized

The URL returned by task.url is not normalized. It is returned verbatim, exactly as it was staged or enqueued.

Task deduplication

Wayfarer ensures that no URL gets processed twice within a batch. It achieves this by maintaining a Redis hash keyed by normalized URLs.

Wayfarer computes a canonical URL representation that it uses for cache lookups.
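
The idea can be sketched without Redis. The example below swaps the Redis hash for an in-memory set per batch, and the `SketchDeduplicator` class and its placeholder key function are assumptions made for illustration, not Wayfarer's implementation.

```ruby
require "set"

# In-memory stand-in for the Redis-backed store: one set of keys per batch.
class SketchDeduplicator
  def initialize(&key)
    @key = key # maps a URL string to its deduplication key
    @seen = Hash.new { |hash, batch| hash[batch] = Set.new }
  end

  # Returns true the first time a key is seen within a batch, false afterwards.
  def first_visit?(url, batch)
    !@seen[batch].add?(@key.call(url)).nil?
  end
end

# Placeholder key function for this sketch only: ignore a trailing slash.
dedup = SketchDeduplicator.new { |url| url.chomp("/") }

dedup.first_visit?("https://example.com/", "batch-1") # first time: true
dedup.first_visit?("https://example.com", "batch-1")  # same key: false
dedup.first_visit?("https://example.com/", "batch-2") # other batch: true
```

Keys are tracked per batch, so the same URL can still be processed once in each batch.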

URL normalization

Wayfarer parses URLs with Addressable and applies further normalizations. All normalizations are applied by default, and each can be disabled individually.

URL normalization is used only for deduplication, and does not affect the immutable task.url, which always returns the verbatim URL as enqueued. This allows you to follow the URLs exactly as parsed from response bodies.

You can configure the global normalization behaviour by setting the following values on Wayfarer.config.normalization, all of which default to true:

  • remove_www: Remove www. prefix from hostnames?
  • remove_trailing_slash: Remove a trailing path slash?
  • remove_fragment: Remove the URL fragment?
  • order_query_parameters: Order query parameters alphabetically?
  • remove_tracking_parameters: Remove tracking parameters from the URL?
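
To make the effect of these options concrete, here is a minimal sketch of the five normalizations. It uses Ruby's standard `uri` library rather than Addressable, the `sketch_normalize` function is hypothetical, and the choice of `utm_*` as the tracking parameters is an assumption, since the list Wayfarer removes is not given here.

```ruby
require "uri"

# Assumption: treat utm_* query parameters as tracking parameters.
TRACKING = /\Autm_/

def sketch_normalize(url, remove_www: true, remove_trailing_slash: true,
                     remove_fragment: true, order_query_parameters: true,
                     remove_tracking_parameters: true)
  uri = URI.parse(url)
  uri.host = uri.host.sub(/\Awww\./, "") if remove_www && uri.host
  uri.path = uri.path.chomp("/") if remove_trailing_slash && uri.path
  uri.fragment = nil if remove_fragment

  if uri.query
    params = URI.decode_www_form(uri.query)
    params.reject! { |name, _| name.match?(TRACKING) } if remove_tracking_parameters
    params.sort! if order_query_parameters
    uri.query = params.empty? ? nil : URI.encode_www_form(params)
  end

  uri.to_s
end

sketch_normalize("https://www.example.com/a/?b=2&utm_source=x&a=1#top")
# => "https://example.com/a?a=1&b=2"
```

Two URLs that differ only in these details map to the same string, which is what lets them share one deduplication key.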

When a job gets deduplicated, it succeeds and causes no retries.

Setting a custom key function

You can customize how deduplication keys are computed. As a contrived example, to process only one job per hostname:

Wayfarer.config[:deduplication][:key] = ->(task) { task[:uri].hostname }
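
The effect of a hostname key can be previewed outside of Wayfarer. This sketch applies an equivalent lambda to plain URL strings with the standard `uri` library; it does not use Wayfarer's task objects or configuration, and the URLs are made up.

```ruby
require "uri"

# Stand-in for the configured key function; takes a plain URL string here.
key = ->(url) { URI.parse(url).hostname }

urls = [
  "https://example.com/a",
  "https://example.com/b",
  "https://example.org/"
]

# Keep only the first URL for each key, as deduplication would.
kept = urls.uniq { |url| key.call(url) }
# kept == ["https://example.com/a", "https://example.org/"]
```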

Invalid URLs

Tasks with invalid URLs are discarded, since no corrective action is possible. An example is ht%0atp://localhost/, which has a newline in its protocol. No exception is raised, and the job is considered successfully processed without retries.
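
A check of this kind might look like the following sketch. It relies on Ruby's standard `uri` library instead of Addressable, the `sketch_valid_url?` helper is hypothetical, and restricting valid URLs to http and https with a host is an assumption of this example.

```ruby
require "uri"

# Hypothetical helper: true only for URLs that parse as http(s) with a host.
def sketch_valid_url?(url)
  uri = URI.parse(url)
  uri.is_a?(URI::HTTP) && !uri.host.nil?
rescue URI::InvalidURIError
  # Unparseable input is reported as invalid instead of raising.
  false
end

sketch_valid_url?("https://example.com/")  # true
sketch_valid_url?("ht%0atp://localhost/")  # false
```

Returning false instead of raising mirrors the behaviour described above: the invalid task is simply dropped.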