Tasks
Tasks are the immutable units of work read from a message queue and processed by jobs. A task consists of two strings:
- The URL to process
- The batch the task belongs to
A job processing a task commonly appends more tasks to the queue in turn.
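As a rough illustration of the shape described above, a task can be modelled as a frozen pair of strings. This is a minimal sketch, not Wayfarer's own class: the `Task` struct and the array standing in for the message queue are assumptions made for this example.

```ruby
# Hypothetical stand-in for a task: two strings, frozen after creation.
Task = Struct.new(:url, :batch, keyword_init: true) do
  def initialize(url:, batch:)
    super(url: url.dup.freeze, batch: batch.dup.freeze)
    freeze
  end
end

# A plain array stands in for the message queue.
queue = [Task.new(url: "https://example.com/", batch: "batch-1")]

task = queue.shift
# A job that discovers a link appends a new task to the same batch.
queue << Task.new(url: "https://example.com/about", batch: task.batch)
```

Because the struct is frozen, calling `task.url = "..."` raises a `FrozenError` instead of changing the task.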
Task URLs are not normalized
The URL returned by task.url is not normalized: it is returned verbatim,
exactly as it was staged or enqueued.
Task deduplication
Wayfarer ensures that no URL gets processed twice within a batch. It achieves this by maintaining a Redis hash keyed by normalized URLs.
For each task, Wayfarer computes a canonical representation of the URL and uses it as the lookup key.
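The per-batch bookkeeping can be sketched in memory. This is only an illustration of the idea: the `SeenKeys` class is hypothetical, and a Ruby `Set` per batch stands in for the Redis-backed structure that Wayfarer maintains.

```ruby
require "set"

# Hypothetical in-memory stand-in for the Redis-backed structure:
# one set of already-seen keys per batch.
class SeenKeys
  def initialize
    @seen = Hash.new { |hash, batch| hash[batch] = Set.new }
  end

  # Returns true the first time a key is seen within a batch, false afterwards.
  def first_visit?(batch, key)
    !@seen[batch].add?(key).nil?
  end
end

seen = SeenKeys.new
seen.first_visit?("batch-1", "https://example.com") # => true
seen.first_visit?("batch-1", "https://example.com") # => false
seen.first_visit?("batch-2", "https://example.com") # => true
```

Keys are tracked per batch, so the same URL can still be processed once in every batch it is enqueued in.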
URL normalization
Wayfarer parses URLs with Addressable and applies further normalizations. All normalizations are applied by default, and each can be disabled individually.
URL normalization is used only for deduplication, and does not affect the immutable
task.url, which always returns the verbatim URL as enqueued.
This allows you to follow the URLs exactly as parsed from response bodies.
You can configure the global normalization behaviour by setting the following
values on Wayfarer.config.normalization, all of which default to true:
- remove_www: Remove www. prefix from hostnames?
- remove_trailing_slash: Remove a trailing path slash?
- remove_fragment: Remove the URL fragment?
- order_query_parameters: Order query parameters alphabetically?
- remove_tracking_parameters: Remove tracking parameters from the URL?
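To make the effect of these options concrete, here is a minimal sketch of the five normalizations. It uses Ruby's standard `uri` library rather than Addressable, and it is not Wayfarer's implementation: the `normalize` method, the `DEFAULTS` hash and the `utm_` tracking pattern are assumptions made for this example.

```ruby
require "uri"

# Option names mirror the list above; everything else is hypothetical.
DEFAULTS = {
  remove_www: true,
  remove_trailing_slash: true,
  remove_fragment: true,
  order_query_parameters: true,
  remove_tracking_parameters: true
}.freeze

# Assumed pattern for tracking parameters (utm_source, utm_medium, ...).
TRACKING = /\Autm_/

def normalize(url, **overrides)
  opts = DEFAULTS.merge(overrides)
  uri = URI.parse(url)

  uri.host = uri.host.sub(/\Awww\./, "") if opts[:remove_www] && uri.host
  uri.path = uri.path.chomp("/") if opts[:remove_trailing_slash]
  uri.fragment = nil if opts[:remove_fragment]

  if uri.query
    params = URI.decode_www_form(uri.query)
    params.reject! { |name, _| name.match?(TRACKING) } if opts[:remove_tracking_parameters]
    params.sort! if opts[:order_query_parameters]
    uri.query = params.empty? ? nil : URI.encode_www_form(params)
  end

  uri.to_s
end

normalize("https://www.example.com/path/?b=2&utm_source=x&a=1#top")
# => "https://example.com/path?a=1&b=2"
normalize("https://www.example.com/path/", remove_www: false)
# => "https://www.example.com/path"
```

Two URLs that differ only in these details produce the same string, which is what lets them share a deduplication key.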
When a job gets deduplicated, it succeeds and causes no retries.
Setting a custom key function
You can customize how deduplication keys are computed. As a contrived example, to process only one job per hostname:
Wayfarer.config[:deduplication][:key] = ->(task) { task[:uri].hostname }
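The effect of such a key function can be sketched outside of Wayfarer. In this example the tasks are plain hashes holding a URI parsed with Ruby's standard `uri` library, and `first_per_key` is a hypothetical helper; both are assumptions made for illustration.

```ruby
require "set"
require "uri"

# Same shape as the key function above: maps a task to its deduplication key.
HOSTNAME_KEY = ->(task) { task[:uri].hostname }

# Hypothetical helper: keeps only the first task seen for each key.
def first_per_key(tasks, key)
  seen = Set.new
  tasks.select { |task| seen.add?(key.call(task)) }
end

tasks = [
  "https://example.com/a",
  "https://example.com/b",
  "https://example.org/a"
].map { |url| { uri: URI.parse(url) } }

first_per_key(tasks, HOSTNAME_KEY).map { |task| task[:uri].to_s }
# => ["https://example.com/a", "https://example.org/a"]
```

The second example.com task maps to a key that was already seen, so it is treated as a duplicate even though its path differs.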
Invalid URLs
Tasks with invalid URLs are discarded (for example ht%0atp://localhost/, which has an
encoded newline in its protocol), since there is no corrective action possible.
No exception is raised, and the job is considered successfully processed without retries.
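The discard-without-raising behaviour can be sketched as follows. This is an illustration only: it uses Ruby's standard `uri` parser, which may accept or reject different inputs than Addressable, and `parse_or_discard` and `process` are hypothetical names.

```ruby
require "uri"

# Returns the parsed URI, or nil when the URL cannot be parsed.
def parse_or_discard(url)
  URI.parse(url)
rescue URI::InvalidURIError
  nil
end

# Keeps only tasks whose URL parses; invalid ones are dropped silently.
def process(urls)
  urls.filter_map { |url| parse_or_discard(url) }
end

process(["https://example.com/", "http://exa mple.com/"]).map(&:to_s)
# => ["https://example.com/"]
```

The parse error is rescued rather than propagated, so nothing triggers a retry for a URL that could never succeed.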