Development

Release Procedure

  1. Bump versions:
     • Wayfarer::VERSION in lib/wayfarer.rb
     • RubyGem version in wayfarer.gemspec
  2. Run bundle install to regenerate Gemfile.lock
  3. Push to master or develop and run the manual release workflow

Conventions and guidelines

  • In source code, url refers to strings and uri refers to Addressable::URI
  • Avoid writing bash at all costs. Use Ruby instead

Design decisions and architecture

URLs are less prone to change than served markup, and SEO gives site owners an incentive to keep paths stable. Because websites commonly implement URL patterns such as REST at the application layer, URL structure tends to reflect internal domain concepts. Wayfarer's URL-based routing is designed around this.

Follow URLs verbatim as they appear in responses

Normalized URLs are useful for deduplication, but URLs should be followed as they appear in responses. Navigating to normalized versions of URLs makes crawlers stick out from other user agents.
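
The distinction can be sketched as follows. This is a minimal illustration using only the Ruby standard library, not Wayfarer's actual implementation: dedup_key is a hypothetical helper, and its normalization rules (lowercasing scheme and host, dropping the fragment, defaulting an empty path to "/") are assumptions chosen for brevity. The normalized form serves only as a lookup key, while the URL that gets followed is the one found in the response.

```ruby
require "set"
require "uri"

# Hypothetical helper: reduce a URL to a canonical key used only for
# deduplication. Ports and userinfo are ignored to keep the sketch short.
def dedup_key(url)
  uri = URI.parse(url)
  path = uri.path.empty? ? "/" : uri.path
  key = "#{uri.scheme.downcase}://#{uri.host.downcase}#{path}"
  key += "?#{uri.query}" if uri.query
  key
end

seen = Set.new
to_follow = []

[
  "HTTP://Example.com/a#top",
  "http://example.com/a",
  "http://example.com/b"
].each do |url|
  # Set#add? returns nil when the key is already present.
  # The verbatim URL is queued, never the normalized key.
  to_follow << url if seen.add?(dedup_key(url))
end
```

Here the second URL is skipped because it normalizes to the same key as the first, and the first is still queued exactly as it appeared in the response.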

Tasks are version-less and don't persist metadata

Tasks serialize to their URL and batch. No other data gets written to the message queue. There is also no need for versioning persisted tasks, since there will never be more to a task than its URL and batch. All task metadata is ephemeral.
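
As a rough sketch of what this means in practice, the following hypothetical Task struct round-trips through JSON with only its URL and batch. The struct, its method names, and the JSON payload shape are assumptions made for illustration and do not describe Wayfarer's actual classes or wire format.

```ruby
require "json"

# Hypothetical stand-in for a task: only URL and batch are ever persisted.
Task = Struct.new(:url, :batch) do
  # Ephemeral metadata lives on the in-memory object only
  # and is deliberately left out of the serialized payload.
  attr_accessor :metadata

  def serialize
    JSON.generate("url" => url, "batch" => batch)
  end

  def self.deserialize(payload)
    data = JSON.parse(payload)
    new(data.fetch("url"), data.fetch("batch"))
  end
end

task = Task.new("https://example.com/users/1", "batch-1")
task.metadata = { attempts: 2 }

payload = task.serialize
restored = Task.deserialize(payload)
```

Because the payload never grows beyond these two fields, there is no schema to migrate and therefore nothing to version. The restored task equals the original in URL and batch, while its metadata is empty.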

Why depend on Redis

Two core features depend on Redis. First, per-batch acyclicity is achieved by maintaining the set of processed URLs per batch in Redis. There is no configuration option to follow links in a cyclic manner. Second, batch completion requires updating an integer value in Redis. Batch completion is a very useful feature, since most crawls should end eventually, and often you want to know when.
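
The two pieces of state can be sketched with an in-memory stand-in. The BatchState class below is hypothetical: it uses a Ruby Set and an Integer where a real deployment would use a Redis set and a Redis counter shared by all workers, and it makes no claim about the keys or commands Wayfarer actually uses.

```ruby
require "set"

# In-memory stand-in for the two pieces of Redis state described above.
class BatchState
  def initialize
    @seen = Hash.new { |hash, batch| hash[batch] = Set.new } # batch => processed URLs
    @pending = Hash.new(0)                                   # batch => open task count
  end

  # Returns true only the first time a URL is seen within a batch,
  # comparable to checking whether a Redis SADD added a new member.
  def first_visit?(batch, url)
    !@seen[batch].add?(url).nil?
  end

  # Registers a task unless its URL was already processed in this batch.
  def enqueue(batch, url)
    return false unless first_visit?(batch, url)

    @pending[batch] += 1
    true
  end

  # Returns true when the last open task of the batch finishes.
  def complete(batch)
    @pending[batch] -= 1
    @pending[batch].zero?
  end
end

state = BatchState.new
first  = state.enqueue("batch-1", "https://example.com/")
second = state.enqueue("batch-1", "https://example.com/a")
cycle  = state.enqueue("batch-1", "https://example.com/") # link back to the start
done_after_one = state.complete("batch-1")
done_after_two = state.complete("batch-1")
```

The link back to the start page is rejected, which keeps the crawl acyclic within the batch, and the counter reaching zero signals that the batch has completed.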

No configuration files

Wayfarer can be configured through Wayfarer.config in Ruby code only, because Wayfarer.config may contain Ruby objects that don't de/serialize well, such as Procs or Sets.
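
To illustrate the serialization problem, the sketch below tries to dump a settings hash containing a lambda with Ruby's built-in Marshal. The settings hash and its key names are hypothetical and are not actual Wayfarer.config options.

```ruby
require "set"

# Hypothetical configuration values; the key names are illustrative only.
settings = {
  allowed_hosts: Set["example.com"],
  on_error: ->(error) { warn(error.message) }
}

begin
  Marshal.dump(settings)
  serializable = true
rescue TypeError
  # Procs carry a binding and cannot be dumped.
  serializable = false
end
```

Marshal refuses to dump the lambda, so a file-based format would have to either drop such values or encode them in some ad hoc way.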

Other features that are out of scope

Wayfarer won't provide:

  • persistence or any sort of DOM data mapping abstractions
  • URL generation helpers