Development
Release Procedure
- Bump versions:
  - Wayfarer::VERSION in lib/wayfarer.rb
  - RubyGem version in wayfarer.gemspec
- Run bundle install to regenerate Gemfile.lock
- Push to master or develop and run the manual release workflow
Conventions and guidelines
- In source code, url refers to strings and uri refers to Addressable::URI
- Avoid writing bash at all costs. Use Ruby instead
Design decisions and architecture
Navigate the web along URL patterns
URLs are less prone to change than served markup, and there are SEO incentives to keep their paths stable. Websites also tend to implement URL patterns such as REST at the application layer, so URL structure reflects their internal domain concepts. Wayfarer's URL-based routing is designed to take advantage of this.
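To illustrate the idea of navigating along URL patterns, here is a minimal sketch in plain Ruby. It is not Wayfarer's routing API: the ROUTES table, the route method, and the example paths are all hypothetical, and only the standard library is used.

```ruby
require "uri"

# Hypothetical route table: path patterns mapped to handler names.
ROUTES = {
  %r{\A/users/(?<id>\d+)\z} => :user,
  %r{\A/users/(?<id>\d+)/posts\z} => :user_posts
}.freeze

# Returns [handler, params] for the first pattern matching the URL's path,
# or nil if no pattern matches.
def route(url)
  path = URI.parse(url).path
  ROUTES.each do |pattern, handler|
    match = pattern.match(path)
    return [handler, match.named_captures] if match
  end
  nil
end
```

Because the patterns only look at the path, markup changes on the served pages do not affect which handler a URL is dispatched to.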
Follow URLs verbatim as they appear in responses
Normalized URLs are useful for deduplication, but URLs should be followed as they appear in responses. Navigating to normalized versions of URLs makes crawlers stick out from other user agents.
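The distinction between deduplicating on a normalized URL and following the verbatim URL can be sketched as follows. This is an illustration only, assuming a deliberately simplified normalization; dedup_key and urls_to_follow are hypothetical names, not part of Wayfarer.

```ruby
require "set"
require "uri"

# Simplified normalization, used only as a deduplication key:
# lowercases scheme and host, omits a default port, and drops the
# fragment and a trailing slash.
def dedup_key(url)
  uri = URI.parse(url)
  port = uri.port == uri.default_port ? "" : ":#{uri.port}"
  path = uri.path.chomp("/")
  query = uri.query ? "?#{uri.query}" : ""
  "#{uri.scheme.downcase}://#{uri.host.downcase}#{port}#{path}#{query}"
end

# Returns the URLs to follow exactly as they were given, skipping any
# whose normalized key has already been seen.
def urls_to_follow(urls, seen = Set.new)
  urls.select { |url| seen.add?(dedup_key(url)) }
end
```

The normalized form never leaves the process: it only decides whether a URL counts as already seen, while the request goes to the URL as it appeared in the response.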
Tasks are version-less and don't persist metadata
Tasks serialize to their URL and batch. No other data gets written to the message queue. There is also no need to version persisted tasks, since there will never be more to a task than URL and batch. All task metadata is ephemeral.
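A serialized task under this design could look like the sketch below. The Task struct, its method names, and the JSON encoding are assumptions made for illustration; the actual wire format is not specified here.

```ruby
require "json"

# Hypothetical task type: a task is only a URL and a batch,
# so those are the only fields that get serialized.
Task = Struct.new(:url, :batch) do
  # Encodes the task as a JSON object with exactly two keys.
  def serialize
    JSON.generate("url" => url, "batch" => batch)
  end

  # Rebuilds a task from its serialized form; raises KeyError if a key is missing.
  def self.deserialize(payload)
    data = JSON.parse(payload)
    new(data.fetch("url"), data.fetch("batch"))
  end
end
```

With only two fixed fields, there is no schema to evolve, which is why no version marker is needed in the payload.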
Why depend on Redis
Two core features depend on Redis. First, per-batch acyclicity is achieved by maintaining the set of processed URLs per batch in Redis. There's no configuration option to follow links in a cyclic manner. Second, batch completion requires updating an integer value in Redis. Batch completion is a very useful feature, since most crawls should end eventually, and often you want to know when.
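The two pieces of per-batch state described above can be sketched with an in-memory stand-in. This does not talk to Redis and is not Wayfarer's implementation: BatchState and its methods are hypothetical, and treating the integer as a counter of pending tasks is an assumption. In Redis, the set operation would correspond to SADD and the counter updates to INCR and DECR.

```ruby
require "set"

# In-memory stand-in for the per-batch state that would live in Redis.
class BatchState
  def initialize
    @seen = Hash.new { |hash, batch| hash[batch] = Set.new }
    @pending = Hash.new(0)
  end

  # Returns true only the first time a URL is seen within a batch
  # (comparable to SADD returning 1).
  def first_visit?(batch, url)
    !@seen[batch].add?(url).nil?
  end

  # Counts a task as enqueued (comparable to INCR).
  def enqueued(batch)
    @pending[batch] += 1
  end

  # Counts a task as finished (comparable to DECR).
  # Returns true when no tasks remain in the batch.
  def finished(batch)
    (@pending[batch] -= 1).zero?
  end
end
```

Since every crawler process has to agree on both the seen set and the counter, this state needs a shared store rather than process-local memory.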
No configuration files
Wayfarer can be configured through Wayfarer.config in Ruby code only,
because Wayfarer.config may contain Ruby objects that don't de/serialize well,
such as Procs or Sets.
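The serialization problem can be seen with plain Ruby objects. The config hash below is a hypothetical stand-in, not Wayfarer.config or its actual option names; the sketch only shows how a Proc and a Set behave under Marshal and JSON.

```ruby
require "json"
require "set"

# Hypothetical configuration values, not Wayfarer's actual options.
config = {
  allowed_hosts: Set["example.com"],
  on_error: ->(error) { warn(error.message) }
}

# Procs cannot be marshalled: Marshal.dump raises TypeError.
proc_error =
  begin
    Marshal.dump(config[:on_error])
    nil
  rescue TypeError => e
    e
  end

# A Set does not come back as a Set after a JSON round trip.
restored = JSON.parse(JSON.generate("allowed_hosts" => config[:allowed_hosts]))
```

A configuration file format would either have to forbid such values or invent an encoding for them, whereas configuring in Ruby code sidesteps the issue.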
Other features that are out of scope
Wayfarer won't provide:
- persistence or any sort of DOM data mapping abstractions
- URL generation helpers