Web Monitoring

Active

Website Watcher

Production web archiver with automated crawling, SHA-256 change detection, full-text search, and cryptographic verification. 26 Python modules with Docker orchestration.

Source Code
Python · Flask · SQLite · FTS5 · ArchiveBox · Docker

Pipeline

Discover → Archive → Detect → Search

How It Works

4-Stage Pipeline

1. Discover

Sitemap parsing and internal link extraction, honoring robots.txt
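A minimal sketch of the discovery stage, using only the Python standard library. The function names, the `example-crawler` user agent, and the same-host rule for "internal" links are assumptions made for illustration, not the project's actual code.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

# Namespace used by standard sitemap.xml files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def internal_links(page_url, html):
    """Return absolute links that stay on the same host as page_url (assumed rule)."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    absolute = (urljoin(page_url, href) for href in collector.hrefs)
    return sorted({url for url in absolute if urlparse(url).netloc == host})


def allowed_urls(urls, robots_txt, user_agent="example-crawler"):
    """Keep only the URLs that robots.txt permits for this user agent."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return [url for url in urls if rules.can_fetch(user_agent, url)]
```

Passing the robots.txt body in as text keeps the filter independent of the network; a real crawler would fetch it once per site and reuse the parsed rules.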

2. Archive

ArchiveBox integration with retry logic and conditional GET
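The retry and conditional GET behavior can be sketched roughly as follows. This does not call ArchiveBox; it only illustrates the HTTP side with the standard library. The function names and the exponential backoff are assumptions, and the limit of three attempts matches the retry count stated under Capabilities.

```python
import time
import urllib.error
import urllib.request


def conditional_get(url, etag=None, last_modified=None, timeout=10):
    """Fetch url and return (status, body, headers); status 304 means unchanged."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read(), dict(response.headers.items())
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib surfaces 304 Not Modified as an HTTPError
            return 304, b"", dict(err.headers.items())
        raise


def with_retries(action, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call action() until it succeeds, giving up after max_attempts tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except (urllib.error.URLError, TimeoutError):
            # HTTPError subclasses URLError, so HTTP error statuses are retried too.
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # assumed exponential backoff
```

A caller might combine the two as `with_retries(lambda: conditional_get(url, etag=stored_etag))` and skip archiving when the status is 304.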

3. Detect

SHA-256 content hashing for version comparison and diff tracking
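As an illustration of hash-based change detection, the sketch below compares a stored SHA-256 digest with the digest of freshly fetched content and builds a unified diff only when they differ. The function names and the choice to hash UTF-8 text are assumptions, not details taken from the project.

```python
import difflib
import hashlib


def content_hash(text):
    """Return the hex SHA-256 digest of a page's text content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def detect_change(previous_hash, current_text):
    """Return (changed, new_hash) by comparing against the stored hash."""
    new_hash = content_hash(current_text)
    return new_hash != previous_hash, new_hash


def diff_versions(previous_text, current_text):
    """Return a unified diff between two stored versions, one line per entry."""
    return list(
        difflib.unified_diff(
            previous_text.splitlines(),
            current_text.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )
```

Comparing digests means only the previous hash has to be loaded to decide whether a new version should be stored; the full diff is computed on demand.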

4. Search

SQLite FTS5 full-text search with faceting by site and date
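A small sketch of FTS5 search with facets, using Python's built-in `sqlite3` module (this assumes the bundled SQLite was compiled with FTS5, which is typical). The table name `pages` and its columns are hypothetical and chosen only for this example.

```python
import sqlite3

FACET_COLUMNS = ("site", "captured")  # columns allowed as facets in this sketch


def build_index(rows):
    """Create an in-memory FTS5 index from (site, captured, title, body) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE pages USING fts5("
        "site UNINDEXED, captured UNINDEXED, title, body)"
    )
    conn.executemany("INSERT INTO pages VALUES (?, ?, ?, ?)", rows)
    return conn


def search(conn, query, limit=10, offset=0):
    """Return matching pages ordered by FTS5 relevance, one page of results."""
    return conn.execute(
        "SELECT site, captured, title FROM pages "
        "WHERE pages MATCH ? ORDER BY rank LIMIT ? OFFSET ?",
        (query, limit, offset),
    ).fetchall()


def facet(conn, query, column):
    """Count matches per distinct value of a facet column (site or captured)."""
    if column not in FACET_COLUMNS:
        raise ValueError(f"unsupported facet column: {column}")
    return conn.execute(
        f"SELECT {column}, COUNT(*) FROM pages "
        f"WHERE pages MATCH ? GROUP BY {column} ORDER BY {column}",
        (query,),
    ).fetchall()
```

Marking `site` and `captured` as `UNINDEXED` stores them alongside the text without making them searchable, so they serve purely as facet values.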

Capabilities

What It Does

Automatic sitemap + internal link discovery with configurable depth
ArchiveBox page archiving with retry logic (up to 3 attempts)
SHA-256 content hashing for change detection: identical content always yields an identical hash
SQLite FTS5 full-text search with faceting and pagination
Flask web UI for site management, browsing, and search
APScheduler-driven crawls at a configurable interval (2 hours by default)
HMAC-SHA256 cryptographic signing for page version integrity
Prometheus metrics endpoint for monitoring
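To illustrate the HMAC-SHA256 signing listed above, here is a minimal sketch built on Python's `hmac` module. The signed fields (URL, content hash, capture timestamp), the newline-joined message layout, and the function names are assumptions for the example; the project's actual message format is not stated here.

```python
import hashlib
import hmac


def sign_version(secret_key, url, content_hash, captured_at):
    """Return a hex HMAC-SHA256 signature over one page version's metadata."""
    message = "\n".join((url, content_hash, captured_at)).encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


def verify_version(secret_key, url, content_hash, captured_at, signature):
    """Check a stored signature using a constant-time comparison."""
    expected = sign_version(secret_key, url, content_hash, captured_at)
    return hmac.compare_digest(expected, signature)
```

Verification fails if any signed field is altered after signing, which is what lets a stored version be checked for tampering later.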

Infrastructure

Production Stack

Docker Compose

Multi-service: crawler, Prometheus, Grafana, IPFS

Prometheus

Metrics collection with search request counters

Systemd

Timer units for production auto-restart

Crypto

HMAC-SHA256 signing + Merkle tree scaffolding
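Since the Merkle tree is described only as scaffolding, the following is a generic sketch of how a Merkle root over page-version hashes could be computed, not the project's implementation. The function name and the convention of duplicating the last node on odd-sized levels are assumptions.

```python
import hashlib


def merkle_root(leaf_hashes):
    """Compute a Merkle root from a list of hex-encoded SHA-256 leaf hashes."""
    if not leaf_hashes:
        raise ValueError("at least one leaf hash is required")
    level = [bytes.fromhex(h) for h in leaf_hashes]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # assumed convention: duplicate the last node
        # Hash each adjacent pair to form the next level up.
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```

A single root then summarizes a whole batch of versions: changing any leaf changes the root. Duplicating the last node is one common convention; other schemes promote the odd node unchanged.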

jayhemnani9910/webcrawler
26 Modules · FTS5 Search · SHA-256 Hashing