v29.1: Data Pipeline Hardening
Date: June 07, 2026
Task: t_20aca7a9
Workspace: /home/avalonas/.hermes/GOURMET (dir)
I. EXECUTIVE SUMMARY
The GOURMET data pipeline was audited end-to-end and hardened. The core finding
(confirming v27’s note) is that GOURMET had no working live data ingestion —
Firecrawl was broken, yfinance was uninstalled, and the Living Oracle’s
“external feed” was a deterministic sin() proxy, not real data. The pipeline
ran entirely on deterministic V6 engine math with no external grounding.
v29.1 delivers a keyless, fallback-chained resilient data fetcher that pulls real market, news, and macro data with ≥2 independent sources per data type, plus an entity freshness validator and an offline data-quality dashboard.
Deliverables
| # | Deliverable | Path | Status |
|---|---|---|---|
| 1 | Data source audit report | GourmetVault/v29.1/reports/v29_1_data_pipeline_hardening.md (this) + predictions/data_source_audit.json | DONE |
| 2 | Resilient data fetcher | GourmetVault/v29.1/scripts/resilient_fetcher.py | DONE — live-tested |
| 3 | Source uptime probe | predictions/source_uptime.json (via --selftest) | DONE |
| 4 | Entity freshness report | scripts/entity_freshness.py → predictions/entity_freshness_report.json + reports/entity_freshness_report.md | DONE — 107 entities |
| 5 | Data quality dashboard (offline HTML) | reports/data_quality_dashboard.html | DONE |
II. DATA SOURCE AUDIT
Every data source GOURMET has depended on, classified:
| Source | Data type | Status | Notes |
|---|---|---|---|
| Firecrawl API (cloud) | web scrape | BROKEN | FIRECRAWL_API_KEY stale/removed from researcher .env; web.backend=firecrawl returns “Invalid token”. |
| Firecrawl (self-hosted monorepo) | web scrape | OPTIONAL | Monorepo present at GOURMET/firecrawl/ but not running. Needs Docker. |
| yfinance | market | BROKEN | Not installed; v22 daily report failed VIX fetch. |
| Yahoo chart API (query1) | market | ACTIVE | Keyless v8 chart endpoint. Primary. |
| Yahoo chart API (query2) | market | ACTIVE | Independent host. Secondary. |
| Stooq CSV | market | BROKEN | Returns HTML / rate-limits this host. Tertiary only. |
| Google News RSS | news | ACTIVE | Keyless RSS search. Primary. |
| CNBC RSS | news | ACTIVE | Keyless top-news RSS, client-filtered. Secondary. |
| US Treasury par-yield XML | macro | ACTIVE | Keyless data.treasury.gov Atom feed (UST10Y). Primary. |
| FRED CSV (DGS10) | macro | ACTIVE (slow) | Keyless fredgraph.csv. Functional but 13–25s latency on this host. Secondary. |
| Living Oracle external feed | fusion | REPLACED | living_oracle_v24.py compute_external_score() used a deterministic sin() proxy. Now feedable by the resilient fetcher. |
| Entity Oracle signals (vault JSON) | entity | ACTIVE | 107 unique entities across v21–v26 files, all refreshed 2026-06-05/06. |
Summary: 7 active (including FRED, which is slow) · 3 broken · 1 replaced · 1 optional (machine-readable in
predictions/data_source_audit.json).
Key infrastructure facts (verified this run)
- researcher/.env contains no FIRECRAWL_API_KEY (only Matrix + OpenRouter keys).
- config.yaml still has web: backend: firecrawl, which is broken for keyless callers.
- Installed libs: requests, pandas, numpy. Missing: yfinance, feedparser, firecrawl.
- Therefore the fetcher uses only requests + stdlib (csv, xml.etree), with no new deps.
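As an illustration of the stdlib-only approach, the following minimal sketch parses a FRED-style CSV body with the csv module and returns the most recent numeric observation. The function name, the column layout (date in the first column, values in a DGS10 column), the "." placeholder for missing observations, and the sample rows are assumptions made for illustration; this is not the code in resilient_fetcher.py.

```python
import csv
import io
from typing import Optional, Tuple


def latest_observation(csv_text: str, value_column: str = "DGS10") -> Optional[Tuple[str, float]]:
    """Return (date, value) for the last row with a numeric value, or None."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if not reader.fieldnames or value_column not in reader.fieldnames:
        return None  # e.g. an HTML error page came back instead of CSV
    date_column = reader.fieldnames[0]  # assumption: first column holds the date
    latest = None
    for row in reader:
        raw = (row.get(value_column) or "").strip()
        try:
            latest = (row[date_column], float(raw))
        except ValueError:
            continue  # skip placeholders such as "." for missing observations
    return latest


# Made-up sample rows, for illustration only.
sample = "DATE,DGS10\n2026-06-03,4.52\n2026-06-04,.\n2026-06-05,4.55\n"
print(latest_observation(sample))
```

Returning None for a body that lacks the expected column lets a caller treat an HTML or rate-limit response as a failed source and move on to the next one.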
III. RESILIENT DATA FETCHER
File: GourmetVault/v29.1/scripts/resilient_fetcher.py
Design
- No single point of failure: every data type has ≥2 independent live sources.
- No required API keys: all primary paths are keyless.
- Graceful degradation: live → alternate live → cache → deterministic (market/macro) / empty (news).
- Provenance on every fetch: source, status (live|cache|deterministic|empty|error), fetched_at.
Fallback chains
| Data type | Sources (in order) | Last-resort |
|---|---|---|
| market | yahoo_q1 → yahoo_q2 → stooq | deterministic proxy (flagged synthetic) |
| news | google_news → cnbc_news | empty (flagged) |
| macro | treasury → fred | deterministic proxy (flagged synthetic) |
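The chains above can be sketched as a single loop over ordered sources followed by the cache and the last resort. This is a minimal sketch under stated assumptions, not the actual implementation: fetch_with_fallback, the value field, and the "cache", "deterministic_proxy", and "none" source labels are hypothetical names, and the error status is omitted for brevity.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List, Optional, Tuple

Source = Tuple[str, Callable[[], Any]]  # (source name, zero-argument fetch function)


def fetch_with_fallback(
    sources: List[Source],
    cache_lookup: Callable[[], Optional[Any]],
    last_resort: Optional[Callable[[], Any]] = None,
) -> Dict[str, Any]:
    """Try each live source in order, then the cache, then the last resort."""

    def stamp(source: str, status: str, value: Any) -> Dict[str, Any]:
        # Provenance fields from the design list, plus a hypothetical "value" field.
        return {"source": source, "status": status, "value": value,
                "fetched_at": datetime.now(timezone.utc).isoformat()}

    for name, fetch in sources:
        try:
            value = fetch()
        except Exception:
            continue  # any failure moves on to the next source
        if value is not None:
            return stamp(name, "live", value)

    cached = cache_lookup()
    if cached is not None:
        return stamp("cache", "cache", cached)
    if last_resort is not None:
        return stamp("deterministic_proxy", "deterministic", last_resort())
    return stamp("none", "empty", None)


def broken():
    raise RuntimeError("simulated outage")


result = fetch_with_fallback(
    [("yahoo_q1", broken), ("yahoo_q2", lambda: 21.51)],
    cache_lookup=lambda: None,
)
print(result["source"], result["status"])  # prints: yahoo_q2 live
```

Passing last_resort=None models the news chain, which ends in a flagged empty result, while the market and macro chains would pass a deterministic proxy function.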
Live test results (2026-06-07)
market ^VIX -> close 21.51 (2026-06-05) source=yahoo_q1 status=live
news fed -> 10 headlines source=google_news status=live
macro -> UST10Y 4.55% source=treasury status=live
Usage
python3 resilient_fetcher.py --type market --symbol "^VIX"
python3 resilient_fetcher.py --type news --query "federal reserve"
python3 resilient_fetcher.py --type macro
python3 resilient_fetcher.py --selftest # probe all sources, write uptime report
Each successful live fetch is cached to predictions/fetch_cache/, so a later outage
serves the last-known-good value with status=cache instead of failing.
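One way to implement a last-known-good cache of this kind is a JSON file per key. The sketch below is an assumption about how such a cache could look; the function names, the record fields (value, source, cached_at), and the file-naming rule are hypothetical and may differ from what resilient_fetcher.py writes.

```python
import json
import time
from pathlib import Path
from typing import Any, Optional


def cache_path(cache_dir: Path, key: str) -> Path:
    # Replace characters such as "^" or ":" so the key is a safe file name.
    safe = "".join(c if c.isalnum() or c in "-_" else "_" for c in key)
    return cache_dir / f"{safe}.json"


def save_good_value(cache_dir: Path, key: str, value: Any, source: str) -> None:
    """Store the latest successful live value for this key."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    record = {"value": value, "source": source, "cached_at": time.time()}
    cache_path(cache_dir, key).write_text(json.dumps(record), encoding="utf-8")


def load_last_good(cache_dir: Path, key: str) -> Optional[dict]:
    """Return the cached record flagged status=cache, or None if unavailable."""
    try:
        record = json.loads(cache_path(cache_dir, key).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None  # no cache entry yet, or the file is unreadable
    record["status"] = "cache"
    return record
```

Keeping cached_at in the record would let a caller report how old a served value is, which matters when an outage lasts longer than the data stays meaningful.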
IV. SOURCE UPTIME
--selftest probes every configured source and writes predictions/source_uptime.json
with per-source up/down, latency, and a per-data-type resilience verdict.
Latest probe: all three data types resilient: true (market 2/3 up,
news 2/2 up, macro 1/2 up + deterministic fallback). Stooq is consistently
rate-limited from this host and FRED is slow (~13–25s) — both are non-critical
because their data types have a healthy primary.
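The per-data-type verdict can be derived by grouping probe results, as in the minimal sketch below. The probe record shape (name, data_type, up) and the rule that one reachable live source is enough to count as resilient are assumptions; the report does not state the exact rule used by --selftest.

```python
from typing import Dict, List


def resilience_verdict(probes: List[dict]) -> Dict[str, dict]:
    """Group probe results by data type and count how many sources are up."""
    summary: Dict[str, dict] = {}
    for probe in probes:
        entry = summary.setdefault(probe["data_type"], {"up": 0, "total": 0})
        entry["total"] += 1
        entry["up"] += 1 if probe["up"] else 0
    for entry in summary.values():
        # Assumed rule: one reachable live source is enough to count as resilient.
        entry["resilient"] = entry["up"] >= 1
    return summary


# Illustrative probe results mirroring the market chain (Stooq down).
probes = [
    {"name": "yahoo_q1", "data_type": "market", "up": True},
    {"name": "yahoo_q2", "data_type": "market", "up": True},
    {"name": "stooq", "data_type": "market", "up": False},
]
verdict = resilience_verdict(probes)
print(verdict["market"])  # prints: {'up': 2, 'total': 3, 'resilient': True}
```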
V. ENTITY FRESHNESS
File: scripts/entity_freshness.py → predictions/entity_freshness_report.json + reports/entity_freshness_report.md
Walks every entity_oracle_*signals*.json across the vault, builds the unique
roster (keeping the highest version each entity appears in), and derives freshness
from the source file mtime (the data has no per-entity timestamps).
Result: 107 unique entities, 0 stale (7-day threshold). The v25 file holds the 92-entity full roster (matching the task’s “92 entities”); v26 added 13 more (NVIDIA, AMD, Intel, Cisco, Oracle, energy + regional-trade entities) → 107. All source files were refreshed 2026-06-05/06, so the entire roster is fresh.
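Because the data has no per-entity timestamps, every entity inherits the age of its source file. A minimal sketch of that mtime-based check follows, assuming the glob pattern named above and the 7-day threshold; the function name and report fields are hypothetical, and the roster de-duplication step is left out.

```python
import time
from pathlib import Path
from typing import Dict, Optional

STALE_AFTER_SECONDS = 7 * 24 * 3600  # the 7-day threshold from the report


def file_freshness(vault: Path, now: Optional[float] = None) -> Dict[str, dict]:
    """Map each signals file to its age in days and a stale flag, using mtime."""
    now = time.time() if now is None else now
    report: Dict[str, dict] = {}
    for path in sorted(vault.rglob("entity_oracle_*signals*.json")):
        age = now - path.stat().st_mtime
        report[str(path.relative_to(vault))] = {
            "age_days": round(age / 86400, 2),
            "stale": age > STALE_AFTER_SECONDS,
        }
    return report
```

One caveat with this design: mtime changes whenever a file is copied or rewritten, so a file can look fresh even if its contents did not change.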
VI. DATA QUALITY DASHBOARD
File: reports/data_quality_dashboard.html — offline-capable (all CSS
inline, zero external assets / CDN / scripts; verified). Renders summary cards
(active/broken/replaced sources, entity counts), a per-data-type resilience table,
the live source-uptime table, the full source audit, and the entity freshness
table. Regenerate with python3 scripts/build_dashboard.py after a fresh
--selftest.
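The offline claim can be spot-checked mechanically. The sketch below uses the standard-library HTML parser to list tags that would run a script or load an asset; it is an illustrative check under stated assumptions, not the verification actually performed, and it does not inspect url() or @import references inside inline CSS.

```python
from html.parser import HTMLParser
from typing import List


class ExternalAssetFinder(HTMLParser):
    """Collect tags that would trigger a network request or run a script."""

    def __init__(self) -> None:
        super().__init__()
        self.findings: List[str] = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script":
            self.findings.append("script tag")
        if tag == "link" and attrs.get("href"):
            self.findings.append(f"link href={attrs['href']}")
        src = attrs.get("src") or ""
        if src and not src.startswith("data:"):
            # Inline data: URIs are allowed; anything else is a separate asset.
            self.findings.append(f"{tag} src={src}")


def find_external_assets(html_text: str) -> List[str]:
    """Return a list of findings; an empty list means nothing was flagged."""
    finder = ExternalAssetFinder()
    finder.feed(html_text)
    return finder.findings
```

Running find_external_assets over the contents of reports/data_quality_dashboard.html should return an empty list if the page is as self-contained as described.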
VII. RECOMMENDATIONS FOR v30
- Wire the fetcher into Living Oracle. Replace compute_external_score()’s sin() proxy with real VIX/news-sentiment/macro from resilient_fetcher.py. This is the single highest-value follow-up: predictions are currently ungrounded.
- Schedule a daily --selftest cron to keep source_uptime.json and the dashboard current, and alert when a data type drops below resilient: true.
- Decide on Firecrawl: either refresh the cloud key or remove web.backend: firecrawl from config.yaml so keyless callers don’t hit dead config. The fetcher makes Firecrawl non-essential.
- Add a real news-sentiment scorer (headline → polarity) so the news feed produces a numeric external signal, not just headlines.
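To make the last recommendation concrete, here is a deliberately small sketch of the headline → polarity idea using word lists. The word lists, function names, and scoring formula are hypothetical placeholders; a production scorer would need a vetted lexicon or a trained model.

```python
import re
from typing import Iterable

# Hypothetical, deliberately tiny word lists, for illustration only.
POSITIVE = {"rally", "gain", "gains", "surge", "beat", "growth", "upgrade"}
NEGATIVE = {"selloff", "loss", "losses", "plunge", "miss", "recession", "downgrade"}


def headline_polarity(headline: str) -> float:
    """Score one headline in [-1, 1]: (positive - negative) / matched words."""
    words = re.findall(r"[a-z]+", headline.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)


def feed_polarity(headlines: Iterable[str]) -> float:
    """Average polarity over a batch of headlines; 0.0 for an empty batch."""
    scores = [headline_polarity(h) for h in headlines]
    return sum(scores) / len(scores) if scores else 0.0


print(feed_polarity(["Stocks rally as earnings beat forecasts",
                     "Markets plunge on recession fears"]))  # prints: 0.0
```

An empty batch scores 0.0 here, which is indistinguishable from a balanced feed; a caller that cares about the difference should also check the fetch status for empty.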
Generated 2026-06-07 by GOURMET v29.1 Data Pipeline Hardening (t_20aca7a9).