v29.1: Data Pipeline Hardening

Date: June 07, 2026
Task: t_20aca7a9
Workspace: /home/avalonas/.hermes/GOURMET (dir)


I. EXECUTIVE SUMMARY

The GOURMET data pipeline was audited end-to-end and hardened. The core finding (confirming v27’s note) is that GOURMET had no working live data ingestion — Firecrawl was broken, yfinance was uninstalled, and the Living Oracle’s “external feed” was a deterministic sin() proxy, not real data. The pipeline ran entirely on deterministic V6 engine math with no external grounding.

v29.1 delivers a keyless, fallback-chained resilient data fetcher that pulls real market, news, and macro data with ≥2 independent sources per data type, plus an entity freshness validator and an offline data-quality dashboard.

Deliverables

| # | Deliverable | Path | Status |
|---|---|---|---|
| 1 | Data source audit report | GourmetVault/v29.1/reports/v29_1_data_pipeline_hardening.md (this) + predictions/data_source_audit.json | DONE |
| 2 | Resilient data fetcher | GourmetVault/v29.1/scripts/resilient_fetcher.py | DONE (live-tested) |
| 3 | Source uptime probe | predictions/source_uptime.json (via --selftest) | DONE |
| 4 | Entity freshness report | scripts/entity_freshness.py → predictions/entity_freshness_report.json + reports/entity_freshness_report.md | DONE (107 entities) |
| 5 | Data quality dashboard (offline HTML) | reports/data_quality_dashboard.html | DONE |

II. DATA SOURCE AUDIT

Every data source GOURMET has depended on, classified:

| Source | Data type | Status | Notes |
|---|---|---|---|
| Firecrawl API (cloud) | web scrape | BROKEN | FIRECRAWL_API_KEY stale/removed from researcher .env; web.backend=firecrawl returns “Invalid token”. |
| Firecrawl (self-hosted monorepo) | web scrape | OPTIONAL | Monorepo present at GOURMET/firecrawl/ but not running. Needs Docker. |
| yfinance | market | BROKEN | Not installed; v22 daily report failed VIX fetch. |
| Yahoo chart API (query1) | market | ACTIVE | Keyless v8 chart endpoint. Primary. |
| Yahoo chart API (query2) | market | ACTIVE | Independent host. Secondary. |
| Stooq CSV | market | BROKEN | Returns HTML / rate-limits this host. Tertiary only. |
| Google News RSS | news | ACTIVE | Keyless RSS search. Primary. |
| CNBC RSS | news | ACTIVE | Keyless top-news RSS, client-filtered. Secondary. |
| US Treasury par-yield XML | macro | ACTIVE | Keyless data.treasury.gov Atom feed (UST10Y). Primary. |
| FRED CSV (DGS10) | macro | ACTIVE (slow) | Keyless fredgraph.csv. Functional but 13–25s latency on this host. Secondary. |
| Living Oracle external feed | fusion | REPLACED | living_oracle_v24.py compute_external_score() used a deterministic sin() proxy. Now feedable by the resilient fetcher. |
| Entity Oracle signals (vault JSON) | entity | ACTIVE | 107 unique entities across v21–v26 files, all refreshed 2026-06-05/06. |

Summary: 6 active · 3 broken · 1 replaced · 1 optional (machine-readable in predictions/data_source_audit.json).

Key infrastructure facts (verified this run)

  • researcher/.env contains no FIRECRAWL_API_KEY (only Matrix + OpenRouter keys).
  • config.yaml still has web: backend: firecrawl — broken for keyless callers.
  • Installed libs: requests, pandas, numpy. Missing: yfinance, feedparser, firecrawl.
  • Therefore the fetcher uses only requests + stdlib (csv, xml.etree) — no new deps.
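As an illustration of the requests + stdlib constraint, the sketch below parses a two-column CSV body (as a fredgraph.csv download might look) using only the csv module. The function name, the column names, and the "." missing-value marker are assumptions for illustration, not taken from resilient_fetcher.py; the HTML guard reflects the Stooq failure mode noted in the audit.

```python
import csv
import io


def latest_observation(csv_text, value_column="DGS10"):
    """Return (date, value) for the last row with a numeric value, or None.

    Hypothetical helper: assumes the first column holds the date and that
    missing observations use a non-numeric marker such as ".".
    """
    # A source that answers with an HTML page (e.g. a rate-limit notice)
    # should count as a failed fetch, not as an empty data set.
    if csv_text.lstrip().startswith("<"):
        raise ValueError("expected CSV, got HTML")

    latest = None
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = (row.get(value_column) or "").strip()
        try:
            value = float(raw)
        except ValueError:
            continue  # skip missing-value markers and blank cells
        date = next(iter(row.values()))  # first column in file order
        latest = (date, value)
    return latest


# Illustrative sample rows only, not real observations.
sample = "DATE,DGS10\n2026-01-01,4.10\n2026-01-02,4.20\n2026-01-03,.\n"
print(latest_observation(sample))
```

Raising on an HTML body lets a fallback chain treat the source as down and move on to the next one.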

III. RESILIENT DATA FETCHER

File: GourmetVault/v29.1/scripts/resilient_fetcher.py

Design

  • No single point of failure: every data type has ≥2 independent live sources.
  • No required API keys: all primary paths are keyless.
  • Graceful degradation: live → alternate live → cache → deterministic (market/macro) / empty (news).
  • Provenance on every fetch: source, status (live|cache|deterministic|empty|error), fetched_at.

Fallback chains

| Data type | Sources (in order) | Last-resort |
|---|---|---|
| market | yahoo_q1 → yahoo_q2 → stooq | deterministic proxy (flagged synthetic) |
| news | google_news → cnbc_news | empty (flagged) |
| macro | treasury → fred | deterministic proxy (flagged synthetic) |

Live test results (2026-06-07)

market ^VIX  -> close 21.51 (2026-06-05)  source=yahoo_q1  status=live
news fed     -> 10 headlines              source=google_news status=live
macro        -> UST10Y 4.55%              source=treasury    status=live

Usage

python3 resilient_fetcher.py --type market --symbol "^VIX"
python3 resilient_fetcher.py --type news --query "federal reserve"
python3 resilient_fetcher.py --type macro
python3 resilient_fetcher.py --selftest   # probe all sources, write uptime report

Per-fetch good values are cached to predictions/fetch_cache/ so a later outage serves the last-known-good value with status=cache instead of failing.
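A last-known-good cache of this kind can be sketched with one JSON file per fetch key. The helper names, the key format, and the file layout below are assumptions for illustration; the actual layout of predictions/fetch_cache/ is not specified in this report.

```python
import json
import tempfile
from pathlib import Path


def cache_path(cache_dir, key):
    """Map a fetch key such as "market:^VIX" to a filesystem-safe file name."""
    safe = "".join(c if c.isalnum() else "_" for c in key)
    return Path(cache_dir) / f"{safe}.json"


def save_good(cache_dir, key, record):
    """Persist a successful fetch; write to a temp file, then rename over the target."""
    path = cache_path(cache_dir, key)
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(record))
    tmp.replace(path)  # avoids leaving a half-written cache file behind


def load_last_good(cache_dir, key):
    """Return the cached record re-labelled status=cache, or None if absent."""
    path = cache_path(cache_dir, key)
    if not path.exists():
        return None
    record = json.loads(path.read_text())
    record["status"] = "cache"
    return record


with tempfile.TemporaryDirectory() as demo_dir:
    save_good(demo_dir, "market:^VIX",
              {"value": 21.51, "source": "yahoo_q1", "status": "live"})
    cached = load_last_good(demo_dir, "market:^VIX")
    missing = load_last_good(demo_dir, "macro:UST10Y")
print(cached["status"], missing)
```

Re-labelling the status on read keeps the original source name in the record while making it clear to consumers that the value is not live.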


IV. SOURCE UPTIME

--selftest probes every configured source and writes predictions/source_uptime.json with per-source up/down, latency, and a per-data-type resilience verdict.

Latest probe: all three data types resilient: true (market 2/3 up, news 2/2 up, macro 1/2 up + deterministic fallback). Stooq is consistently rate-limited from this host and FRED is slow (~13–25s) — both are non-critical because their data types have a healthy primary.
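The per-data-type verdict can be sketched as a simple aggregation over probe results. The rule used here (a data type is resilient when at least one live source is up) is an assumption that happens to match the reported verdicts; the actual rule in --selftest is not stated in this report, and the probe record fields are hypothetical.

```python
def resilience_verdict(probes, min_up=1):
    """Aggregate per-source probe results into a per-data-type summary.

    probes: iterable of dicts with hypothetical keys
            "source", "data_type", and "up" (bool).
    min_up: assumed threshold of live sources required for resilient=True.
    """
    summary = {}
    for probe in probes:
        entry = summary.setdefault(probe["data_type"], {"up": 0, "total": 0})
        entry["total"] += 1
        if probe["up"]:
            entry["up"] += 1
    for entry in summary.values():
        entry["resilient"] = entry["up"] >= min_up
    return summary


# Illustrative inputs mirroring the up/total counts reported above.
probes = [
    {"source": "yahoo_q1", "data_type": "market", "up": True},
    {"source": "yahoo_q2", "data_type": "market", "up": True},
    {"source": "stooq", "data_type": "market", "up": False},
    {"source": "google_news", "data_type": "news", "up": True},
    {"source": "cnbc_news", "data_type": "news", "up": True},
    {"source": "treasury", "data_type": "macro", "up": True},
    {"source": "fred", "data_type": "macro", "up": False},
]
verdict = resilience_verdict(probes)
print(verdict["macro"])
```

Raising min_up to 2 would flag macro as not resilient in this example, which may be a useful stricter alert threshold.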


V. ENTITY FRESHNESS

File: scripts/entity_freshness.py → predictions/entity_freshness_report.json + reports/entity_freshness_report.md

Walks every entity_oracle_*signals*.json across the vault, builds the unique roster (keeping the highest version each entity appears in), and derives freshness from the source file mtime (the data has no per-entity timestamps).

Result: 107 unique entities, 0 stale (7-day threshold). The v25 file holds the 92-entity full roster (matching the task’s “92 entities”); v26 added 13 more (NVIDIA, AMD, Intel, Cisco, Oracle, energy + regional-trade entities) → 107. All source files were refreshed 2026-06-05/06, so the entire roster is fresh.
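The roster walk and mtime-based staleness check can be sketched as follows. This is not the code in entity_freshness.py: it assumes each signals file is a JSON object keyed by entity name and that the version number can be read from a "v<N>" fragment in the file name, neither of which is confirmed by this report.

```python
import json
import re
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def version_of(path):
    """Extract N from a 'v<N>' fragment in the file name (assumed convention)."""
    match = re.search(r"v(\d+)", path.name)
    return int(match.group(1)) if match else -1


def entity_freshness(vault_dir, max_age_days=7, now=None):
    """Build {entity: {version, age_days, stale}} keeping the highest version seen."""
    now = time.time() if now is None else now
    roster = {}
    for path in Path(vault_dir).rglob("entity_oracle_*signals*.json"):
        version = version_of(path)
        # Freshness comes from the file mtime, since entities carry no timestamps.
        age_days = (now - path.stat().st_mtime) / SECONDS_PER_DAY
        for entity in json.loads(path.read_text()):  # iterates the top-level keys
            known = roster.get(entity)
            if known is None or version > known["version"]:
                roster[entity] = {
                    "version": version,
                    "age_days": age_days,
                    "stale": age_days > max_age_days,
                }
    return roster
```

One consequence of the mtime approach is that touching or copying a file resets the apparent age of every entity in it, so a per-entity timestamp in the data would be a more reliable signal if one is ever added.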


VI. DATA QUALITY DASHBOARD

File: reports/data_quality_dashboard.html

Offline-capable (all CSS inline, zero external assets / CDN / scripts; verified). Renders summary cards (active/broken/replaced sources, entity counts), a per-data-type resilience table, the live source-uptime table, the full source audit, and the entity freshness table. Regenerate with python3 scripts/build_dashboard.py after a fresh --selftest.


VII. RECOMMENDATIONS FOR v30

  1. Wire the fetcher into Living Oracle. Replace compute_external_score()’s sin() proxy with real VIX/news-sentiment/macro from resilient_fetcher.py. This is the single highest-value follow-up — predictions are currently ungrounded.
  2. Schedule a daily --selftest cron to keep source_uptime.json and the dashboard current, and alert when a data type drops below resilient: true.
  3. Decide on Firecrawl: either refresh the cloud key or remove web.backend: firecrawl from config.yaml so keyless callers don’t hit dead config. The fetcher makes Firecrawl non-essential.
  4. Add a real news-sentiment scorer (headline → polarity) so the news feed produces a numeric external signal, not just headlines.
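
For recommendation 4, one possible starting point is a small lexicon-based scorer that maps each headline to a polarity in [-1, 1] and averages across the feed. Everything here is hypothetical: the word lists are placeholders, the function names are invented for illustration, and a production scorer would likely need a proper financial lexicon or model.

```python
import re

# Placeholder word lists for illustration only.
POSITIVE = {"rally", "gain", "gains", "surge", "beat", "beats", "growth", "upgrade"}
NEGATIVE = {"fall", "falls", "drop", "plunge", "miss", "cut", "cuts",
            "recession", "downgrade", "selloff"}


def headline_polarity(headline):
    """Return (pos - neg) / (pos + neg) over matched words, or 0.0 if none match."""
    words = re.findall(r"[a-z]+", headline.lower())
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def news_signal(headlines):
    """Average polarity across headlines; 0.0 for an empty feed."""
    if not headlines:
        return 0.0
    return sum(headline_polarity(h) for h in headlines) / len(headlines)


print(headline_polarity("Stocks rally as growth beats forecasts"))
print(headline_polarity("Markets plunge on recession fears"))
```

Returning 0.0 for an empty feed keeps the signal neutral when the news chain degrades to its flagged empty result.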

Generated 2026-06-07 by GOURMET v29.1 Data Pipeline Hardening (t_20aca7a9).
