Design Doc: Alternative Data Sources

Status: Draft
Date: 2026-04-15
Scope: src/data.py, api/, REFDATA

1. Overview

Extend the backtest pipeline with non-market-price data sources — physical and operational metrics that serve as leading/alternative indicators for equity and crypto strategies. Each provider will be integrated as a new class in src/data.py following the existing duck-typed interface (get_historical_price() → DataFrame ['t', 'v']).

┌──────────────────────────────────────────────────────────────────────┐
│                          Backtest Pipeline                           │
│                                                                      │
│  Glassnode (on-chain) ────────┐                                      │
│  FMP ─────────────────────────┤                                      │
│  Nasdaq Data Link ────────────┤                                      │
│  MarineTraffic ───────────────┼──► data.py ──► strat.py ──► perf.py  │
│  Aviationstack ───────────────┤                                      │
│  (future: sat/foot traffic) ──┘                                      │
└──────────────────────────────────────────────────────────────────────┘

2. Selection Criteria

#	Criterion	Required
C1	Non-market-price data — physical or operational metrics (vessel movements, flight counts, treasury rates, industrial production)	Yes
C2	Industry-specific indicators — tied to Maritime, Aviation, Macro, Industrial sectors	Yes
C3	10+ year historical depth — at least a decade of archive for long-term trend analysis	Yes
C4	Daily time interval — end-of-day resolution for consistent tracking	Yes
C5	Operational detail — granular enough to track physical activity (e.g. ship counts at a specific harbor)	Preferred
C6	Affordable for solo/small-team — under ~$100/mo for research-grade access	Preferred

3. Provider Evaluation

3.1 Priority 1a — Glassnode On-Chain Metrics (extend existing class)

Attribute	Detail
Data type	Crypto on-chain — SOPR, MVRV, active addresses, exchange net flows, NVT ratio, hash rate, supply in profit
Historical depth	10+ years (BTC data since 2009 genesis block)
Resolution	Daily (`24h`)
API limits	Tier-dependent; free tier covers ~20 core metrics
Cost	Free (limited) / $29/mo (Advanced) / $799/mo (Professional)
Key endpoints	`/v1/metrics/indicators/sopr`, `/v1/metrics/market/mvrv`, `/v1/metrics/addresses/active_count`, `/v1/metrics/transactions/transfers_volume_to_exchanges_sum`, `/v1/metrics/mining/hash_rate_mean`
Criteria met	C1 ✓ C2 ✓ C3 ✓ C4 ✓ C6 ✓ (free tier)

Why P1a: The Glassnode class already exists in data.py with auth, caching, and the ['t', 'v'] interface — it just needs new methods for on-chain metrics. Zero new dependencies, zero new API keys. On-chain data (SOPR, MVRV, exchange flows) is the canonical alternative dataset for crypto strategies and directly complements the price data Glassnode already provides.

Integration sketch (extend existing class):

# Add to existing Glassnode class in data.py

@lru_cache(maxsize=32)
def get_onchain_metric(self, metric_path, symbol, start_date, end_date, resolution='24h'):
    """Fetch any on-chain metric from Glassnode.

    Args:
        metric_path: API metric path (e.g. 'indicators/sopr',
                     'market/mvrv', 'addresses/active_count').
        symbol: Crypto asset (e.g. 'BTC').
        start_date: Start date (YYYY-MM-DD).
        end_date: End date (YYYY-MM-DD).
        resolution: Data interval. Defaults to '24h'.

    Returns:
        DataFrame with columns ['t', 'v'].
    """
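    # Treat dates as UTC midnight (needs `import calendar` at module top);
    # time.mktime() would apply the host's local timezone and shift the window.
    since = calendar.timegm(time.strptime(start_date, "%Y-%m-%d"))
    until = calendar.timegm(time.strptime(end_date, "%Y-%m-%d"))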
    res = requests.get(
        f"https://api.glassnode.com/v1/metrics/{metric_path}",
        params={"a": symbol, "s": since, "u": until, "i": resolution},
        headers={"X-Api-Key": self.__api_key},
        timeout=30,
    )
    res.raise_for_status()
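    # Wrap in StringIO (needs `import io`): pandas 2.1+ deprecates passing a raw JSON string.
    df = pd.read_json(io.StringIO(res.text), convert_dates=['t'])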
    logger.info("Glassnode %s: fetched %d rows for %s", metric_path, len(df), symbol)
    return df
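
Because the `s` and `u` parameters are Unix timestamps, the date conversion is worth pinning to UTC so the requested window does not depend on the host's timezone. A minimal sketch, assuming `YYYY-MM-DD` inputs are meant as UTC midnight; the helper name `to_unix_utc` is hypothetical and not part of data.py:

```python
import calendar
import time


def to_unix_utc(date_str: str) -> int:
    """Convert a YYYY-MM-DD string to a Unix timestamp at 00:00:00 UTC."""
    # strptime yields a naive struct_time; timegm interprets it as UTC,
    # whereas time.mktime would interpret it in the host's local timezone.
    return calendar.timegm(time.strptime(date_str, "%Y-%m-%d"))


since = to_unix_utc("2024-01-01")
until = to_unix_utc("2024-01-02")
print(since, until - since)
```

Consecutive dates always differ by exactly 86400 seconds under this conversion, which is not guaranteed with a local-time conversion across a daylight-saving change.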

Backtest use cases:

- SOPR < 1 as capitulation / buy signal
- MVRV Z-Score for cycle top/bottom detection
- Exchange net flow spike as sell pressure indicator
- Active address growth as network health / trend confirmation
- Hash rate drop as miner stress signal
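
As an illustration of the first use case, the sketch below turns a SOPR series into a 0/1 flag series that keeps the ['t', 'v'] shape. It is a sketch under assumptions: rows are shown as plain dicts rather than a DataFrame, the function name `sopr_capitulation_flags` is hypothetical, and the sample values are made up.

```python
from typing import Dict, List, Union

Row = Dict[str, Union[str, float]]


def sopr_capitulation_flags(rows: List[Row], threshold: float = 1.0) -> List[Row]:
    """Return one {'t', 'v'} row per input row; v is 1.0 when SOPR < threshold, else 0.0."""
    return [{"t": r["t"], "v": 1.0 if r["v"] < threshold else 0.0} for r in rows]


# Made-up SOPR values, for illustration only.
sopr = [
    {"t": "2024-01-01", "v": 0.97},
    {"t": "2024-01-02", "v": 0.95},
    {"t": "2024-01-03", "v": 1.01},
]
flags = sopr_capitulation_flags(sopr)
print(flags)
```

With a DataFrame the same flag would be a single vectorized comparison on the `v` column; the threshold of 1.0 comes from the use case above and could be made a strategy parameter.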

3.2 Priority 1b — FMP (Financial Modeling Prep)

Attribute	Detail
Data type	Macro/Corporate — Treasury rates, economic indicators, financial ratios, sector performance
Historical depth	30+ years
Resolution	Daily
API limits	750 calls/min (Premium)
Cost	$49/mo (Premium)
Key endpoints	`/api/v4/treasury`, `/api/v4/economic`, `/api/v3/ratios/{symbol}`, `/api/v3/sector-performance`
Criteria met	C1 ✓ C2 ✓ C3 ✓ C4 ✓ C6 ✓

Why P1b: Best cost-to-depth ratio, with 30+ years of macro data at $49/mo. Treasury rates and economic indicators are immediately usable as features in multi-factor strategies. The REST API returns clean JSON, which makes FMP the fastest of the new providers to integrate.

Integration sketch:

class FMP:
    """Retrieve macro/economic data from Financial Modeling Prep."""

    def __init__(self) -> None:
        load_dotenv()
        self.__api_key = os.getenv("FMP_API_KEY")
        if not self.__api_key:
            raise ValueError("FMP_API_KEY must be set in .env")

    @lru_cache(maxsize=32)
    def get_historical_price(self, symbol, start_date, end_date):
        # GET https://financialmodelingprep.com/api/v3/historical-price-full/{symbol}
        ...
        return pd.DataFrame({"t": dates, "v": values})

    @lru_cache(maxsize=32)
    def get_treasury_rate(self, start_date, end_date):
        # GET /api/v4/treasury?from={start}&to={end}
        ...
        return pd.DataFrame({"t": dates, "v": rates})

    @lru_cache(maxsize=32)
    def get_economic_indicator(self, indicator, start_date, end_date):
        # GET /api/v4/economic?name={indicator}&from={start}&to={end}
        ...
        return pd.DataFrame({"t": dates, "v": values})
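
The elided body of `get_treasury_rate()` has to map the JSON payload onto the ['t', 'v'] shape. A minimal sketch of that mapping follows. It assumes the payload is a list of records keyed by `date` with one field per tenor (for example `year10`); that layout should be checked against a real response. The function name `treasury_to_tv` is hypothetical, and the sample values are made up.

```python
from typing import Any, Dict, List


def treasury_to_tv(records: List[Dict[str, Any]], tenor: str = "year10") -> List[Dict[str, Any]]:
    """Map treasury records to [{'t': date, 'v': rate}], sorted by date ascending."""
    rows = [
        {"t": rec["date"], "v": float(rec[tenor])}
        for rec in records
        if rec.get(tenor) is not None  # skip days where the tenor is missing
    ]
    # ISO date strings sort chronologically, so a plain string sort is enough.
    return sorted(rows, key=lambda row: row["t"])


# Made-up payload in the assumed layout (newest first, one missing value).
sample = [
    {"date": "2024-01-03", "year2": 4.3, "year10": 4.0},
    {"date": "2024-01-02", "year2": 4.3, "year10": 3.9},
    {"date": "2024-01-01", "year2": 4.2, "year10": None},
]
rows = treasury_to_tv(sample)
print(rows)
```

The resulting rows can be passed to `pd.DataFrame(rows)` to obtain the DataFrame the interface expects.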

Backtest use cases:

- Treasury yield as regime filter (risk-on vs risk-off)
- Sector performance as rotation signal
- Financial ratios as value factor overlay

3.3 Priority 2 — Nasdaq Data Link (formerly Quandl)

Attribute	Detail
Data type	Industrial production, commodities, macro series
Historical depth	30–50+ years (dataset-dependent)
Resolution	Daily (most series)
API limits	Varies by dataset; 50 calls/day on free tier
Cost	Free (limited) to $50–100+/mo for premium datasets
Key datasets	`FRED/INDPRO` (Industrial Production), `CHRIS/CME_CL1` (Crude Oil), `FRED/DFF` (Fed Funds Rate)
Criteria met	C1 ✓ C2 ✓ C3 ✓ C4 ✓ C6 ✓

Why P2: Unmatched historical depth (50+ years for many FRED series). Free tier covers several core macro indicators. The nasdaqdatalink Python package provides a clean pandas interface — minimal HTTP plumbing needed.

Integration sketch:

class NasdaqDataLink:
    """Retrieve economic/industrial data from Nasdaq Data Link (Quandl)."""

    def __init__(self) -> None:
        load_dotenv()
        self.__api_key = os.getenv("NASDAQ_DATA_LINK_API_KEY")
        if not self.__api_key:
            raise ValueError("NASDAQ_DATA_LINK_API_KEY must be set in .env")

    @lru_cache(maxsize=32)
    def get_historical_price(self, dataset, start_date, end_date):
        # nasdaqdatalink.get(dataset, start_date=..., end_date=...)
        ...
        return pd.DataFrame({"t": dates, "v": values})

Backtest use cases:

- Industrial production as economic cycle indicator
- Crude oil price as inflation/energy sector proxy
- Fed Funds Rate as monetary policy regime signal

3.4 Priority 3 — MarineTraffic

Attribute	Detail
Data type	Maritime — port calls, vessel tracks, ship counts per harbor
Historical depth	15+ years (archived since 2009)
Resolution	Daily (aggregated from event-level)
API limits	Varies by contract; credit-based system
Cost	~£10–£100+/mo depending on data scope
Key endpoints	`EV01` (Vessel Historical Track), `EV03` (Port Calls), `VI06` (Voyage Info)
Criteria met	C1 ✓ C2 ✓ C3 ✓ C4 ✓ C5 ✓

Why P3: Only provider that satisfies C5 (operational detail — ship counts at a specific harbor). 15+ year history is solid. However, credit-based pricing is opaque and per-vessel queries may be expensive at scale. Requires more complex aggregation logic (port-call events → daily ship counts).

Integration sketch:

class MarineTraffic:
    """Retrieve port call and vessel data from MarineTraffic API."""

    def __init__(self) -> None:
        load_dotenv()
        self.__api_key = os.getenv("MARINETRAFFIC_API_KEY")
        if not self.__api_key:
            raise ValueError("MARINETRAFFIC_API_KEY must be set in .env")

    @lru_cache(maxsize=32)
    def get_port_calls(self, port_id, start_date, end_date):
        # GET /exportportcalls/v:6/{api_key}/portid:{port_id}/...
        # Aggregate events → daily ship count
        ...
        return pd.DataFrame({"t": dates, "v": daily_ship_counts})

    @lru_cache(maxsize=32)
    def get_historical_price(self, symbol, start_date, end_date):
        # Wrapper: symbol = port_id, v = daily ship count
        return self.get_port_calls(symbol, start_date, end_date)
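
The "aggregate events → daily ship count" step in the sketch above could look roughly like the following. It is a sketch under assumptions: the event field names `ship_id` and `timestamp_utc` are hypothetical stand-ins for whatever the port-call response actually contains, and a vessel calling twice on the same day is counted once.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def daily_ship_counts(events: Iterable[Dict[str, str]]) -> List[Dict[str, object]]:
    """Count distinct vessels with a port call on each UTC calendar day."""
    ships_by_day: Dict[str, Set[str]] = defaultdict(set)
    for event in events:
        day = event["timestamp_utc"][:10]  # 'YYYY-MM-DDTHH:MM:SS' -> 'YYYY-MM-DD'
        ships_by_day[day].add(event["ship_id"])
    return [
        {"t": day, "v": float(len(ships))}
        for day, ships in sorted(ships_by_day.items())
    ]


# Made-up events: ship A calls twice on the first day.
events = [
    {"ship_id": "A", "timestamp_utc": "2024-01-01T03:00:00"},
    {"ship_id": "A", "timestamp_utc": "2024-01-01T19:30:00"},
    {"ship_id": "B", "timestamp_utc": "2024-01-01T08:15:00"},
    {"ship_id": "A", "timestamp_utc": "2024-01-02T06:00:00"},
]
counts = daily_ship_counts(events)
print(counts)
```

Days without any port call do not appear in the output, so a caller that needs a continuous daily series would have to zero-fill the gaps.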

Backtest use cases:

- Harbor ship count as trade-volume proxy (e.g. Shanghai, Rotterdam)
- Container vessel frequency as supply-chain leading indicator
- Tanker traffic at oil terminals as crude demand signal

3.5 Priority 4 — Aviationstack

Attribute	Detail
Data type	Aviation — daily flight counts, airport traffic
Historical depth	Typically <10 years (standard plans)
Resolution	Daily
API limits	10,000+ calls/mo (paid)
Cost	$49.99+/mo
Key endpoints	`/v1/flights` (historical), `/v1/airports`
Criteria met	C1 ✓ C2 ✓ C3 ✗ C4 ✓ C6 ✓

Why P4: Fails C3 (10+ year history) on standard plans. Useful for aviation sector analysis but the shallow archive limits long-term backtesting. Similar cost to FMP but far less historical depth.

Integration sketch:

class Aviationstack:
    """Retrieve daily flight data from Aviationstack."""

    def __init__(self) -> None:
        load_dotenv()
        self.__api_key = os.getenv("AVIATIONSTACK_API_KEY")
        if not self.__api_key:
            raise ValueError("AVIATIONSTACK_API_KEY must be set in .env")

    @lru_cache(maxsize=32)
    def get_historical_price(self, airport_iata, start_date, end_date):
        # Paginate /v1/flights?dep_iata={airport}&flight_date={date}
        # Aggregate → daily flight count
        ...
        return pd.DataFrame({"t": dates, "v": daily_flight_counts})
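
The pagination step in the sketch above can be isolated from the HTTP call, which also makes it testable offline. The following is a sketch, assuming an offset/limit scheme in which each response carries a `pagination.total` field and a `data` list; that envelope should be confirmed against the provider's documentation. `iter_records` and `fake_fetch` are hypothetical names.

```python
from typing import Any, Callable, Dict, Iterator, List

Page = Dict[str, Any]


def iter_records(fetch_page: Callable[[int, int], Page], limit: int = 100) -> Iterator[Dict[str, Any]]:
    """Yield every record from an offset/limit-paginated endpoint."""
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        records: List[Dict[str, Any]] = page.get("data", [])
        yield from records
        offset += len(records)
        total = page.get("pagination", {}).get("total", 0)
        # Stop on an empty page or once every advertised record has been seen.
        if not records or offset >= total:
            break


# Fake backend with five flights, standing in for the real HTTP request.
_FAKE_FLIGHTS = [{"flight_date": "2024-01-01", "n": i} for i in range(5)]
calls: List[int] = []


def fake_fetch(offset: int, limit: int) -> Page:
    calls.append(offset)
    return {
        "pagination": {"limit": limit, "offset": offset, "total": len(_FAKE_FLIGHTS)},
        "data": _FAKE_FLIGHTS[offset:offset + limit],
    }


flights = list(iter_records(fake_fetch, limit=2))
daily_flight_count = len(flights)
print(daily_flight_count, calls)
```

In the real class, `fetch_page` would wrap the request for one airport and one `flight_date`, and the count of yielded records would become that day's `v`.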

3.6 Deferred — Satellite & Foot Traffic

Provider	Data Type	Why Deferred
Planet Labs	Satellite imagery (ship/truck counts)	Enterprise pricing, complex CV pipeline needed
Unacast	Retail foot traffic	$1,000+/mo, mobile signal data has privacy concerns
Vizion API	Container tracking	Custom pricing, narrow logistics scope
SkyFi	On-demand satellite	Pay-per-image, not suited for daily time series
GrowthFactor	Retail foot traffic	$400–5,000/mo, US-focused
OAG	Institutional aviation stats	Institutional pricing, overlaps Aviationstack

These providers are either too expensive for solo research or require non-trivial processing pipelines (computer vision for satellite imagery). Revisit when the platform generates revenue or when a specific strategy demands this data.

4. Priority Ranking & Rationale

Rank	Provider	Cost	History	Effort	Score
P1a	Glassnode (on-chain)	$0–29/mo	10+ yr	Minimal — extend existing class	★★★★★
P1b	FMP	$49/mo	30+ yr	Low — clean REST JSON	★★★★★
P2	Nasdaq Data Link	$0–100+/mo	30–50+ yr	Low — Python SDK	★★★★☆
P3	MarineTraffic	£10–100/mo	15+ yr	Medium — event aggregation	★★★☆☆
P4	Aviationstack	$50/mo	<10 yr	Medium — pagination + aggregation	★★☆☆☆
—	Satellite/Foot Traffic	$400–5K/mo	Varies	High — CV/enterprise	Deferred

Recommended order: Glassnode on-chain → FMP → Nasdaq Data Link → MarineTraffic → Aviationstack.

Rationale: Glassnode is the highest-ROI first step: the class already exists in data.py with auth and caching, and adding get_onchain_metric() requires no new dependencies or API keys. It immediately unlocks SOPR, MVRV, and exchange flow signals for crypto strategies. FMP and Nasdaq Data Link follow because they offer the best depth-to-cost ratio for macro overlays. MarineTraffic unlocks the unique harbor-tracking use case but needs event-to-daily aggregation. Aviationstack is lowest priority because its standard plans offer less than 10 years of history, which fails C3.

5. Integration Pattern

All new sources follow the existing data.py duck-typed interface:

class <Source>:
    def __init__(self) -> None:
        # Load API key from .env, raise ValueError if missing
        ...

    @lru_cache(maxsize=32)
    def get_historical_price(self, symbol, start_date, end_date) -> pd.DataFrame:
        # Returns DataFrame with columns ['t', 'v']
        # t = YYYY-MM-DD date strings
        # v = float metric values
        ...
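
One way to write the duck-typed contract down without forcing providers into a base class is a `typing.Protocol`. The sketch below is illustrative only: `HistoricalSource`, `FakeSource`, and `NotASource` are hypothetical names, and the return type is left as `Any` so the example runs without pandas.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class HistoricalSource(Protocol):
    """Structural type for any provider class in data.py."""

    def get_historical_price(self, symbol: str, start_date: str, end_date: str) -> Any:
        """Return a DataFrame with columns ['t', 'v']."""
        ...


class FakeSource:
    """Satisfies the protocol purely by having the right method."""

    def get_historical_price(self, symbol, start_date, end_date):
        return [{"t": start_date, "v": 1.0}]


class NotASource:
    """Lacks get_historical_price, so it does not satisfy the protocol."""


print(isinstance(FakeSource(), HistoricalSource))
print(isinstance(NotASource(), HistoricalSource))
```

A runtime `isinstance` check against a protocol only verifies that the method exists, not its signature or return shape, so it complements rather than replaces tests on the returned columns.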

Pipeline wiring

- REFDATA registration: Add row to REFDATA.DATA_SOURCE table (new) or extend REFDATA.ASSET_TYPE with source-specific types.
- main.py dispatch: Add source to the data-fetch dispatch logic.
- API endpoint: GET /api/v1/data/{source}/{symbol} for frontend access.
- env var: <SOURCE>_API_KEY in .env and SSM Parameter Store (/quant/{env}/<source>_api_key).
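
The dispatch and naming conventions above can be sketched as a small registry. The sketch rests on assumptions: the actual dispatch logic in main.py is not shown in this document, so `register_source`, `get_source`, `env_var_name`, `ssm_param_name`, and `StubFMP` are all hypothetical names used for illustration.

```python
from typing import Callable, Dict

# Registry of provider classes keyed by lowercase source name.
_SOURCES: Dict[str, Callable[[], object]] = {}


def register_source(name: str) -> Callable:
    """Class decorator that adds a provider class to the registry."""
    def decorator(cls):
        _SOURCES[name.lower()] = cls
        return cls
    return decorator


def get_source(name: str) -> object:
    """Instantiate the provider registered under `name` (case-insensitive)."""
    try:
        factory = _SOURCES[name.lower()]
    except KeyError:
        raise ValueError(f"Unknown data source: {name!r}") from None
    return factory()


def env_var_name(source: str) -> str:
    """<SOURCE>_API_KEY, as used in .env."""
    return f"{source.upper()}_API_KEY"


def ssm_param_name(env: str, source: str) -> str:
    """/quant/{env}/<source>_api_key, as used in SSM Parameter Store."""
    return f"/quant/{env}/{source.lower()}_api_key"


@register_source("fmp")
class StubFMP:
    """Stand-in for the real FMP class, which would load its key in __init__."""


print(type(get_source("FMP")).__name__, env_var_name("fmp"), ssm_param_name("dev", "FMP"))
```

The `{source}` path segment of the API endpoint could be passed through the same `get_source()` lookup, so that an unknown source maps cleanly to a 404.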

Rate limiting

Source	Limit	Strategy
Glassnode	Tier-dependent (free: ~200/day)	`@lru_cache` + combine metrics in one backtest run
FMP	750/min	Simple `time.sleep(0.08)` between calls
Nasdaq Data Link	50/day (free)	Cache aggressively; batch date ranges
MarineTraffic	Credit-based	Pre-aggregate port calls; minimize API hits
Aviationstack	10K/mo	One call per airport-day; local cache
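
The fixed `time.sleep(0.08)` for FMP corresponds to 60 s / 750 calls. A slightly more careful variant sleeps only for the time still owed since the previous call. The sketch below is one possible shape, not existing code: `MinIntervalThrottle` is a hypothetical name, and the clock and sleep functions are injectable so the behaviour can be tested without real delays.

```python
import time
from typing import Callable, Optional


class MinIntervalThrottle:
    """Block so that consecutive calls are at least `min_interval` seconds apart."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> None:
        """Call immediately before each API request."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self._min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# 750 calls/min -> one call every 0.08 s.
fmp_throttle = MinIntervalThrottle(60 / 750)
```

Calling `fmp_throttle.wait()` before each request keeps the client at or below the limit while adding no delay when calls are already spaced out, for example when most results come from the cache.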

6. REFDATA Changes

New rows needed when each source is implemented:

-- REFDATA.DATA_SOURCE (new table or extend ASSET_TYPE)
-- FMP
INSERT INTO REFDATA.DATA_SOURCE (SOURCE_NM, DISPLAY_NAME, BASE_URL, AUTH_TYPE)
VALUES ('FMP', 'Financial Modeling Prep', 'https://financialmodelingprep.com/api', 'API_KEY');

-- Nasdaq Data Link
INSERT INTO REFDATA.DATA_SOURCE (SOURCE_NM, DISPLAY_NAME, BASE_URL, AUTH_TYPE)
VALUES ('NASDAQ_DATA_LINK', 'Nasdaq Data Link', 'https://data.nasdaq.com/api/v3', 'API_KEY');

7. Open Questions

#	Question	Impact
Q1	Should alternative data metrics use the same `['t', 'v']` interface or extend to multi-column?	Interface design — affects all downstream modules
Q2	Store fetched alternative data in PostgreSQL for caching, or rely on `@lru_cache` only?	Cost control — reduces API calls on repeated backtests
Q3	How to handle mixed frequencies (some series skip weekends/holidays)?	Data alignment with market price series
Q4	Should MarineTraffic aggregation (events → daily counts) happen at fetch time or in `strat.py`?	Separation of concerns
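
For Q3, one candidate answer is to reindex each alternative series onto the dates of the market price series and carry the last known value forward. The sketch below shows that approach on plain rows. It is not a decision: `align_ffill` is a hypothetical name, and the sample values are made up.

```python
from typing import Dict, List, Optional


def align_ffill(rows: List[Dict[str, object]], target_dates: List[str]) -> List[Dict[str, object]]:
    """Reindex {'t', 'v'} rows onto target_dates, forward-filling gaps.

    Dates before the first observation get v=None.
    """
    ordered = sorted(rows, key=lambda r: r["t"])  # ISO dates sort chronologically
    aligned: List[Dict[str, object]] = []
    last: Optional[object] = None
    i = 0
    for day in sorted(target_dates):
        # Consume every observation dated on or before this target day.
        while i < len(ordered) and ordered[i]["t"] <= day:
            last = ordered[i]["v"]
            i += 1
        aligned.append({"t": day, "v": last})
    return aligned


# Made-up weekday-only series aligned to a 7-day-a-week crypto calendar.
series = [{"t": "2024-01-05", "v": 1.0}, {"t": "2024-01-08", "v": 2.0}]
dates = ["2024-01-04", "2024-01-05", "2024-01-06", "2024-01-07", "2024-01-08"]
aligned = align_ffill(series, dates)
print([row["v"] for row in aligned])
```

Forward-filling uses only values dated on or before each target day, so it does not introduce look-ahead from the fill itself; publication lag of the underlying series is a separate concern.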

8. Next Steps

1. Extend Glassnode class: add get_onchain_metric() to the existing class in data.py.
2. Add unit tests for on-chain metric fetch with mocked API responses.
3. Build proof-of-concept strategy, e.g. BTC price + SOPR capitulation signal or MVRV cycle filter.
4. Sign up for FMP API key and verify endpoint responses against docs.
5. Implement FMP class in data.py with treasury rate + economic indicator methods.
6. Add FMP unit tests with mocked API responses.
7. Build macro overlay strategy, e.g. BTC + Treasury yield regime filter.
8. Repeat for Nasdaq Data Link (P2) once FMP is validated.
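
The "mocked API responses" in the test steps above could take roughly the following form. This is a sketch under assumptions: `fetch_metric` is a hypothetical stand-in that takes a requests-like `session` object, whereas the real Glassnode method calls `requests.get` directly and would more likely be tested with `unittest.mock.patch`. The payload value is made up.

```python
import unittest
from unittest.mock import Mock


def fetch_metric(session, metric_path, symbol):
    """Hypothetical fetch logic; `session` needs a requests-like .get()."""
    res = session.get(
        f"https://api.glassnode.com/v1/metrics/{metric_path}",
        params={"a": symbol, "i": "24h"},
        timeout=30,
    )
    res.raise_for_status()
    return [{"t": row["t"], "v": row["v"]} for row in res.json()]


class FetchMetricTest(unittest.TestCase):
    def test_returns_tv_rows(self):
        session = Mock()
        # Made-up payload in the ['t', 'v'] shape.
        session.get.return_value.json.return_value = [{"t": 1704067200, "v": 0.99}]

        rows = fetch_metric(session, "indicators/sopr", "BTC")

        self.assertEqual(rows, [{"t": 1704067200, "v": 0.99}])
        session.get.assert_called_once()
        self.assertEqual(session.get.call_args.kwargs["params"]["a"], "BTC")

    def test_http_error_propagates(self):
        session = Mock()
        session.get.return_value.raise_for_status.side_effect = RuntimeError("HTTP 429")
        with self.assertRaises(RuntimeError):
            fetch_metric(session, "indicators/sopr", "BTC")
```

The tests run with `python -m unittest` and never touch the network, so they consume no API quota.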