Design Doc: Alternative Data Sources¶
Status: Draft
Date: 2026-04-15
Scope: src/data.py, api/, REFDATA
1. Overview¶
Extend the backtest pipeline with non-market-price data sources — physical
and operational metrics that serve as leading/alternative indicators for equity
and crypto strategies. Each provider will be integrated as a new class in
src/data.py following the existing duck-typed interface
(get_historical_price() → DataFrame ['t', 'v']).
┌──────────────────────────────────────────────────────────────────┐
│ Backtest Pipeline │
│ │
│ Glassnode (on-chain) ──┐ │
│ FMP ──┤ │
│ Nasdaq Data Link ──┤ │
│ MarineTraffic ──┼──► data.py ──► strat.py ──► perf.py │
│ Aviationstack ──┤ │
│ (future: sat/foot traffic) ──┘ │
└──────────────────────────────────────────────────────────────────┘
2. Selection Criteria¶
| # | Criterion | Required |
|---|---|---|
| C1 | Non-market-price data — physical or operational metrics (vessel movements, flight counts, treasury rates, industrial production) | Yes |
| C2 | Industry-specific indicators — tied to Maritime, Aviation, Macro, Industrial sectors | Yes |
| C3 | 10+ year historical depth — at least a decade of archive for long-term trend analysis | Yes |
| C4 | Daily time interval — end-of-day resolution for consistent tracking | Yes |
| C5 | Operational detail — granular enough to track physical activity (e.g. ship counts at a specific harbor) | Preferred |
| C6 | Affordable for solo/small-team — under ~$100/mo for research-grade access | Preferred |
3. Provider Evaluation¶
3.1 Priority 1a — Glassnode On-Chain Metrics (extend existing class)¶
| Attribute | Detail |
|---|---|
| Data type | Crypto on-chain — SOPR, MVRV, active addresses, exchange net flows, NVT ratio, hash rate, supply in profit |
| Historical depth | 10+ years (BTC data since 2009 genesis block) |
| Resolution | Daily (24h) |
| API limits | Tier-dependent; free tier covers ~20 core metrics |
| Cost | Free (limited) / $29/mo (Advanced) / $799/mo (Professional) |
| Key endpoints | /v1/metrics/indicators/sopr, /v1/metrics/market/mvrv, /v1/metrics/addresses/active_count, /v1/metrics/transactions/transfers_volume_to_exchanges_sum, /v1/metrics/mining/hash_rate_mean |
| Criteria met | C1 ✓ C2 ✓ C3 ✓ C4 ✓ C6 ✓ (free tier) |
Why P1a: The Glassnode class already exists in data.py with auth, caching,
and the ['t', 'v'] interface — it just needs new methods for on-chain metrics.
Zero new dependencies, zero new API keys. On-chain data (SOPR, MVRV, exchange
flows) is the canonical alternative dataset for crypto strategies and directly
complements the price data Glassnode already provides.
Integration sketch (extend existing class):
# Add to existing Glassnode class in data.py
@lru_cache(maxsize=32)
def get_onchain_metric(self, metric_path, symbol, start_date, end_date, resolution="24h"):
    """Fetch any on-chain metric from Glassnode.

    Args:
        metric_path: API metric path (e.g. 'indicators/sopr',
            'market/mvrv', 'addresses/active_count').
        symbol: Crypto asset (e.g. 'BTC').
        start_date: Start date (YYYY-MM-DD).
        end_date: End date (YYYY-MM-DD).
        resolution: Data interval. Defaults to '24h'.

    Returns:
        DataFrame with columns ['t', 'v'].
    """
    since = int(time.mktime(time.strptime(start_date, "%Y-%m-%d")))
    until = int(time.mktime(time.strptime(end_date, "%Y-%m-%d")))
    res = requests.get(
        f"https://api.glassnode.com/v1/metrics/{metric_path}",
        params={"a": symbol, "s": since, "u": until, "i": resolution},
        headers={"X-Api-Key": self.__api_key},
        timeout=30,
    )
    res.raise_for_status()
    # io.StringIO avoids pandas' deprecation of passing literal JSON strings
    df = pd.read_json(io.StringIO(res.text), convert_dates=["t"])
    logger.info("Glassnode %s: fetched %d rows for %s", metric_path, len(df), symbol)
    return df
Backtest use cases:
- SOPR < 1 as capitulation / buy signal
- MVRV Z-Score for cycle top/bottom detection
- Exchange net flow spike as sell pressure indicator
- Active address growth as network health / trend confirmation
- Hash rate drop as miner stress signal
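As a concrete sketch of the first use case, a small strat.py-style helper could flag SOPR capitulation days. The function name and threshold are illustrative, not part of the existing codebase:

```python
import pandas as pd

def sopr_capitulation_signal(sopr: pd.DataFrame, threshold: float = 1.0) -> pd.DataFrame:
    """Flag days where SOPR prints below the threshold (coins moving at a loss).

    Expects the ['t', 'v'] frame produced by get_onchain_metric() and adds a
    boolean 'signal' column for a strategy to consume.
    """
    out = sopr.copy()
    out["signal"] = out["v"] < threshold
    return out

# Synthetic SOPR series: only the 0.97 print should trigger
sopr = pd.DataFrame({"t": ["2026-04-01", "2026-04-02", "2026-04-03"],
                     "v": [1.02, 0.97, 1.05]})
signals = sopr_capitulation_signal(sopr)
```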
3.2 Priority 1b — FMP (Financial Modeling Prep)¶
| Attribute | Detail |
|---|---|
| Data type | Macro/Corporate — Treasury rates, economic indicators, financial ratios, sector performance |
| Historical depth | 30+ years |
| Resolution | Daily |
| API limits | 750 calls/min (Premium) |
| Cost | $49/mo (Premium) |
| Key endpoints | /api/v3/treasury, /api/v3/economic, /api/v3/ratios/{symbol}, /api/v3/sector-performance |
| Criteria met | C1 ✓ C2 ✓ C3 ✓ C4 ✓ C6 ✓ |
Why P1b: Best cost-to-depth ratio. 30+ years of macro data at $49/mo. Treasury rates and economic indicators are immediately usable as features in multi-factor strategies. Clean REST API with JSON responses — fastest to integrate.
Integration sketch:
class FMP:
    """Retrieve macro/economic data from Financial Modeling Prep."""

    def __init__(self) -> None:
        load_dotenv()
        self.__api_key = os.getenv("FMP_API_KEY")
        if not self.__api_key:
            raise ValueError("FMP_API_KEY must be set in .env")

    @lru_cache(maxsize=32)
    def get_historical_price(self, symbol, start_date, end_date):
        # GET https://financialmodelingprep.com/api/v3/historical-price-full/{symbol}
        ...
        return pd.DataFrame({"t": dates, "v": values})

    @lru_cache(maxsize=32)
    def get_treasury_rate(self, start_date, end_date):
        # GET /api/v3/treasury?from={start}&to={end}
        ...
        return pd.DataFrame({"t": dates, "v": rates})

    @lru_cache(maxsize=32)
    def get_economic_indicator(self, indicator, start_date, end_date):
        # GET /api/v4/economic?name={indicator}&from={start}&to={end}
        ...
        return pd.DataFrame({"t": dates, "v": values})
Backtest use cases:
- Treasury yield as regime filter (risk-on vs risk-off)
- Sector performance as rotation signal
- Financial ratios as value factor overlay
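The regime-filter use case can be sketched on top of the ['t', 'v'] frame a get_treasury_rate() call would return. The function name and the 4.0% cutoff are purely illustrative:

```python
import pandas as pd

def yield_regime(treasury: pd.DataFrame, cutoff: float = 4.0) -> pd.Series:
    """Label each day 'risk_off' when the yield exceeds the cutoff, else 'risk_on'."""
    return treasury["v"].map(lambda y: "risk_off" if y > cutoff else "risk_on")

# Synthetic 10Y yields straddling the cutoff
rates = pd.DataFrame({"t": ["2026-04-01", "2026-04-02"], "v": [3.8, 4.3]})
regimes = yield_regime(rates)
```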
3.3 Priority 2 — Nasdaq Data Link (formerly Quandl)¶
| Attribute | Detail |
|---|---|
| Data type | Industrial production, commodities, macro series |
| Historical depth | 30–50+ years (dataset-dependent) |
| Resolution | Daily (most series) |
| API limits | Varies by dataset; 50 calls/day on free tier |
| Cost | Free (limited) to $50–100+/mo for premium datasets |
| Key datasets | FRED/INDPRO (Industrial Production), CHRIS/CME_CL1 (Crude Oil), FRED/DFF (Fed Funds Rate) |
| Criteria met | C1 ✓ C2 ✓ C3 ✓ C4 ✓ C6 ✓ |
Why P2: Unmatched historical depth (50+ years for many FRED series). Free
tier covers several core macro indicators. The nasdaqdatalink Python package
provides a clean pandas interface — minimal HTTP plumbing needed.
Integration sketch:
class NasdaqDataLink:
    """Retrieve economic/industrial data from Nasdaq Data Link (Quandl)."""

    def __init__(self) -> None:
        load_dotenv()
        self.__api_key = os.getenv("NASDAQ_DATA_LINK_API_KEY")
        if not self.__api_key:
            raise ValueError("NASDAQ_DATA_LINK_API_KEY must be set in .env")

    @lru_cache(maxsize=32)
    def get_historical_price(self, dataset, start_date, end_date):
        # nasdaqdatalink.get(dataset, start_date=..., end_date=...)
        ...
        return pd.DataFrame({"t": dates, "v": values})
Backtest use cases:
- Industrial production as economic cycle indicator
- Crude oil price as inflation/energy sector proxy
- Fed Funds Rate as monetary policy regime signal
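The SDK returns date-indexed frames rather than the pipeline's ['t', 'v'] shape, so the class will likely need a small adapter. A sketch, assuming a single-column result has already been reduced to a pandas Series:

```python
import pandas as pd

def to_tv_frame(series: pd.Series) -> pd.DataFrame:
    """Adapt a date-indexed series (the shape a nasdaqdatalink.get result
    yields after selecting one column) into the pipeline's ['t', 'v'] frame."""
    return pd.DataFrame({
        "t": series.index.strftime("%Y-%m-%d"),  # YYYY-MM-DD strings per the interface
        "v": series.astype(float).to_numpy(),
    })

# Synthetic stand-in for a FRED/INDPRO slice
idx = pd.date_range("2026-04-01", periods=3, freq="D")
indpro = pd.Series([102.1, 102.4, 102.0], index=idx)
df = to_tv_frame(indpro)
```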
3.4 Priority 3 — MarineTraffic¶
| Attribute | Detail |
|---|---|
| Data type | Maritime — port calls, vessel tracks, ship counts per harbor |
| Historical depth | 15+ years (archived since 2009) |
| Resolution | Daily (aggregated from event-level) |
| API limits | Varies by contract; credit-based system |
| Cost | ~£10–£100+/mo depending on data scope |
| Key endpoints | EV01 (Vessel Historical Track), EV03 (Port Calls), VI06 (Voyage Info) |
| Criteria met | C1 ✓ C2 ✓ C3 ✓ C4 ✓ C5 ✓ |
Why P3: Only provider that satisfies C5 (operational detail — ship counts at a specific harbor). 15+ year history is solid. However, credit-based pricing is opaque and per-vessel queries may be expensive at scale. Requires more complex aggregation logic (port-call events → daily ship counts).
Integration sketch:
class MarineTraffic:
    """Retrieve port call and vessel data from MarineTraffic API."""

    def __init__(self) -> None:
        load_dotenv()
        self.__api_key = os.getenv("MARINETRAFFIC_API_KEY")
        if not self.__api_key:
            raise ValueError("MARINETRAFFIC_API_KEY must be set in .env")

    @lru_cache(maxsize=32)
    def get_port_calls(self, port_id, start_date, end_date):
        # GET /exportportcalls/v:6/{api_key}/portid:{port_id}/...
        # Aggregate events → daily ship count
        ...
        return pd.DataFrame({"t": dates, "v": daily_ship_counts})

    @lru_cache(maxsize=32)
    def get_historical_price(self, symbol, start_date, end_date):
        # Wrapper: symbol = port_id, v = daily ship count
        return self.get_port_calls(symbol, start_date, end_date)
Backtest use cases:
- Harbor ship count as trade-volume proxy (e.g. Shanghai, Rotterdam)
- Container vessel frequency as supply-chain leading indicator
- Tanker traffic at oil terminals as crude demand signal
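The event-to-daily aggregation step can be sketched with pandas. The 'timestamp' field name is an assumption about the port-call payload, not taken from the MarineTraffic schema:

```python
import pandas as pd

def daily_ship_counts(events: pd.DataFrame) -> pd.DataFrame:
    """Collapse event-level port-call records into the daily ['t', 'v'] frame.

    Expects one row per port call with a 'timestamp' column (assumed name).
    """
    days = pd.to_datetime(events["timestamp"]).dt.strftime("%Y-%m-%d")
    counts = days.value_counts().sort_index()  # ISO dates sort chronologically
    return pd.DataFrame({"t": counts.index.to_list(),
                         "v": counts.astype(float).to_list()})

# Three calls across two days → counts of 2 and 1
calls = pd.DataFrame({"timestamp": [
    "2026-04-01T03:00:00", "2026-04-01T19:30:00", "2026-04-02T08:15:00"]})
df = daily_ship_counts(calls)
```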
3.5 Priority 4 — Aviationstack¶
| Attribute | Detail |
|---|---|
| Data type | Aviation — daily flight counts, airport traffic |
| Historical depth | Typically <10 years (standard plans) |
| Resolution | Daily |
| API limits | 10,000+ calls/mo (paid) |
| Cost | $49.99+/mo |
| Key endpoints | /v1/flights (historical), /v1/airports |
| Criteria met | C1 ✓ C2 ✓ C3 ✗ C4 ✓ C6 ✓ |
Why P4: Fails C3 (10+ year history) on standard plans. Useful for aviation sector analysis but the shallow archive limits long-term backtesting. Similar cost to FMP but far less historical depth.
Integration sketch:
class Aviationstack:
    """Retrieve daily flight data from Aviationstack."""

    def __init__(self) -> None:
        load_dotenv()
        self.__api_key = os.getenv("AVIATIONSTACK_API_KEY")
        if not self.__api_key:
            raise ValueError("AVIATIONSTACK_API_KEY must be set in .env")

    @lru_cache(maxsize=32)
    def get_historical_price(self, airport_iata, start_date, end_date):
        # Paginate /v1/flights?dep_iata={airport}&flight_date={date}
        # Aggregate → daily flight count
        ...
        return pd.DataFrame({"t": dates, "v": daily_flight_counts})
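The pagination loop in that sketch can be factored into a generic walker. fetch_page is an injected stand-in for the real HTTP call so the sketch stays testable; the (rows, total) shape mirrors offset-based pagination and is an assumption, not the exact Aviationstack response schema:

```python
def count_across_pages(fetch_page, limit=100):
    """Walk an offset-paginated endpoint and total the rows returned.

    fetch_page(offset, limit) -> (rows, total) stands in for the real
    request; 'total' plays the role of the API's pagination total.
    """
    offset, counted = 0, 0
    while True:
        rows, total = fetch_page(offset, limit)
        counted += len(rows)
        offset += limit
        if not rows or offset >= total:
            break
    return counted

# Fake endpoint serving 250 flight records in pages of up to 100
def fake_page(offset, limit, _data=list(range(250))):
    return _data[offset:offset + limit], len(_data)

n_flights = count_across_pages(fake_page)
```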
3.6 Deferred — Satellite & Foot Traffic¶
| Provider | Data Type | Why Deferred |
|---|---|---|
| Planet Labs | Satellite imagery (ship/truck counts) | Enterprise pricing, complex CV pipeline needed |
| Unacast | Retail foot traffic | $1,000+/mo, mobile signal data has privacy concerns |
| Vizion API | Container tracking | Custom pricing, narrow logistics scope |
| SkyFi | On-demand satellite | Pay-per-image, not suited for daily time series |
| GrowthFactor | Retail foot traffic | $400–5,000/mo, US-focused |
| OAG | Institutional aviation stats | Institutional pricing, overlaps Aviationstack |
These providers are either too expensive for solo research or require non-trivial processing pipelines (computer vision for satellite imagery). Revisit when the platform generates revenue or when a specific strategy demands this data.
4. Priority Ranking & Rationale¶
| Rank | Provider | Cost | History | Effort | Score |
|---|---|---|---|---|---|
| P1a | Glassnode (on-chain) | $0–29/mo | 10+ yr | Minimal — extend existing class | ★★★★★ |
| P1b | FMP | $49/mo | 30+ yr | Low — clean REST JSON | ★★★★★ |
| P2 | Nasdaq Data Link | $0–50/mo | 50+ yr | Low — Python SDK | ★★★★☆ |
| P3 | MarineTraffic | £10–100/mo | 15+ yr | Medium — event aggregation | ★★★☆☆ |
| P4 | Aviationstack | $50/mo | <10 yr | Medium — pagination + aggregation | ★★☆☆☆ |
| — | Satellite/Foot Traffic | $400–5K/mo | Varies | High — CV/enterprise | Deferred |
Recommended order: Glassnode on-chain → FMP → Nasdaq Data Link → MarineTraffic → Aviationstack.
Rationale: Glassnode is the highest-ROI first step — the class already exists
in data.py with auth and caching; adding get_onchain_metric() requires no
new dependencies or API keys. It immediately unlocks SOPR, MVRV, and exchange
flow signals for crypto strategies. FMP and Nasdaq Data Link follow as the best
depth-to-cost ratio for macro overlays. MarineTraffic unlocks the unique
harbor-tracking use case but needs event-to-daily aggregation. Aviationstack is
lowest priority due to shallow history.
5. Integration Pattern¶
All new sources follow the existing data.py duck-typed interface:
class <Source>:
    def __init__(self) -> None:
        # Load API key from .env, raise ValueError if missing
        ...

    @lru_cache(maxsize=32)
    def get_historical_price(self, symbol, start_date, end_date) -> pd.DataFrame:
        # Returns DataFrame with columns ['t', 'v']
        #   t = YYYY-MM-DD date strings
        #   v = float metric values
        ...
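Because the interface is duck-typed, downstream code can treat every provider identically. A minimal sketch with a stub source (fetch_series and StubSource are illustrative names, not existing code):

```python
import pandas as pd

def fetch_series(source, symbol, start_date, end_date) -> pd.DataFrame:
    """Call any duck-typed source and verify the ['t', 'v'] contract."""
    df = source.get_historical_price(symbol, start_date, end_date)
    if list(df.columns) != ["t", "v"]:
        raise TypeError(f"{type(source).__name__} broke the ['t', 'v'] contract")
    return df

class StubSource:
    """Stand-in for FMP / NasdaqDataLink / MarineTraffic in tests."""
    def get_historical_price(self, symbol, start_date, end_date):
        return pd.DataFrame({"t": ["2026-04-01"], "v": [1.0]})

df = fetch_series(StubSource(), "INDPRO", "2026-04-01", "2026-04-01")
```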
Pipeline wiring¶
- REFDATA registration: add a row to a new REFDATA.DATA_SOURCE table, or extend REFDATA.ASSET_TYPE with source-specific types.
- main.py dispatch: add the source to the data-fetch dispatch logic.
- API endpoint: GET /api/v1/data/{source}/{symbol} for frontend access.
- env var: <SOURCE>_API_KEY in .env and SSM Parameter Store (/quant/{env}/<source>_api_key).
Rate limiting¶
| Source | Limit | Strategy |
|---|---|---|
| Glassnode | Tier-dependent (free: ~200/day) | @lru_cache + combine metrics in one backtest run |
| FMP | 750/min | Simple time.sleep(0.08) between calls |
| Nasdaq Data Link | 50/day (free) | Cache aggressively; batch date ranges |
| MarineTraffic | Credit-based | Pre-aggregate port calls; minimize API hits |
| Aviationstack | 10K/mo | One call per airport-day; local cache |
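The time.sleep strategy for FMP can be packaged as a small wrapper so call sites stay clean. A sketch (names illustrative; 0.08 s between calls keeps under 750/min):

```python
import time

def throttled(func, min_interval=0.08):
    """Return a wrapper that spaces calls to func at least min_interval apart."""
    last_call = [0.0]  # mutable cell so the closure can update it

    def wrapper(*args, **kwargs):
        wait = min_interval - (time.monotonic() - last_call[0])
        if wait > 0:
            time.sleep(wait)
        last_call[0] = time.monotonic()
        return func(*args, **kwargs)

    return wrapper

# Example: cap a stubbed FMP call; returns the running call count
calls = []
fetch = throttled(lambda url: calls.append(url) or len(calls), min_interval=0.01)
fetch("/api/v3/treasury")
fetch("/api/v3/treasury")
```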
6. REFDATA Changes¶
New rows needed when each source is implemented:
-- REFDATA.DATA_SOURCE (new table or extend ASSET_TYPE)
-- FMP
INSERT INTO REFDATA.DATA_SOURCE (SOURCE_NM, DISPLAY_NAME, BASE_URL, AUTH_TYPE)
VALUES ('FMP', 'Financial Modeling Prep', 'https://financialmodelingprep.com/api', 'API_KEY');
-- Nasdaq Data Link
INSERT INTO REFDATA.DATA_SOURCE (SOURCE_NM, DISPLAY_NAME, BASE_URL, AUTH_TYPE)
VALUES ('NASDAQ_DATA_LINK', 'Nasdaq Data Link', 'https://data.nasdaq.com/api/v3', 'API_KEY');
7. Open Questions¶
| # | Question | Impact |
|---|---|---|
| Q1 | Should alternative data metrics use the same ['t', 'v'] interface or extend to multi-column? | Interface design — affects all downstream modules |
| Q2 | Store fetched alternative data in PostgreSQL for caching, or rely on @lru_cache only? | Cost control — reduces API calls on repeated backtests |
| Q3 | How to handle mixed frequencies (some series skip weekends/holidays)? | Data alignment with market price series |
| Q4 | Should MarineTraffic aggregation (events → daily counts) happen at fetch time or in strat.py? | Separation of concerns |
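One candidate answer to Q3 is to reindex each alternative series onto the price series' own dates and forward-fill the gaps. A sketch of that approach, offered as an option rather than a decision:

```python
import pandas as pd

def align_to_calendar(alt: pd.DataFrame, calendar: list) -> pd.DataFrame:
    """Reindex a ['t', 'v'] series onto the price calendar, forward-filling
    dates (weekends/holidays) the alternative series skips."""
    s = alt.set_index("t")["v"].reindex(calendar).ffill()
    return pd.DataFrame({"t": list(s.index), "v": s.to_numpy()})

# Series missing 2026-04-02; the price calendar includes it
sparse = pd.DataFrame({"t": ["2026-04-01", "2026-04-03"], "v": [10.0, 12.0]})
price_dates = ["2026-04-01", "2026-04-02", "2026-04-03"]
aligned = align_to_calendar(sparse, price_dates)
```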
8. Next Steps¶
- Extend Glassnode class — add a get_onchain_metric() method to the existing class in data.py.
- Add unit tests for on-chain metric fetch with mocked API responses.
- Build proof-of-concept strategy — e.g. BTC price + SOPR capitulation signal or MVRV cycle filter.
- Sign up for an FMP API key — verify endpoint responses against docs.
- Implement FMP class in data.py with treasury rate + economic indicator methods.
- Add FMP unit tests with mocked API responses.
- Build macro overlay strategy — e.g. BTC + Treasury yield regime filter.
- Repeat for Nasdaq Data Link (P2) once FMP is validated.