How to Ingest USDA Export Sales into a Commodity Analytics Pipeline
2026-03-05

Practical guide with code to fetch, parse, normalize, and merge USDA export sales with corn, soy, and wheat price series for production ETL.

Stop hunting for clean USDA export numbers — automate the ingest

If you build commodity dashboards or trading signals, you know the pain: USDA export sales are essential but arrive in inconsistent formats, reported weekly, and need rapid alignment with price time series to be actionable. This guide gives a practical, code-first ETL pattern for ingesting USDA export sales for corn, soybeans, and wheat, normalizing units, merging with price series, and loading into a reproducible analytics pipeline.

Executive summary — what you’ll build

In this tutorial you will learn how to:

  • Locate and fetch USDA export sales reports using API, HTML, and PDF techniques.
  • Parse and normalize weekly export volumes into consistent units (metric tons and bushels).
  • Align export volumes with price time series for corn, soybeans, and wheat.
  • Operationalize the ETL in a reproducible way (local, cloud, or Airflow/DBT).

Why this matters in 2026

Government data modernization throughout 2024–2025 drove more machine-readable endpoints for many agencies, but USDA export reports still come in mixed formats. In 2026, commodity analysts combine USDA disclosures with higher-frequency data (satellite crop indices, vessel tracking, exchange futures) to get a faster, evidence-first read on supply/demand. Automating ingest reduces manual error, speeds model refreshes, and improves reproducibility.

Sources and formats you’ll encounter

Common USDA export sales sources and formats:

  • Weekly Export Sales (FAS) published as HTML pages or PDF bulletins.
  • Bulk CSV/JSON feeds exposed on agency data portals or S3 (availability varies).
  • Third-party APIs or data vendors offering normalized feeds (paid).

Design principle: code for multiple input formats (API, HTML, PDF) and canonicalize into one internal schema.
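One way to apply this principle is a small dispatcher that routes each raw source to the right parser based on its content type. This is a sketch: the parser names returned here are placeholders for the per-format adapters you would write.

```python
# Route a fetched source to the correct parser by content type and URL suffix.
# 'parse_json', 'parse_html', 'parse_pdf' are hypothetical adapter names.

def pick_parser(content_type: str, url: str = '') -> str:
    ct = (content_type or '').lower()
    if 'json' in ct or url.endswith('.json'):
        return 'parse_json'
    if 'html' in ct or url.endswith(('.html', '.htm')):
        return 'parse_html'
    if 'pdf' in ct or url.endswith('.pdf'):
        return 'parse_pdf'
    raise ValueError(f'No parser for content type {content_type!r}')
```

Failing loudly on unknown formats is deliberate: a silent fallback is how malformed weeks slip into the warehouse.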

Step 1 — Define the canonical schema

For commodity analytics keep a simple, auditable schema for each weekly record:

date_week_start (ISO) -- date the USDA week starts (or official report date)
commodity -- one of {corn, soybeans, wheat}
export_mt -- exported volume in metric tons
export_bushels -- exported volume in bushels
destination_country -- ISO country code or 'UNKNOWN'
report_type -- weekly / private / revisions
source_url -- raw source location
ingest_timestamp -- when your system pulled the record

Store raw source files (PDF/HTML/JSON) and a parsed row for provenance.
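As a sketch, the canonical record can be a dataclass so every adapter emits the same shape. Field names follow the schema above; the validation in `__post_init__` is illustrative.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ExportRecord:
    date_week_start: str       # ISO date the USDA week starts (or report date)
    commodity: str             # one of 'corn', 'soybeans', 'wheat'
    export_mt: float           # exported volume in metric tons
    export_bushels: float      # exported volume in bushels
    destination_country: str = 'UNKNOWN'
    report_type: str = 'weekly'
    source_url: str = ''
    ingest_timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def __post_init__(self):
        # cheap invariants: reject records no downstream step should ever see
        if self.commodity not in {'corn', 'soybeans', 'wheat'}:
            raise ValueError(f'unexpected commodity: {self.commodity}')
        if self.export_mt < 0:
            raise ValueError('export_mt must be non-negative')
```

Constructing the record is where bad parses surface first, which keeps error handling close to the adapter that produced the row.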

Step 2 — Fetching data: adaptable patterns

Pattern 1: Producer API (preferred)

If USDA provides a JSON/CSV endpoint, use a robust HTTP client with retries and exponential backoff. Respect rate limits and use conditional GET (If-Modified-Since/Etag) where offered.

# Generic robust fetch example (requests + backoff)
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=5, backoff_factor=0.5, status_forcelist=[429,500,502,503,504])
session.mount('https://', HTTPAdapter(max_retries=retries))

url = 'https://example-usda-endpoint.example/api/export-sales?commodity=CORN&year=2026'
resp = session.get(url, timeout=30)
resp.raise_for_status()
data = resp.json()
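Conditional GET can be layered on top: cache the ETag / Last-Modified values the server returned last time and send them back, so unchanged reports cost nothing. The helper below only builds the headers; whether they are honored depends on the endpoint.

```python
def conditional_headers(etag=None, last_modified=None):
    """Build If-None-Match / If-Modified-Since headers from cached values."""
    headers = {}
    if etag:
        headers['If-None-Match'] = etag
    if last_modified:
        headers['If-Modified-Since'] = last_modified
    return headers

# Usage with the session above (sketch):
# resp = session.get(url, headers=conditional_headers(cached_etag), timeout=30)
# if resp.status_code == 304:
#     pass  # unchanged since last pull; skip re-parsing
```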

Pattern 2: HTML scraping

When data is embedded in an HTML table or as text on a report page, use BeautifulSoup to extract rows. Always cache the HTML and save raw markup for debugging.

# Minimal HTML scrape
import requests
from bs4 import BeautifulSoup

r = requests.get('https://example-usda-page.example/weekly-export-sales', timeout=30)
r.raise_for_status()
soup = BeautifulSoup(r.text, 'html.parser')
# find table by id/class, then loop rows
table = soup.find('table', {'id':'export-sales-table'})
for tr in table.find_all('tr')[1:]:
    cols = [c.get_text(strip=True) for c in tr.find_all('td')]
    # cols -> parse into canonical schema

Pattern 3: PDF parsing (common)

USDA sometimes distributes bulletins in PDF. Use pdfplumber or tabula to extract tabular sections. PDFs are brittle—add heuristics and manual QA checks.

# PDF parse example with pdfplumber
import pdfplumber

with pdfplumber.open('weekly_report.pdf') as pdf:
    first_page = pdf.pages[0]
    tables = first_page.extract_tables()
    # Add heuristics: look for rows with commodity names and numeric patterns
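The heuristic mentioned in the comment can be a small filter: keep only rows that name a tracked commodity and contain at least one number. The keyword list and numeric pattern here are assumptions to adapt per bulletin layout.

```python
import re

COMMODITY_KEYWORDS = ('corn', 'soybean', 'wheat')
NUMERIC = re.compile(r'[\d,]+(?:\.\d+)?')

def looks_like_data_row(cells):
    """True if a row mentions a tracked commodity and holds at least one number."""
    text = ' '.join((c or '') for c in cells).lower()
    has_commodity = any(k in text for k in COMMODITY_KEYWORDS)
    return has_commodity and bool(NUMERIC.search(text))
```

Run the filter over `extract_tables()` output and send anything rejected to a QA queue rather than dropping it silently.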

Step 3 — Normalizing units: mt ↔ bushels

USDA often reports metric tons (MT). Many trading and analytics workflows require bushels. Use constant conversion factors and record them in metadata.

  • Corn: 1 metric ton = 39.368254 bushels (1 bushel corn = 25.4012 kg = 56 lb)
  • Soybeans: 1 metric ton = 36.743709 bushels (1 bushel = 27.2155 kg = 60 lb)
  • Wheat: 1 metric ton = 36.743709 bushels (1 bushel = 27.2155 kg = 60 lb)
# normalize function
CONV = {'corn': 39.368254, 'soybeans': 36.743709, 'wheat': 36.743709}

def mt_to_bushels(commodity, mt):
    return mt * CONV[commodity]

# Example
mt = 500_000  # metric tons
print(mt_to_bushels('corn', mt))

Step 4 — Aligning USDA weeks with price time series

USDA weekly tables typically represent the reporting week ending on a Thursday (verify for the dataset you consume). Price time series (futures) are usually daily. Common alignment methods:

  • Use the USDA report date as the weekly anchor and compute weekly aggregates of daily prices (mean, VWAP if you have volumes).
  • Use weekly close (Friday) or end-of-week futures contract settlement price to represent the week.
  • Keep both daily and weekly merged views for flexible analysis.
# Example: join weekly USDA exports with daily prices (pandas)
import pandas as pd

# prices.csv: date, price_close
prices = pd.read_csv('prices_corn.csv', parse_dates=['date'])
# exports.csv: week_start, commodity, export_mt
exports = pd.read_csv('usda_exports.csv', parse_dates=['week_start'])

# anchor each daily date to the Thursday ending its Fri–Thu reporting week
# (dayofweek: Mon=0 … Thu=3; adjust if your exports use a different convention)
prices['week_start'] = prices['date'] + pd.to_timedelta((3 - prices['date'].dt.dayofweek) % 7, unit='d')
weekly_price = prices.groupby('week_start').price_close.mean().reset_index()

merged = exports.merge(weekly_price, on='week_start', how='left')

Handling futures with roll adjustments

If you use exchange continuous futures (CME), prefer a roll-adjusted series (back-adjusted or ratio-adjusted) so your price series has consistent continuity. Many vendors provide continuous contracts on Nasdaq Data Link (formerly Quandl), or you can construct your own roll using volume/open interest heuristics.
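A minimal back-adjustment sketch, assuming you have already chosen roll dates (e.g. when the next contract's volume or open interest overtakes the front): shift all history before each roll by the price gap between contracts observed at the roll.

```python
def back_adjust(prices, rolls):
    """Back-adjust a stitched front-contract price series.

    prices: list of floats, oldest first (front-contract closes, stitched at rolls)
    rolls:  list of (index, gap) pairs, where gap = new_contract - old_contract
            at the roll date; all history before `index` is shifted by `gap`.
    """
    adjusted = list(prices)
    for idx, gap in rolls:
        for i in range(idx):
            adjusted[i] += gap
    return adjusted

# Example: one roll at index 2 where the new contract trades 5 higher
# back_adjust([100, 101, 107, 108], [(2, 5)]) -> [105, 106, 107, 108]
```

Back-adjusting preserves day-to-day returns across the roll at the cost of shifting absolute price levels; ratio adjustment is the alternative when percentage returns matter more.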

Step 5 — Useful analyses once merged

  • Season-to-date (STD) exports: cumulative exports for the marketing year, compared to previous seasons.
  • Export pace: weekly exports relative to the typical pace at the same point in prior seasons.
  • Price/flow correlation: rolling correlation between weekly exports and futures prices with lags to detect lead-lag relationships.
  • Shipments by country: allocate and map where exports are going; useful to detect demand shifts.
# season-to-date example
exports['export_bushels'] = exports.apply(lambda r: mt_to_bushels(r['commodity'], r['export_mt']), axis=1)
exports = exports.sort_values(['commodity','week_start'])
exports['sod_cum_bushels'] = exports.groupby('commodity').export_bushels.cumsum()

# merge with price and compute correlation
df = exports.merge(weekly_price, on='week_start')
df['price_lag1'] = df.groupby('commodity').price_close.shift(1)
# rolling 8-week correlation between exports and lagged price
corr = df.groupby('commodity').apply(lambda g: g['export_bushels'].rolling(8).corr(g['price_lag1']))

Step 6 — Persisting and operationalizing the ETL

Storage options:

  • Parquet files in an S3 bucket (cheap, versionable with prefixes).
  • Postgres or TimescaleDB for indexed time series queries.
  • Cloud warehouses (BigQuery, Snowflake) for large-scale joins with other datasets.

Suggested schema for a Postgres table:

CREATE TABLE usda_export_sales (
  id SERIAL PRIMARY KEY,
  week_start DATE,
  commodity TEXT,
  export_mt DOUBLE PRECISION,
  export_bushels DOUBLE PRECISION,
  destination_country TEXT,
  report_type TEXT,
  source_url TEXT,
  ingest_timestamp TIMESTAMP DEFAULT now(),
  raw_payload JSONB,
  UNIQUE (week_start, commodity, destination_country)  -- required by the upsert's ON CONFLICT target
);

Operational patterns:

  • Run weekly ingestion as a scheduled job (Airflow/Cron). Include a manual QA step for PDF-parsed weeks for at least the first several months.
  • Store raw files and parsed metadata to support audits.
  • Keep a backfill job to re-run parsing logic if you improve heuristics.

Airflow DAG skeleton (concept)

# Pseudocode DAG steps
1. fetch_raw_sources -> store raw (HTML/PDF/JSON)
2. parse_raw -> produce canonical rows
3. validate_rows -> sanity checks (non-negative, consistent totals)
4. transform -> unit conversions, destination mapping
5. load -> upsert into Postgres / write parquet
6. postproc -> recompute aggregates, notify downstream
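The validate_rows step can be a set of cheap invariant checks like these; the 0.1% tolerance and field names are illustrative.

```python
CONV = {'corn': 39.368254, 'soybeans': 36.743709, 'wheat': 36.743709}

def validate_row(row):
    """Return a list of problems with one canonical row (empty list = OK)."""
    problems = []
    if row.get('commodity') not in CONV:
        problems.append('unknown commodity')
    mt = row.get('export_mt')
    if mt is None or mt < 0:
        problems.append('export_mt missing or negative')
    # bushels should agree with MT under the documented conversion (within 0.1%)
    bu = row.get('export_bushels')
    if mt and bu and row.get('commodity') in CONV:
        expected = mt * CONV[row['commodity']]
        if abs(bu - expected) > 0.001 * expected:
            problems.append('mt/bushel mismatch')
    return problems
```

Returning a problem list (rather than raising on the first issue) lets the DAG report every failure in a bad week at once.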

Best practices, edge cases and pitfalls

  • Provenance: always store source_url, raw payload, and parsing metadata so you can trace back anomalies.
  • Reconciliations: compare your weekly totals against USDA summary tables to detect parsing errors.
  • Unit ambiguity: some weeks report in metric tons, others in bushels—never assume; inspect units.
  • Private sales vs. confirmed shipments: separate these in your schema and analyses; they have different lead times and certainty.
  • Missing weeks: backfill from archived files or vendor feeds; mark imputations explicitly.
  • Time alignment: define a single canonical week convention for the pipeline and document it.
“Automating bulk ingest without provenance is faster but not safer — keep raw saves and clear transformation logs.”
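For the unit-ambiguity pitfall, a cheap guard is to sniff the units from the table header before converting, and to raise rather than assume. The keywords here are assumptions to extend as you meet new layouts.

```python
def detect_unit(header_text):
    """Guess the unit a report column is in; raise rather than assume."""
    t = header_text.lower()
    if 'metric ton' in t or '(mt)' in t or t.strip().endswith('mt'):
        return 'mt'
    if 'bushel' in t or '(bu)' in t:
        return 'bushels'
    raise ValueError(f'cannot determine unit from header: {header_text!r}')
```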

Example: from ingest to insight — a small workflow

  1. Schedule weekly job on Thursday morning to pull new USDA weekly release.
  2. If JSON available: ingest directly; otherwise, pull HTML/PDF and parse.
  3. Normalize exports to metric tons and bushels; tag by commodity and destination.
  4. Aggregate week-start and join to weekly-rolled futures prices.
  5. Compute season-to-date exports and 8-week rolling correlation with lagged price.
  6. Push results to BI dashboard and trigger an alert if week-over-week change exceeds thresholds.

Sample alert logic (pseudo)

if (weekly_exports / mean(last_4_weeks) > 2):
    trigger_alert('Spike in exports for {commodity}')

if (sod_cum_bushels < historic_sod_5yr_percentile):
    flag('Lower than 5-year seasonal pace')
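The spike check above could be made concrete in pandas roughly like this; the column names and the 2x threshold are assumptions matching the examples earlier in this guide.

```python
import pandas as pd

def export_spike_alerts(weekly, threshold=2.0):
    """Flag weeks where exports exceed `threshold` x the prior 4-week mean.

    weekly: DataFrame with columns ['week_start', 'commodity', 'export_bushels'].
    """
    weekly = weekly.sort_values(['commodity', 'week_start']).copy()
    # prior 4-week mean per commodity (shift so the current week is excluded)
    base = (weekly.groupby('commodity')['export_bushels']
                  .transform(lambda s: s.shift(1).rolling(4).mean()))
    weekly['spike'] = weekly['export_bushels'] > threshold * base
    return weekly[weekly['spike']]
```

Weeks without four prior observations compare against NaN and are never flagged, which avoids false alarms at the start of a season.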

Trends shaping ingestion in 2026

  • Increased availability of cloud-hosted public datasets and API-first government catalogs — check Data.gov and USDA developer resources for JSON or S3 endpoints first.
  • Higher-resolution complementary inputs: vessel trackers, port unloads, and satellite-based yield indicators — enrich export data with these to detect true flow vs. paperwork.
  • Adoption of data contracts and schema registries in cloud-native ETLs to lock expectations for export rows and reduce pipeline breakages.

Methodology notes & limitations

USDA weekly export reports are a leading indicator but not perfect. They report registrations and confirmations that may be revised. Some important cautions:

  • Private sales figures can be reported before shipments occur.
  • Revision risk: USDA periodically revises past weeks — design your pipeline to handle updated rows and re-compute aggregates.
  • Destination codes and country names may change; maintain a mapping table and audit it quarterly.

Where to get price time series

Price sources vary by budget and compliance needs:

  • Exchange vendors (CME/ICE) — authoritative but often paid.
  • Nasdaq Data Link / Quandl — convenient for continuous futures.
  • Free sources for prototypes: Yahoo Finance (limited for commodities), Stooq, or vendor CSVs.

Always validate that the price ticker corresponds to the contract series (nearby/second nearby/continuous) that your strategy expects.

Quick reference checklist before production

  • Save raw reports and parsed outputs.
  • Record conversion constants in metadata.
  • Implement retries, backoff, and rate limit handling.
  • Write unit tests for parsing logic using archived PDFs/HTML files.
  • Build reprocessing/backfill job to fix parsing improvements.
  • Monitor weekly differences vs USDA summary reports and alert on mismatches.

Appendix — Useful code snippets

Upsert to Postgres (psycopg2)

# upsert example
import psycopg2
conn = psycopg2.connect(...)
cur = conn.cursor()
cur.execute('''
INSERT INTO usda_export_sales(week_start, commodity, export_mt, export_bushels, destination_country, source_url, raw_payload)
VALUES (%s,%s,%s,%s,%s,%s,%s)
ON CONFLICT (week_start, commodity, destination_country) DO UPDATE SET
  export_mt = EXCLUDED.export_mt,
  export_bushels = EXCLUDED.export_bushels,
  source_url = EXCLUDED.source_url,
  raw_payload = EXCLUDED.raw_payload;
''', (week_start, commodity, mt, bushels, country, source_url, json_payload))
conn.commit()

Simple testing approach for PDFs

Store a few historical PDFs in /test_data and run your parser in CI to ensure no regression.
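A regression test pins a small archived sample to an expected parse. Here the sample line and `parse_line` are illustrative stand-ins for your real parser and the fixtures in /test_data.

```python
import re

SAMPLE = 'CORN 123,400 MT CHINA'  # stand-in for a line from an archived report

def parse_line(line):
    """Toy parser: commodity, volume (MT), destination from a fixed layout."""
    m = re.match(r'(\w+)\s+([\d,]+)\s+MT\s+(\w+)', line)
    if not m:
        raise ValueError(f'unparseable line: {line!r}')
    return {'commodity': m.group(1).lower(),
            'export_mt': float(m.group(2).replace(',', '')),
            'destination_country': m.group(3)}

def test_parse_line_regression():
    assert parse_line(SAMPLE) == {'commodity': 'corn',
                                  'export_mt': 123400.0,
                                  'destination_country': 'CHINA'}
```

Run these in CI so a "harmless" heuristic tweak cannot silently change historical parses.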

Actionable takeaways

  • Design for heterogeneous sources — build adapters for API/HTML/PDF and a canonical schema.
  • Keep provenance — raw saves and parsed payloads are non-negotiable for audits.
  • Normalize units and document conversion factors in metadata.
  • Align carefully — define your week anchor and roll rules for futures before analysis.
  • Test against historical files and implement a backfill path for reprocessing.

Call to action

Ready to automate USDA export sales in your pipeline? Clone the companion repo with working parsers, example datasets, and Airflow DAGs — or drop your dataset sample and I’ll walk through a tailored ingestion plan. For hands-on help, search the project's GitHub or reach out to build a reproducible ETL that connects USDA export sales to your commodity signals.
