Creating an Open Dataset of ICE Custody Deaths and Media Coverage

Unknown
2026-02-28

A practical guide to assembling a reproducible, open CSV of ICE custody deaths linked to media mentions for transparent analysis and reporting.

Why you need a reproducible dataset of ICE custody deaths and media mentions

Data teams, reporters, and researchers tell us the same thing: finding trustworthy, citable records of deaths in ICE custody and matching those records to media coverage is time-consuming, error-prone, and often unreproducible. With renewed scrutiny in late 2025 and early 2026 around high-profile cases — and rising demands for transparency — a well-documented, reproducible dataset that pairs ICE custody deaths with media mentions fills a practical gap for analysis, reporting, and policy research.

The project objective and the 2026 context

Objective: Assemble an open, versioned CSV dataset of recorded ICE custody deaths with standardized fields (dates, demographics, facility, cause, official sources) and a linked collection of media mentions (count, first mention date, URLs, source metadata). The dataset must be reproducible: full data pipeline, code, and documentation available for inspection and reuse.

Why now? In late 2025 and into 2026, news coverage and congressional attention about immigration enforcement practices rose after several widely reported incidents. Researchers are demanding more rigorous links between administrative records and public coverage to study media bias, response times, and correlation with policy changes. This guide translates those demands into an operational, repeatable workflow.

High-level approach — inverted pyramid for reproducibility

  1. Collect authoritative death records. Start from official sources and validated third-party trackers.
  2. Normalize and document the data model. Produce a clear CSV schema and a data dictionary.
  3. Gather media mentions. Use media APIs and archives (Media Cloud, GDELT, News API) to identify mentions and timestamps.
  4. Link records to mentions reproducibly. Use deterministic matching rules plus manual audit logs for ambiguous cases.
  5. Publish with provenance and versioning. Host CSVs and code on GitHub, register DOIs via Zenodo/Dataverse, and include CITATION and LICENSE files.

Step 1 — Sources: where to gather death records

Begin with official sources and cross-validate with investigative databases. Recommended starting points:

  • DHS/ICE reports and press releases. ICE publishes public statements and case-specific material; collect URLs and PDF copies.
  • DHS Office of Inspector General (OIG). OIG investigations and summaries often include findings about deaths under custody.
  • State and local coroner records. Coroner/medical examiner reports can confirm cause and date of death.
  • Freedom of Information Act (FOIA) disclosures. FOIA productions often contain the administrative records that underpin official counts.
  • Independent trackers and investigations. Reputable third-party lists (for example, investigative projects published by ProPublica, The Guardian, and local newsrooms) are useful for cross-checks and to recover early reporting links.

Best practice: Save the original source (PDF, HTML snapshot) and capture metadata: retrieval date, publisher, and a persistent link (archived via WebRecorder or the Internet Archive).

Step 2 — Define a reproducible data model and CSV schema

Design a flat CSV schema that supports record linkage and analysis while minimizing personally identifiable information where not required. Below is a recommended core schema (all fields are comma-delimited in the final CSV):

  • death_id — unique stable identifier (GUID or prefixed integer)
  • name — as reported (NULL if withheld)
  • date_of_death — ISO 8601 (YYYY-MM-DD)
  • facility_name
  • custody_type — e.g., ICE detention center, CBP custody, contract facility
  • cause_of_death — short controlled vocabulary (e.g., homicide, suicide, natural, accidental, pending)
  • age — integer or NULL
  • sex — M/F/Other/Unknown
  • nationality
  • race_ethnicity
  • arrest_date — ISO 8601 or NULL
  • detention_duration_days — integer or NULL
  • official_source_url — URL to ICE/DHS/coroner/FOIA document
  • media_mentions_count — integer (populated after media pass)
  • first_media_date — ISO 8601 or NULL
  • media_sources — semicolon-separated list of source keys (for machine and human readability)
  • media_urls — semicolon-separated URLs (archive versions preferred)
  • coding_confidence — numeric 0-1 or categorical (high/medium/low)
  • notes — free text for ambiguous or sensitive issues

Data dictionary: For each field, produce a one-line definition and allowed values. Include an explicit privacy policy explaining how PII is handled and what is omitted for legal/ethical reasons.
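The schema and data dictionary above can be enforced in code. Below is a minimal validation sketch: the field list mirrors the schema, while the controlled-vocabulary sets are illustrative placeholders that you should replace with your own codebook.

```python
# Sketch of a schema validator for the CSV data model above.
# FIELDS mirrors the schema; CAUSES and SEXES are illustrative
# placeholders for your project's controlled vocabularies.
FIELDS = [
    "death_id", "name", "date_of_death", "facility_name", "custody_type",
    "cause_of_death", "age", "sex", "nationality", "race_ethnicity",
    "arrest_date", "detention_duration_days", "official_source_url",
    "media_mentions_count", "first_media_date", "media_sources",
    "media_urls", "coding_confidence", "notes",
]

CAUSES = {"homicide", "suicide", "natural", "accidental", "pending"}
SEXES = {"M", "F", "Other", "Unknown"}

def validate_row(row):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    missing = [f for f in FIELDS if f not in row]
    if missing:
        problems.append("missing fields: %s" % missing)
    if row.get("cause_of_death") not in CAUSES:
        problems.append("unknown cause_of_death: %r" % row.get("cause_of_death"))
    if row.get("sex") not in SEXES:
        problems.append("unknown sex code: %r" % row.get("sex"))
    return problems
```

Run this in your transformation pipeline and in CI so schema drift is caught before a release is tagged.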

Step 3 — Collecting media mentions: APIs and archives (2026 options)

By 2026, several robust tools exist for large-scale media extraction. Choose a combination of APIs and datasets to maximize coverage and reproducibility:

  • Media Cloud — excellent for historical mainstream and local coverage, topic-level queries, and tracking coverage over time.
  • GDELT (Global Knowledge Graph / GKG) — broad global coverage with mention timestamps and entity co-occurrences; useful for volume metrics.
  • Common Crawl — when you need raw HTML snapshots; pair it with the Common Crawl URL index to locate captures efficiently.
  • Commercial news APIs (News API, GNews, etc.) — good for recent coverage but watch rate limits and pay models.
  • Subscription databases and local archives — LexisNexis, Factiva, and local newspaper archives fill gaps for paywalled content; capture metadata even when paywalls block full text.

Practical tip: Always archive retrieved article URLs via the Internet Archive or Perma.cc and store the archive link in media_urls. That ensures long-term reproducibility even if outlets change paywall status or remove pieces.
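Archiving can be scripted against the Internet Archive's SavePageNow endpoint (`https://web.archive.org/save/<url>`). This is a hedged sketch: on a successful save the service redirects to the snapshot at `/web/<timestamp>/<url>`, and the helper returns that final URL for storage in media_urls.

```python
# Sketch: request a Wayback Machine capture of an article URL.
# Assumes the SavePageNow GET endpoint; production code should add
# retries and respect the service's rate limits.
from urllib.request import urlopen

SAVE_ENDPOINT = "https://web.archive.org/save/"

def save_url(url):
    """Build the SavePageNow request URL for a source URL."""
    return SAVE_ENDPOINT + url

def archive_snapshot(url, timeout=60):
    """Trigger a capture; return the final snapshot URL, or None on failure."""
    with urlopen(save_url(url), timeout=timeout) as resp:
        final = resp.geturl()  # urlopen follows the redirect to /web/<ts>/<url>
        return final if "/web/" in final else None
```

Store the returned snapshot URL, not the live link, in media_urls so the dataset stays stable when outlets change or remove pages.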

Step 4 — Match deaths to media mentions reproducibly

Linking administrative records to media mentions is the trickiest step, and the one where reproducibility and documentation matter most. Use a two-stage approach: deterministic matching followed by probabilistic/fuzzy matching with an auditable manual review.

Deterministic rules (first pass)

  • Exact match on name + date_of_death within a 0–2 day window
  • Exact match on facility_name + date when names are absent
  • Use official_source_url mentions to derive canonical identifiers

Fuzzy / probabilistic rules (second pass)

  • Entity-resolution with fuzzy string matching (rapidfuzz/fuzzywuzzy) on name fields
  • NER (spaCy) to extract person and facility names from articles and compare with records
  • Date-window matching: accept mentions within ±14 days of recorded death, but flag for manual review
  • Scoring function that combines name similarity, date proximity, and facility co-occurrence; set a threshold for automatic linking and a lower band for human review

Audit trail: Record the match algorithm version, matching score, and reviewer notes in the dataset (e.g., coding_confidence and notes fields). This is essential so third parties can reproduce or challenge linkages.

Step 5 — Data cleaning, normalization, and quality control

Cleaning steps that materially improve reuse:

  • Normalize dates to ISO 8601 and time zones to UTC.
  • Standardize facility names via an authority table (facility_id, normalized_name, aliases, lat/lon).
  • Enforce controlled vocabularies for cause_of_death and custody_type.
  • Use automated checks: duplicate detection, out-of-range ages, missing critical fields.
  • Run unit tests on transformation code and include them in CI (GitHub Actions/GitLab CI).
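The checks in the list above translate directly into small, testable functions. A minimal sketch (the accepted date formats and the 0-120 age range are assumptions to adapt):

```python
# Sketch of the automated cleaning checks: ISO 8601 normalization,
# out-of-range ages, missing critical fields, and duplicate detection.
from datetime import datetime

def to_iso(raw):
    """Normalize common date formats to ISO 8601; return None if unparseable."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            pass
    return None

def quality_flags(row):
    """Flag out-of-range ages and missing critical fields (assumes numeric age strings)."""
    flags = []
    age = row.get("age")
    if age not in (None, "") and not (0 <= int(age) <= 120):
        flags.append("age_out_of_range")
    for field in ("death_id", "date_of_death", "official_source_url"):
        if not row.get(field):
            flags.append("missing_" + field)
    return flags

def find_duplicates(rows):
    """Return pairs of death_ids sharing (name, date_of_death)."""
    seen, dups = {}, []
    for r in rows:
        key = (r.get("name", "").lower(), r.get("date_of_death"))
        if key in seen:
            dups.append((seen[key], r["death_id"]))
        else:
            seen[key] = r["death_id"]
    return dups
```

Wire these into pytest and your CI workflow so every pull request that touches the data runs the full check suite.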

Step 6 — Packaging, licensing, and publishing

For reproducibility and trust:

  • Code + data in one repo. Place raw sources, transformation scripts, and final CSV in a Git repo with clear README.
  • Use semantic versioning. Tag releases (v1.0.0) and describe changes in a changelog.
  • Deposit a release snapshot to a long-term archive. Zenodo and Dataverse mint DOIs for GitHub releases; include the DOI in the citation.
  • Choose a permissive license. CC BY or CC0 are typical for data; add a LICENSE and a CITATION file with an example citation string.
  • Provide machine-readable metadata. Include dataset_description.json and README with a data dictionary and methodology.
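A dataset_description.json can use the schema.org Dataset vocabulary so search engines and data portals can index the release. This sketch generates one; every value here is a placeholder, not a real release:

```python
# Sketch of machine-readable metadata using schema.org's Dataset type.
# All names, URLs, and versions below are placeholders for your release.
import json

description = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "ICE Custody Deaths and Media Coverage",
    "version": "1.0.0",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [{
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/data/deaths-v1.0.0.csv",
    }],
    "variableMeasured": ["death_id", "date_of_death", "facility_name",
                         "cause_of_death", "media_mentions_count"],
}

metadata_json = json.dumps(description, indent=2)  # write to dataset_description.json
```

Regenerate this file as part of the release workflow so the version field always matches the Git tag.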

Step 7 — Visualize: interactive charts and dashboards

Interactive visualizations increase discoverability and value for your audience of developers and data journalists. Recommended visualizations and tools:

  • Time series of deaths per month, with overlays for major policy events (use Vega-Lite or Observable for easy sharing).
  • Coverage timeline showing first mention lag vs. media mentions count — helps study responsiveness.
  • Geospatial map of facility locations colored by count or rate (Datawrapper, Leaflet, or Mapbox).
  • Demographic breakdown (age, sex, nationality) — stacked bar charts and small multiples.
  • Network view connecting deaths, facility IDs, and media sources to show concentration of reporting.

Toolchain tip: Host the CSV on GitHub and use raw.githubusercontent links as the data source for Observable notebooks, Datawrapper, or Flourish so charts update when the CSV is updated. For reproducible deployments, use GitHub Actions to regenerate derived datasets and publish to a data portal or S3 bucket.
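For the deaths-per-month time series, a Vega-Lite spec pointed at the raw GitHub CSV is enough; the chart re-renders whenever the CSV updates. A minimal sketch, generated here as a Python dict (the repository URL is hypothetical):

```python
# Sketch of a Vega-Lite v5 spec for monthly death counts.
# CSV_URL is a hypothetical raw.githubusercontent path; substitute your repo's.
import json

CSV_URL = "https://raw.githubusercontent.com/your-org/ice-custody-data/main/data/deaths-v1.0.0.csv"

spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "description": "ICE custody deaths per month",
    "data": {"url": CSV_URL},
    "mark": "bar",
    "encoding": {
        "x": {"timeUnit": "yearmonth", "field": "date_of_death",
              "type": "temporal", "title": "Month"},
        "y": {"aggregate": "count", "title": "Deaths"},
    },
}

spec_json = json.dumps(spec, indent=2)  # paste into vega-embed or an Observable cell
```

Overlays for policy events can be added as a second layer keyed to an events CSV in the same repo.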

Ethics, privacy, and attribution

Handling death records requires care. Key considerations:

  • PII minimization. Only include names when they are present in public official records; consider redaction policies for family privacy.
  • Medical privacy. Cause of death should reflect official determinations; avoid speculative coding.
  • Copyright and paywalls. Archive and cite metadata for paywalled content but do not republish full paywalled text without permission.
  • Attribution. Attribute original reporting and official sources in the dataset README and in any derived visualizations.

Validation and community feedback

Invite public review and correction. Practical mechanisms that work in 2026:

  • Issue tracker for data corrections (GitHub Issues) with labels for verification needed, accepted, and rejected.
  • Provide a simple web form for submitting missing records or corrections, with submissions logged to the repo as issues.
  • Publish periodic data audits summarizing manual changes and shifts in coding rules.

Example reproducible pipeline (concise)

The following is an operational pipeline outline you can implement quickly using Python and common tools:

  1. Raw collection: scripts that pull ICE PDFs, OIG reports, and known trackers; store originals in /raw/ with manifest.csv.
  2. Parsing: extract structured fields with heuristics (pdfplumber for PDFs, newspaper3k for HTML).
  3. Normalization: convert to canonical schema, run validation tests (pytest), and write /data/deaths-v{version}.csv.
  4. Media pass: query Media Cloud and GDELT for each death_id; store results in /data/media-mentions-v{version}.csv.
  5. Matching: run deterministic and fuzzy match scripts that annotate media matches and produce a matched CSV with audit columns.
  6. Publish: tag release, push to GitHub, trigger Zenodo DOI creation, and deploy dashboard (Observable or Datawrapper embeds).

Limitations and common pitfalls

No dataset is perfect. Be transparent about these limits in your README:

  • Official counts may lag or omit cases; FOIA gaps are frequent.
  • Media coverage bias: local language and outlet reach affect detectability.
  • Matching errors: similar names, missing names, and inconsistent date reporting cause false positives/negatives.
  • Paywalled content and deletions lead to unstable media URLs; archiving is necessary but not foolproof.

Case study: how a newsroom used this pipeline (experience)

In late 2025, a mid-size newsroom used a pipeline like this to rapidly assemble a dataset for a national investigation. They combined ICE press releases, FOIA material, and local coroner records to create a canonical list of 120 deaths over five years. They then used Media Cloud to track coverage spikes and discovered a systematic delay: national outlets were 7–10 days slower than local outlets in reporting custody deaths. By publishing an open dataset and interactive timeline, the newsroom supported follow-up FOIAs and prompted an oversight hearing query — demonstrating direct policy impact. Keep a public audit log of decisions so others can replicate or contest your findings.

Actionable checklist to get started this week

  • Create a GitHub repo and add a README with project scope and ethical rules.
  • Collect 5 canonical official records and archive them (Internet Archive).
  • Draft the CSV schema and data dictionary as described above.
  • Run a media search (Media Cloud/GDELT) for those 5 events and produce a small matched CSV.
  • Publish a v0.1 release, add a CITATION file, and request feedback via Issues.

"Transparency requires not just open data but open methods." — project principle

Where to host and how to make downloads frictionless

For broad discoverability and ease of use:

  • GitHub for code and CSV hosting (raw files accessible via raw.githubusercontent).
  • Zenodo/Dataverse to mint DOIs for releases and provide long-term preservation.
  • Figshare or S3 + CloudFront for large archives or heavy downloads.
  • Embed interactive charts on a project site (Netlify/GitHub Pages) and provide a clear download link to the CSV and citation text.

Final notes: credibility, reproducibility, and your audience

Developers, technologists, and data journalists expect two things in 2026: reproducible pipelines and clear provenance. Building an open dataset of ICE custody deaths and linking it to media mentions is valuable only if you make your decisions transparent, your code auditable, and your sources accessible. Use automated tests, versioning, and archival strategies so your dataset can be cited confidently by academics, reporters, and policymakers.

Call to action

If you want a jumpstart, fork our reference repository (link in the project README), run the one-click pipeline to generate v0.1 CSV, and open an Issue for any missing records or corrections. Contribute parsed official documents, suggest controlled vocabulary changes, or help build the Observable dashboard. Together we can make coverage and custody data transparent, auditable, and useful for policy and public interest research.
