Methodology

How Appropria is built.

A transparent, end-to-end account of how we turn two public datasets into a clean, attributed, point-in-time record of which public companies earn federal dollars.

Version 1.0 · sources: USASpending.gov, SEC EDGAR · public-domain inputs only
01

Ingestion

Two public-domain sources, refreshed daily.

  • USASpending.gov - prime award transactions, sub-award (FSRS) records, and federal-account financials, pulled by fiscal year and action date.
  • SEC EDGAR - the company_tickers.json issuer↔ticker reference, and the Exhibit-21 “Subsidiaries of the Registrant” exhibit from each public filer’s latest 10-K / 20-F.

EDGAR access respects the SEC fair-access policy (≤10 requests/second, declared User-Agent). No scraping of authenticated or non-public pages; no personal data is ingested.
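
The fair-access limit can be honored with a small client-side throttle. The following is a minimal sketch, not our production client: the `Throttle` class, the `fetch` helper, and the contact string in `HEADERS` are hypothetical names introduced for illustration.

```python
import time
import urllib.request

# Hypothetical contact string; replace with your own organization and e-mail.
HEADERS = {"User-Agent": "ExampleResearch admin@example.com"}


class Throttle:
    """Space calls evenly so that no more than `rate` happen per second."""

    def __init__(self, rate=10.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self.clock = clock          # injectable so the class can be tested
        self.sleep = sleep
        self.next_allowed = 0.0

    def wait(self):
        """Block until the next request slot, then reserve the one after it."""
        now = self.clock()
        if now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval


def fetch(url, throttle):
    """Fetch one URL with the declared User-Agent, after waiting for a slot."""
    throttle.wait()
    request = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()
```

Sharing one `Throttle` instance across all EDGAR requests keeps the combined rate under the limit; separate instances would each allow 10 requests/second.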

02

Normalization & crosswalk

Names are cleaned, then mapped to a parent ticker built from filings.

Recipient names are written inconsistently, so every name is first reduced to a canonical form:

# normalize(name)
lower-case → strip punctuation & symbols → collapse whitespace
strip legal-form tokens only: inc, corp, corporation, company, co, llc, lp, ltd, the, and
keep distinctive tokens: systems, solutions, technologies, services, group…

Stripping only legal-form suffixes is deliberate - removing distinctive words like “Systems” would collapse unrelated firms onto each other (e.g. Paragon Systems vs. an unrelated “Paragon”). We learned this the hard way and corrected for it.
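
The canonical form described above can be sketched in Python. This is an illustration under stated assumptions, not the production code: deleting periods and apostrophes outright (so “L.L.C.” becomes “llc”) and turning all other punctuation into spaces is our guess at a sensible reading of "strip punctuation & symbols".

```python
import re

# Legal-form tokens taken from the list above; distinctive words are not here.
LEGAL_FORM_TOKENS = {
    "inc", "corp", "corporation", "company", "co",
    "llc", "lp", "ltd", "the", "and",
}


def normalize(name: str) -> str:
    """Reduce a recipient or issuer name to its canonical form."""
    text = name.lower()
    # Assumption: periods and apostrophes are deleted, not replaced by spaces.
    for ch in (".", "'", "\u2019"):
        text = text.replace(ch, "")
    # All remaining punctuation and symbols become spaces.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # split() with no arguments also collapses runs of whitespace.
    tokens = [tok for tok in text.split() if tok not in LEGAL_FORM_TOKENS]
    return " ".join(tokens)
```

With this sketch, `normalize("Paragon Systems, Inc.")` keeps the distinctive token and yields `"paragon systems"`, while `normalize("The Paragon Co.")` yields `"paragon"`, so the two names stay distinct.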

Then, for every public filer, we parse its EX-21 schedule and map each normalized subsidiary name to the parent’s ticker:

5,251 subsidiary → ticker entries, auto-built from SEC Exhibit-21 filings

This is why ELECTRIC BOAT CORP resolves to GD and SIKORSKY AIRCRAFT to LMT - names a direct issuer-name match never catches. The crosswalk rebuilds from filings, so it updates as companies reorganize.

03

Entity resolution

Each recipient runs through a deterministic-first match cascade.

1. exact normalized match → SEC issuer name (the prime)
2. exact normalized match → EX-21 subsidiary (the parent)
3. fuzzy match → SEC issuer (token-set ratio ≥ 94)
4. fuzzy match → EX-21 subsidiary (token-set ratio ≥ 94)
else → unresolved (treated as private / non-public)

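Fuzzy matching uses an order- and length-tolerant token-set ratio, so it aligns variants that differ in word order or carry extra tokens. It cannot bridge a rename: “Raytheon Company” and the issuer “RTX Corp” share no tokens and score far below the threshold, so pairs like that must be caught by the exact-match steps, if at all. A high threshold keeps false positives rare.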

Precision over recall. In the shipped product we publish exact-match resolutions only. A wrong ticker is far more damaging to a research user than a missing one, so fuzzy candidates are held for review rather than auto-published.

Every resolution carries a method and a confidence grade:

Tier | Method | Handling
High | exact issuer / exact EX-21 subsidiary | published
Review | fuzzy ≥ threshold | held for human review
Excluded | universities, JVs, non-profits, private firms, government | flagged non-public
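
The cascade and its tiers can be sketched as follows. This is a simplified illustration, not the shipped matcher: `token_set_ratio` is a standard-library approximation of the token-set ratio found in common fuzzy-matching libraries, the `Resolution` type and function names are hypothetical, and mapping unresolved names to the Excluded tier is an assumption. Inputs are expected to be normalized already.

```python
from difflib import SequenceMatcher
from typing import NamedTuple, Optional


class Resolution(NamedTuple):
    ticker: Optional[str]
    method: str
    tier: str


def _ratio(a: str, b: str) -> float:
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def token_set_ratio(a: str, b: str) -> float:
    """Approximate token-set ratio: compare shared tokens against each side."""
    ta, tb = set(a.split()), set(b.split())
    if not ta or not tb:
        return 0.0
    common = " ".join(sorted(ta & tb))
    side_a = (common + " " + " ".join(sorted(ta - tb))).strip()
    side_b = (common + " " + " ".join(sorted(tb - ta))).strip()
    return max(_ratio(common, side_a), _ratio(common, side_b), _ratio(side_a, side_b))


def best_fuzzy(name, table):
    """Return (score, candidate) for the closest key in `table`."""
    scored = [(token_set_ratio(name, cand), cand) for cand in table]
    return max(scored, default=(0.0, None))


def resolve(name, issuers, subsidiaries, threshold=94.0):
    """Run the four-step cascade; both tables map normalized name -> ticker."""
    if name in issuers:
        return Resolution(issuers[name], "exact issuer", "High")
    if name in subsidiaries:
        return Resolution(subsidiaries[name], "exact EX-21 subsidiary", "High")
    for method, table in (("fuzzy issuer", issuers),
                          ("fuzzy EX-21 subsidiary", subsidiaries)):
        score, cand = best_fuzzy(name, table)
        if score >= threshold:
            return Resolution(table[cand], method, "Review")  # held, not published
    # Assumption: unresolved names are treated like the Excluded tier.
    return Resolution(None, "unresolved", "Excluded")
```

Note that a token-set ratio scores 100 whenever one name's tokens are a subset of the other's, so a bare “paragon” would fuzzy-match “paragon systems”. That behavior is one reason fuzzy candidates go to review rather than straight to publication.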
04

Point-in-time & lag

We snapshot the data daily, because the source overwrites itself.

Reporting lag is large. Obligations for the most recent ~3 months run roughly 80% or more below the trailing run-rate. That is not a real decline, just incomplete reporting, so any naive read of fresh data is wrong.

To fix this we capture a point-in-time snapshot every day: cumulative obligations and outlays per ticker, stamped with the capture date. Because the source overwrites itself, these vintages cannot be reconstructed after the fact - they exist only because we record them daily. The day-over-day delta yields a spending-flow series the source never publishes directly.
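
The day-over-day delta can be sketched as a difference between consecutive vintages. This is a minimal illustration assuming one ticker's snapshots are held in a dictionary keyed by capture date; the function name and the figures in the usage note are hypothetical.

```python
from datetime import date


def daily_flow(snapshots):
    """Turn cumulative snapshots into a flow series.

    snapshots: dict mapping capture date -> cumulative obligations (one ticker).
    Returns a dict mapping each capture date (except the first) to the change
    since the previous captured vintage.
    """
    flows = {}
    previous = None
    for day in sorted(snapshots):
        if previous is not None:
            flows[day] = snapshots[day] - snapshots[previous]
        previous = day
    return flows
```

For example, cumulative values of 100.0, 105.0 and 120.0 captured on three consecutive days give flows of 5.0 and 15.0. A flow can be negative when an agency revises a figure downward, and a missed capture day makes the next delta span more than one day.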

05

Data quality

Three controls keep the output clean.

  • EX-21 boilerplate removal. Filing exhibits contain headers and jurisdictions (“Delaware”, “Name of Subsidiary”); we filter these so they cannot become spurious matches.
  • Sub-award sanity bounds. FSRS amounts are self-reported and frequently mis-keyed - a meaningful share of the largest records exceed any plausible value (we have observed entries in the trillions). Implausible figures are capped or discarded and logged.
  • Idempotent capture. Daily snapshots are de-duplicated per (date, ticker), so re-runs never double-count.
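
The three controls can be sketched together. This is an illustration only: the boilerplate set lists just the two examples given above, the ceiling value is a hypothetical placeholder (we state no specific bound here), and this version discards implausible amounts without showing the capping path.

```python
# Partial, illustrative list; a real filter would cover all jurisdictions and headers.
BOILERPLATE = {"delaware", "name of subsidiary"}

# Hypothetical ceiling in USD, chosen only for the example.
MAX_PLAUSIBLE_USD = 1e11


def is_boilerplate(normalized_line):
    """True if an EX-21 line is a header or jurisdiction, not a subsidiary."""
    return normalized_line in BOILERPLATE


def screen_subawards(amounts, ceiling=MAX_PLAUSIBLE_USD):
    """Split amounts into (kept, rejected); rejected ones should be logged."""
    kept, rejected = [], []
    for amount in amounts:
        (kept if abs(amount) <= ceiling else rejected).append(amount)
    return kept, rejected


def capture(store, capture_date, ticker, obligations, outlays):
    """Idempotent write: the (date, ticker) key means re-runs overwrite."""
    store[(capture_date, ticker)] = {
        "obligations": obligations,
        "outlays": outlays,
    }
```

Keying the store on `(capture_date, ticker)` is what makes a re-run safe: a second capture on the same day replaces the first record instead of adding a duplicate.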
06

Coverage & limits

What the dataset covers today - and where it is deliberately conservative.

36% → 63% of award dollars attributed to a public ticker: naive issuer-name matching vs. matching with the EX-21 crosswalk

History extends back to FY2016 (extendable to FY2008) and covers the public companies that receive federal awards, which are concentrated in defense, government IT/services, and healthcare.

  • Outlay coverage is sparse. Many awards report no outlay-to-date; outlay figures are partial.
  • Joint ventures are approximate. Multi-parent management JVs are attributed to a lead public parent where one exists, and flagged.
  • Figures are revised. All amounts change as agencies update filings; we preserve vintages but do not restate the source.
  • Federal share varies. Attribution is most material for government-dependent companies; for diversified issuers, federal dollars may be a small part of total revenue.
Disclaimer. Appropria is a research data product. It provides data and analytics for research and informational purposes only and is not investment advice, nor a recommendation, offer, or solicitation to buy or sell any security. Built solely on US public records (USASpending.gov and SEC EDGAR). US data only.