Engineering

How We Index SEC Filings in Real Time

>_ Rahul Monish

This piece describes the system behind Valyu's SEC filings index: the pipeline that watches EDGAR, parses every new filing into its constituent sections, embeds them, and makes them queryable through the Search and DeepResearch APIs within five minutes of a filing landing on EDGAR. The index covers millions of filings: 10-Ks, 10-Qs, 8-Ks, 13F institutional holdings, 13D/13G ownership. Together they break down into hundreds of millions of addressable records: document sections, holdings, ownership positions.

Most products that sell "SEC data" solve a problem that stopped being hard a long time ago. EDGAR is free, it is public, and it has been machine-accessible for three decades. Pulling a 10-K down is a curl command. Acquisition was never the difficulty. The difficulty is that a filing arrives as a single document with no machine-readable structure, that every form type is shaped differently, that filers amend, and that the corpus is large and constantly moving, with hundreds of new filings on the wire on a typical day and thousands around the quarterly deadlines. A useful SEC index has to absorb all of that without pausing and still answer a question in well under a second.

The index, not the model, decides the answer

If the index treats a 10-K as one undifferentiated blob, the best a retriever can do is return a slab of text and hope the model finds the part that matters. If the index knows that this passage is Item 1A of Apple's FY2024 10-K, filed on this date, the retriever can be precise, and the answer inherits that precision. The work is mostly in the indexing.

Why raw EDGAR is the wrong starting point

Four properties of EDGAR make it a bad place to point a retriever directly.

  1. A filing is one document. There is no machine-readable marker for where the risk factors end and management's discussion begins. A single 10-K runs to several megabytes of HTML; the financial statements alone are hundreds of nested table cells with every number wrapped in inline-XBRL and a dozen attributes; and the headings that imply structure are styled spans whose exact wording, formatting and numbering drift across two decades of filers and hundreds of filing agents.

  2. Every form type is shaped differently. A 10-K, a 10-Q, an 8-K and a 13F share a header and not much else. A 10-K is more than a dozen distinct items of prose across four parts. An 8-K is a handful of event items. A 13F is a structured table of holdings. One parser does not fit them.

  3. Filers amend. When a company corrects a filing it already submitted, it doesn't edit the original; it files a fresh amendment (a 10-K/A, an 8-K/A) that restates most of it. A naive ingest stores both, and a query then returns the original and the amendment as two competing answers.

  4. The corpus is large and moving. Across all form types this is millions of documents, and new ones land throughout every trading day, hundreds in a single afternoon during earnings season. An index built from a nightly snapshot is wrong by Monday morning.

Any one of these is a scripting problem. All four at once, continuously, with sub-second query latency, is not. The messy core, turning that markup into clean sections that line up across every filer and every year, is most of the work, and the part most "SEC APIs" don't do. Below: a few rows of a 10-K income statement, as it arrives from EDGAR and as Valyu indexes it.

Above: Apple Inc.'s FY2024 Form 10-K, the Consolidated Statements of Operations (Item 8). On the left, the document as filed on EDGAR, where every figure is wrapped in inline-XBRL <ix:nonFraction> tags and a wall of inline CSS, with no machine-readable boundary for "this is the income statement". On the right, the same ten line items as Valyu indexes them: a clean table inside an addressable section, alongside the metadata you'd filter on (document type, part/item, company, ticker, filing date and accession number).

Inside the live indexing engine

Detection: catching filings on the wire

We poll EDGAR's current-filings feed every few seconds and pull each new submission's accession number, CIK, form type, company name and timestamp as it appears. The feed is quick but it drops the occasional filing, so we reconcile it against EDGAR's full-text search index every thirty seconds and backfill anything missing. New accessions are batched and handed to processing within seconds of hitting the wire.
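The reconciliation step reduces to a set difference: anything the full-text index knows about that neither the fast feed nor the processed log has seen gets backfilled. A minimal sketch, with function and variable names that are illustrative rather than our internal API:

```python
def reconcile(feed_accessions, fulltext_accessions, already_processed):
    """Return accession numbers the fast feed dropped.

    feed_accessions:     accessions seen on the current-filings feed
    fulltext_accessions: accessions reported by EDGAR full-text search
    already_processed:   accessions we have already handed to processing
    """
    seen = set(feed_accessions) | set(already_processed)
    # Anything full-text search knows about that we never saw is a gap.
    return sorted(set(fulltext_accessions) - seen)
```

Run every thirty seconds, this keeps the fast path fast while guaranteeing nothing the slower index surfaces is ever missed.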

Parsing: a filing is not a document

Each form type has its own processor that takes the filing apart into the components people actually query. A 10-K becomes its real items: Business, Risk Factors, Properties, Legal Proceedings, MD&A, the financial statements, and the rest, across all four parts. A 10-Q becomes its quarterly items. An 8-K becomes the event items that triggered it. The 13F, 13D and 13G filings are structured XML and become normalized holding and ownership rows rather than prose. "ITEM 1A." might be a bold span padded with non-breaking spaces, or carry no number at all, or sit inside a table cell, and which of those it is changes from filer to filer and year to year. We maintain a processor for every form type, hardened against two decades of inconsistent filings, that gets it right anyway.

Every section comes out as its own record carrying the metadata a retriever needs to be precise: company, ticker, CIK, accession number, filing date, period of report, the item it belongs to, its title. By the end of this step a filing is no longer a document. It is a set of addressable sections.
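Heading detection is the fiddly part. Once the styled spans are flattened to text, matching an item heading still has to survive non-breaking spaces, stray punctuation and case drift. A simplified sketch of that normalization, assuming the HTML has already been stripped (the real processors handle far more variants than this):

```python
import re

# Match headings like "ITEM 1A." / "Item 7 - Management's Discussion".
ITEM_RE = re.compile(r"^item\s+(\d{1,2}[a-c]?)\b", re.IGNORECASE)

def match_item_heading(text: str):
    """Return the item number (e.g. '1A') if this line is an item heading."""
    # Collapse non-breaking spaces and runs of whitespace before matching.
    cleaned = re.sub(r"[\u00a0\s]+", " ", text).strip()
    m = ITEM_RE.match(cleaned)
    return m.group(1).upper() if m else None
```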

Embedding: context-aware

Sections still run long and bloat downstream models' context windows, so we chunk them. The chunking is header-aware and size-bounded, so a chunk lines up with the document's own structure instead of cutting mid-sentence. Each chunk is embedded with an embedding model we trained for this: tuned on filings and the way financial disclosure actually reads, not a general-purpose model pulled off a shelf. We also embed each chunk with the rest of the filing in view, not in isolation: using context-aware embedding methods, a chunk's vector carries where it sits in the document and what surrounds it, not just its own words. In a filing where "risk factors", "liquidity" and "related party" recur across sections and mean different things each time, that context is what keeps the right chunk on top.
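A minimal sketch of size-bounded packing, assuming a section has already been split into paragraphs. The size limit and the title prefix are illustrative; the production chunker also aligns on headers and carries richer document context into the embedding:

```python
def chunk_section(title: str, paragraphs: list[str], max_chars: int = 2000) -> list[str]:
    """Pack paragraphs into size-bounded chunks, never splitting mid-paragraph,
    and prefix each chunk with its section title so the embedder sees context."""
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for p in paragraphs:
        if current and size + len(p) > max_chars:
            chunks.append(f"{title}\n\n" + "\n\n".join(current))
            current, size = [], 0
        current.append(p)
        size += len(p)
    if current:
        chunks.append(f"{title}\n\n" + "\n\n".join(current))
    return chunks
```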

Indexing: upserts, not appends

The embedded chunks go into a vector store, one logical table per form type, keyed by a deterministic ID derived from the filing and the section. That key is what makes amendments safe: writes are upserts, so reprocessing a filing, or pulling in its /A, updates the existing rows rather than duplicating them.

Each table carries three kinds of index: a full-text index over the text, structured filters over the metadata (company, date, form, the section a chunk belongs to), and a vector index over the embeddings. Constantly appending small batches fragments any index, so a background job keeps the tables compacted. Query latency stays flat as the corpus grows.
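A sketch of what a deterministic key can look like. The exact fields we derive it from aren't spelled out above, so this hashes a plausible canonical identity (filer, form family, reporting period, section and chunk position) under which an amendment restating the same period collides with, and therefore overwrites, the original's rows:

```python
import hashlib

def record_id(cik: str, form_family: str, period: str, item: str, chunk_index: int) -> str:
    """One canonical key per (filer, form, period, section, chunk position).

    Under this scheme a 10-K/A restating FY2024 produces the same keys as the
    original 10-K, so an upsert replaces the stale rows instead of appending.
    """
    raw = f"{cik}|{form_family}|{period}|{item}|{chunk_index}"
    return hashlib.sha256(raw.encode()).hexdigest()
```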

13F is a different problem

Form 13F does not fit the vector model, and forcing it to would be a mistake. Institutional managers report their holdings quarterly, and across all of them that is well over a hundred million individual line items. Nobody asks 13F questions by semantic similarity. They ask "what did Berkshire add last quarter" or "who are NVIDIA's biggest institutional holders, and how has that moved year over year." Those are aggregations, not nearest-neighbor lookups.

So 13F gets a second representation. The parsed holdings are rolled into a columnar analytical store with pre-aggregated portfolio-level and stock-level summaries, refreshed incrementally every few minutes by diffing against the latest data and applying only the new rows, so it stays current without a full rebuild. The 13D and 13G beneficial-ownership filings, the activist stakes and the passive positions over five percent, get a similar path sized to their much smaller volume. How a question turns into SQL against that store is part of serving, below.
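A "biggest institutional holders" question then becomes a plain GROUP BY over the holdings table. A toy version against an in-memory SQLite database, with a schema and names that are illustrative rather than our production store:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE holdings (manager TEXT, cusip TEXT, period TEXT, value_usd INTEGER);
INSERT INTO holdings VALUES
  ('Fund A', '67066G104', '2024-Q4', 5000),
  ('Fund B', '67066G104', '2024-Q4', 9000),
  ('Fund A', '67066G104', '2024-Q3', 4000);
""")

# Biggest holders of one issuer in the latest reported quarter.
top = conn.execute("""
  SELECT manager, SUM(value_usd) AS total
  FROM holdings
  WHERE cusip = '67066G104' AND period = '2024-Q4'
  GROUP BY manager
  ORDER BY total DESC
""").fetchall()
# top == [('Fund B', 9000), ('Fund A', 5000)]
```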

Serving it: from a question to the right sections

The index is half the product. The other half is turning a question like "what did Tesla say about production capacity in its latest 10-Q?" - and the many variations on it that hedge funds, banks and AI builders actually run - into the right handful of records out of tens of millions. A query routed to the SEC filings source goes through four steps.

  1. First, the query is read. A small classification model we built for exactly this, which knows the SEC form-and-item taxonomy and is fast and cheap to run at the front of every query, pulls the structured fields hiding in the prompt: which form type is in scope (10-K, 10-Q, 8-K, or the 13F / 13D / 13G branch, which is handled separately), which company, a date or date range if one is implied, such as "latest", "FY2024" or "since 2022", and which section of the filing the question is really about. "Risk factors" resolves to Item 1A. "MD&A" or "management's discussion" resolves to Item 7. Liquidity, results of operations and legal proceedings each map to their item. If nothing pins down a section, that field stays open.

  2. Second, the company is resolved. "Tesla", "TSLA", "Tesla Inc" and "Tesla Motors" all have to land on the same filer. We keep a continuously updated index of every public company's name, ticker and CIK, and the company mention in the query is fuzzy-matched against it, returning the CIK and ticker the filings are actually stored under.

  3. Third, the search runs, scoped. The structured fields become hard filters: form type, CIK, date range and item are applied as scalar filters on the relevant per-form table, so the search only ever touches, say, Tesla's 10-Q filings from the past year, Item 7. Within that scope the rest of the query, the actual semantic intent, runs as a hybrid search. The free-text part is embedded the same way the chunks were. It is also matched as keywords against the chunk text. The two scores are combined. Vector search catches paraphrase and concept. Keyword search catches the exact terms, tickers and proper nouns an embedding tends to blur together.

  4. Fourth, the result is assembled. The top chunks come back ranked and deduplicated, each carrying its section metadata (company, ticker, CIK, accession, filing date, item, title) and a link to the filing on EDGAR. By default the caller gets the most relevant sections. Ask for the full filing and the chunks are stitched back together.
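Steps one through three leave the engine with hard scalar filters plus two rankings over the surviving chunks, one from the vector index and one from the keyword index. How the two scores are combined can vary; reciprocal rank fusion, sketched below, is one common choice, shown as an assumption rather than our exact fusion:

```python
def rrf(vector_ranked: list[str], keyword_ranked: list[str], k: int = 60) -> list[str]:
    """Reciprocal rank fusion over two ranked lists of chunk IDs.

    A chunk near the top of either list scores well; a chunk near the top
    of both lists scores best. k damps the influence of any single rank.
    """
    scores: dict[str, float] = {}
    for ranking in (vector_ranked, keyword_ranked):
        for rank, chunk in enumerate(ranking, start=1):
            scores[chunk] = scores.get(chunk, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```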

13F questions take the other route. Because the work is aggregation, the 13F path runs a short tool-use loop instead of a vector search: the model gets two lookup tools, one to fuzzy-resolve a fund manager's name and one to fuzzy-resolve a stock issuer or CUSIP, uses them to nail down the entities, then writes a read-only SQL query against the columnar holdings store. The query is validated before it runs, SELECT and WITH only, no schema changes, no full-table scans on the hundred-million-row holdings table, then executed with a timeout and returned as a table with manager names, reporting periods, values, and links back to the source 13F. The 13D and 13G questions get a similar treatment over their much smaller data.
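The read-only guard is conceptually simple: one statement, SELECT or WITH at the front, no mutating keywords anywhere. A regex sketch of that guardrail (a production validator would lean on a real SQL parser and also enforce the scan and timeout limits described above):

```python
import re

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|pragma)\b",
    re.IGNORECASE,
)

def validate_readonly(sql: str) -> bool:
    """Accept only a single SELECT/WITH statement with no mutating keywords."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:  # reject multi-statement payloads
        return False
    if not re.match(r"(?is)^\s*(select|with)\b", stripped):
        return False
    return not FORBIDDEN.search(stripped)
```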

The practical effect: an 8-K filed at 4:05pm is parsed, embedded, deduplicated against any amendment, and answerable by item, by company, by meaning within minutes, and for most filings inside five. No accession numbers required.

Where this shows up: the Search and DeepResearch APIs

The Valyu Search API exposes this as the valyu/valyu-sec-filings source: filings parsed into sections, embedded, deduplicated, fresh, and queryable in plain language. Ask for a section, a full filing, or a theme across many filings, and you get back ranked, cited, structured results, not a slab of HTML you have to re-parse.

Python
from valyu import Valyu

valyu = Valyu()

risk_factors = valyu.search("Apple 10-K risk factors section")


TypeScript
import { Valyu } from "valyu";

const valyu = new Valyu();
const riskFactors = await valyu.search("Risk factors from Pfizer 10-K FY2021");


Shell
curl -X POST https://api.valyu.ai/v1/search \
  -H "x-api-key: $VALYU_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query": "Apple 10-K risk factors section"}'

The same source is available to the DeepResearch API. A research task pointed at "compare the risk factors in NVIDIA's and AMD's last three 10-Ks" or "track how Tesla's MD&A language on production capacity changed from 2022 to 2024" decomposes the prompt into sub-queries, pulls the relevant filings, cross-references them, and writes the result up with citations.

What we achieved

Filings are searchable within minutes of hitting EDGAR, which is the part that matters on filing day. Per-form parsing means a result is a section, not a document: Item 1A of a named filing, metadata attached, so the retriever and the model can both be precise. Deterministic keys mean amendments correct the record instead of cluttering it. Coverage spans 10-K, 10-Q and 8-K plus 13F holdings and 13D/13G ownership: millions of documents, over a hundred million holding line items.

If you want to bring this to production, it is live at platform.valyu.ai.

Happy building!