Extractors Reference
Five auto-registered extractors that bid on every NavigateResult via canExtract(), plus a sixth LLM path that the pipeline invokes directly when a caller asks for strategy: "llm" or "hybrid".
How to read this page
The extract phase takes bytes + content type from Navigate and turns them into three things downstream phases care about: clean text, optional temporal segments, and optional structured data.
- Registry order matters. The registry walks each extractor's `canExtract()` probe in a deterministic specific-before-generic order. First match wins.
- The `method` on every `ExtractResult` is `extractor:strategy` (e.g. `html:css`, `pdf:basic`, `llm:zod`). Log it; put it in your traces.
- Every extractor honours the same `ExtractOptions.signal` for caller-initiated abort.
See Concepts: Extract for the mental model and strategy selection rules.
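The `extractor:strategy` convention makes `method` easy to split for tracing. A minimal sketch, assuming only the shape documented on this page; `traceTag` is a hypothetical helper, not part of Pluck's API:

```typescript
// Hypothetical helper: split an ExtractResult's `method` ("extractor:strategy")
// into parts for logs and traces. The shape mirrors this page, not Pluck's source.
interface ExtractResultLike {
  method: string;      // e.g. "html:css", "pdf:basic", "llm:zod"
  confidence: number;  // 0..1 heuristic
}

function traceTag(result: ExtractResultLike): { extractor: string; strategy: string } {
  const [extractor, strategy = "default"] = result.method.split(":");
  return { extractor, strategy };
}

traceTag({ method: "html:css", confidence: 0.9 });
// → { extractor: "html", strategy: "css" }
```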
The auto-registry (5 extractors)
These five extractors are in getDefaultExtractors() and compete for every NavigateResult via canExtract(). Registry order is specific-first – the narrower extractors fire before the HTML catch-all.
| # | Extractor | canExtract matches | Produces | Peer dep |
|---|---|---|---|---|
| 1 | transcribe | sourceType ∈ { audio, video, youtube } | text + timestamped segments | Whisper (Groq / OpenAI / Deepgram) – configured via llm.apiKey |
| 2 | deserialize | contentType contains json / yaml / xml / csv / tsv | text + parsed data – zero LLM calls | None |
| 3 | pdf | application/pdf | text via built-in PDF stream scanner | unpdf (optional – improves extraction quality) |
| 4 | ocr | image/* | text + scalar confidence | tesseract.js |
| 5 | html | text/html or plain-text content (catch-all) | text with tags stripped and whitespace collapsed | None |
The html extractor is deliberately last so any of the four narrower extractors fire first when their canExtract() matches – when you pluck a PDF, the pdf extractor wins before html ever sees the bytes.
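First-match-wins dispatch can be modeled in a few lines. This is an illustrative sketch of the ordering rule, not Pluck's actual registry code; `SourceLike`, `ExtractorLike`, and the probe bodies are simplified stand-ins:

```typescript
// Sketch: specific-before-generic dispatch with a catch-all last.
// Types and probes are illustrative, not Pluck's real internals.
interface SourceLike { contentType: string }
interface ExtractorLike { name: string; canExtract(s: SourceLike): boolean }

const registry: ExtractorLike[] = [
  { name: "pdf",  canExtract: (s) => s.contentType === "application/pdf" },
  { name: "ocr",  canExtract: (s) => s.contentType.startsWith("image/") },
  { name: "html", canExtract: () => true }, // catch-all, deliberately last
];

const findBest = (s: SourceLike) => registry.find((e) => e.canExtract(s))?.name;

findBest({ contentType: "application/pdf" }); // "pdf" – wins before html
```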
Strategy-dispatched: the LLM extractor
The llm extractor is not in the auto-registry. The pipeline invokes createLlmExtractor() directly when the caller passes strategy: "llm" or "hybrid"; the extractor then sends the prepared content to an LLM and parses a JSON response.
| # | Extractor | How to invoke | Produces | Peer dep |
|---|---|---|---|---|
| 6 | llm | pluck(uri, { extract: { strategy: "llm", ... } }) | text + structured data from an LLM | None (direct fetch to OpenAI / Anthropic / Ollama, or MCP host sampling) |
See Native Zod passthrough below for the typed-schema flow and Concepts: Extract for the full strategy selection rules.
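Because the llm extractor parses a JSON response from the model, replies wrapped in markdown code fences have to be unwrapped first. A hedged sketch of that kind of tolerant parsing; `parseLlmJson` is hypothetical, not Pluck's implementation:

```typescript
// Hypothetical helper: pull a JSON object out of an LLM reply that may be
// wrapped in a markdown code fence. Not Pluck's actual implementation.
const FENCE = "`".repeat(3); // the three-backtick fence marker, built at runtime

function parseLlmJson(reply: string): Record<string, unknown> {
  const fenced = reply.match(new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE));
  return JSON.parse((fenced ? fenced[1] : reply).trim());
}
```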
Extraction strategies
Three extraction paths – html, llm, and hybrid – accept an ExtractStrategy:
```typescript
type ExtractStrategy = "auto" | "css" | "regex" | "llm" | "hybrid";
```
| Strategy | Cost | When to use | method values emitted |
|---|---|---|---|
| auto (default) | Free | Let the registry pick based on content. | html:strip-tags, html:text, deserialize:*, pdf:basic, ocr:tesseract, transcribe:whisper |
| css | Free | Stable source, fields nameable by selector. | html:css |
| regex | Free | Unstable markup, predictable shape. | html:regex |
| llm | Model tokens | Fuzzy / drift-resistant extraction. | llm:structured, llm:zod (when schema is a Zod type) |
| hybrid | CSS first, LLM fallback | Most fields are deterministic, one or two are fuzzy. | hybrid:html+llm |
Native Zod passthrough
extract.schema accepts either a JSON Schema object or a Zod schema. When a Zod schema is supplied, Pluck auto-derives fields from the top-level shape, validates the LLM response via safeParse, strips unknown keys, and surfaces method: "llm:zod". See Concepts: Extract → Native Zod schemas.
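One common way to tell the two schema kinds apart is duck-typing on `safeParse`; whether Pluck detects Zod exactly this way is an assumption, but the sketch shows the idea:

```typescript
// Sketch: distinguish a Zod schema from a plain JSON Schema object by
// probing for `safeParse`. Illustrative only; Pluck's detection may differ.
function isZodLike(
  schema: unknown
): schema is { safeParse: (v: unknown) => { success: boolean } } {
  return typeof (schema as { safeParse?: unknown })?.safeParse === "function";
}

isZodLike({ type: "object", properties: {} });                   // false → JSON Schema path
isZodLike({ safeParse: (_: unknown) => ({ success: true }) });   // true  → llm:zod path
```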
Hybrid escalation
hybrid tries html:css first. If the base extraction's confidence falls below ~0.7 or the text is shorter than 100 characters, it falls back to llm. The result carries method: "hybrid:html+llm" and merges base metadata with LLM-extracted fields.
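The escalation rule reduces to a two-clause predicate. A sketch using the thresholds quoted above (~0.7 confidence, 100 characters); treat both numbers as approximate and `shouldEscalate` as a hypothetical name:

```typescript
// Sketch of the hybrid escalation check: fall back to the LLM when the
// CSS pass is low-confidence or the extracted text is too short.
function shouldEscalate(base: { confidence: number; text: string }): boolean {
  return base.confidence < 0.7 || base.text.length < 100;
}

shouldEscalate({ confidence: 0.9, text: "x".repeat(500) }); // false – keep html:css result
shouldEscalate({ confidence: 0.4, text: "x".repeat(500) }); // true  – escalate to llm
```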
ExtractResult shape
Every extractor emits the same shape:
```typescript
interface ExtractResult {
  text: string;                   // always present – clean text
  segments?: Segment[];           // subtitles, PDF pages, regex captures
  data?: Record<string, unknown>; // structured, when available
  pages?: Page[];                 // PDF / OCR per-page
  images?: ImageResult[];         // inline images when includeImages
  confidence: number;             // 0..1 heuristic
  method: string;                 // "html:css" / "transcribe:whisper" / ...
}
```
confidence is a heuristic – html reports 0.9 on well-structured pages, pdf:basic reports 0.7, ocr:tesseract reports Tesseract's own confidence score, transcribe:whisper reports the provider's segment-averaged confidence.
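Because the scale is heuristic and per-extractor, downstream code usually treats `confidence` as a routing signal rather than a probability. An illustrative sketch with an arbitrary cutoff; 0.75 is this example's choice, not a Pluck default:

```typescript
// Sketch: route low-confidence extractions to a review queue.
// The 0.75 threshold is arbitrary, chosen for this example only.
function needsReview(result: { confidence: number; method: string }): boolean {
  return result.confidence < 0.75;
}

needsReview({ confidence: 0.9, method: "html:css" });  // false – trust it
needsReview({ confidence: 0.7, method: "pdf:basic" }); // true  – flag for review
```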
Introspection
Every live PluckInstance exposes the registry:
```typescript
import { createPluck } from "@sizls/pluck";

const pluck = createPluck();

pluck.extractors.list();
// ["transcribe", "deserialize", "pdf", "ocr", "html"]
// ↑ the five auto-registered extractors. `llm` is strategy-dispatched,
// not in this list; invoke it via extract: { strategy: "llm" }.

pluck.extractors.findBest(source);
// "html" – when `source.contentType = "text/html"`

pluck.extractors.findAll(source);
// ["html", "deserialize"] – the full fallback chain, specific-first
```
Register your own via createPluck({ extractors: [...] }) or instance.extractors.register(customExtractor). Custom extractors are prepended, so your canExtract() wins over the built-ins.
See Concepts: Extract → Custom extractors for the defineExtractor() pattern.
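A custom extractor only needs the three members this page describes on the built-ins: a name, a `canExtract()` probe, and an `extract()` body. A minimal sketch shaped like that description; the plain object literal stands in for the real `defineExtractor()` helper, which may enforce more:

```typescript
// Sketch of a custom extractor mirroring the shape this page describes.
// Illustrative only – the real defineExtractor() contract may differ.
const markdownExtractor = {
  name: "markdown",
  canExtract: (source: { contentType: string }) =>
    source.contentType.includes("markdown"),
  async extract(bytes: Uint8Array) {
    // Strip leading heading markers – good enough for a demo.
    const text = new TextDecoder().decode(bytes).replace(/^#+\s*/gm, "");
    return { text, confidence: 0.9, method: "markdown:strip" };
  },
};
```

Registered via instance.extractors.register(markdownExtractor), it is prepended, so its canExtract() is probed before the built-ins and text/markdown never falls through to the html catch-all.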
What's next
- Concepts: Extract – the phase mental model, strategy selection, options.
- Concepts: Shape – the next phase; pin extracted data to a Zod contract.
- Reference: Connectors – the transports that feed bytes into extract.