Extractors Reference
Five auto-registered extractors that bid on every NavigateResult via canExtract(), plus a sixth LLM path that the pipeline invokes directly when a caller asks for strategy: "llm" or "hybrid".
How to read this page
The extract phase takes bytes + content type from Navigate and turns them into three things downstream phases care about: clean text, optional temporal segments, and optional structured data.
- Registry order matters. The registry walks each extractor's `canExtract()` probe in a deterministic specific-before-generic order. First match wins.
- The `method` on every `ExtractResult` is `extractor:strategy` (e.g. `html:css`, `pdf:basic`, `llm:zod`). Log it; put it in your traces.
- Every extractor honours the same `ExtractOptions.signal` for caller-initiated abort.
See Concepts: Extract for the mental model and strategy selection rules.
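The `extractor:strategy` convention makes `method` easy to split for tracing. A minimal sketch, assuming only the shape documented on this page; `traceTag` is a hypothetical helper, not part of Pluck's API:

```typescript
// Hypothetical helper: split an ExtractResult's `method` ("extractor:strategy")
// into parts for logs and traces. The shape mirrors this page, not Pluck's source.
interface ExtractResultLike {
  method: string;      // e.g. "html:css", "pdf:basic", "llm:zod"
  confidence: number;  // 0..1 heuristic
}

function traceTag(result: ExtractResultLike): { extractor: string; strategy: string } {
  const [extractor, strategy = "default"] = result.method.split(":");
  return { extractor, strategy };
}

traceTag({ method: "html:css", confidence: 0.9 });
// → { extractor: "html", strategy: "css" }
```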
The auto-registry (5 extractors)
These five extractors are in getDefaultExtractors() and compete for every NavigateResult via canExtract(). Registry order is specific-first – the narrower extractors fire before the HTML catch-all.
| # | Extractor | canExtract matches | Produces | Peer dep |
|---|---|---|---|---|
| 1 | transcribe | sourceType ∈ { audio, video, youtube } | text + timestamped segments | Whisper (Groq / OpenAI / Deepgram) – configured via llm.apiKey |
| 2 | deserialize | contentType contains json / yaml / xml / csv / tsv | text + parsed data – zero LLM calls | None |
| 3 | pdf | application/pdf | text via built-in PDF stream scanner | unpdf (optional – improves extraction quality) |
| 4 | ocr | image/* | text + scalar confidence | tesseract.js |
| 5 | html | text/html or plain-text content (catch-all) | text with tags stripped and whitespace collapsed | None |
The html extractor is deliberately last so any of the four narrower extractors fire first when their canExtract() matches – when you pluck a PDF, the pdf extractor wins before html ever sees the bytes.
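First-match-wins dispatch can be modeled in a few lines. This is an illustrative sketch of the ordering rule, not Pluck's actual registry code; `SourceLike`, `ExtractorLike`, and the probe bodies are simplified stand-ins:

```typescript
// Sketch: specific-before-generic dispatch with a catch-all last.
// Types and probes are illustrative, not Pluck's real internals.
interface SourceLike { contentType: string }
interface ExtractorLike { name: string; canExtract(s: SourceLike): boolean }

const registry: ExtractorLike[] = [
  { name: "pdf",  canExtract: (s) => s.contentType === "application/pdf" },
  { name: "ocr",  canExtract: (s) => s.contentType.startsWith("image/") },
  { name: "html", canExtract: () => true }, // catch-all, deliberately last
];

const findBest = (s: SourceLike) => registry.find((e) => e.canExtract(s))?.name;

findBest({ contentType: "application/pdf" }); // "pdf" – wins before html
```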
Strategy-dispatched: the LLM extractor
The llm extractor is not in the auto-registry. The pipeline invokes createLlmExtractor() directly when the caller passes strategy: "llm" or "hybrid"; the extractor then sends the prepared content to an LLM and parses a JSON response.
| # | Extractor | How to invoke | Produces | Peer dep |
|---|---|---|---|---|
| 6 | llm | pluck(uri, { extract: { strategy: "llm", ... } }) | text + structured data from an LLM | None (direct fetch to OpenAI / Anthropic / Ollama, or MCP host sampling) |
See Native Zod passthrough below for the typed-schema flow and Concepts: Extract for the full strategy selection rules.
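Because the llm extractor parses a JSON response from the model, replies wrapped in markdown code fences have to be unwrapped first. A hedged sketch of that kind of tolerant parsing; `parseLlmJson` is hypothetical, not Pluck's implementation:

```typescript
// Hypothetical helper: pull a JSON object out of an LLM reply that may be
// wrapped in a markdown code fence. Not Pluck's actual implementation.
const FENCE = "`".repeat(3); // the three-backtick fence marker, built at runtime

function parseLlmJson(reply: string): Record<string, unknown> {
  const fenced = reply.match(new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE));
  return JSON.parse((fenced ? fenced[1] : reply).trim());
}
```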
Extraction strategies
Three extraction paths – html, llm, and hybrid – accept an ExtractStrategy:
```typescript
type ExtractStrategy = "auto" | "css" | "regex" | "llm" | "hybrid";
```
| Strategy | Cost | When to use | method values emitted |
|---|---|---|---|
| auto (default) | Free | Let the registry pick based on content. | html:strip-tags, html:text, deserialize:*, pdf:basic, ocr:tesseract, transcribe:whisper |
| css | Free | Stable source, fields nameable by selector. | html:css |
| regex | Free | Unstable markup, predictable shape. | html:regex |
| llm | Model tokens | Fuzzy / drift-resistant extraction. | llm:structured, llm:zod (when schema is a Zod type) |
| hybrid | CSS first, LLM fallback | Most fields are deterministic, one or two are fuzzy. | hybrid:html+llm |
Native Zod passthrough
extract.schema accepts either a JSON Schema object or a Zod schema. When a Zod schema is supplied, Pluck auto-derives fields from the top-level shape, validates the LLM response via safeParse, strips unknown keys, and surfaces method: "llm:zod". See Concepts: Extract → Native Zod schemas.
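One common way to tell the two schema kinds apart is duck-typing on `safeParse`; whether Pluck detects Zod exactly this way is an assumption, but the sketch shows the idea:

```typescript
// Sketch: distinguish a Zod schema from a plain JSON Schema object by
// probing for `safeParse`. Illustrative only; Pluck's detection may differ.
function isZodLike(
  schema: unknown
): schema is { safeParse: (v: unknown) => { success: boolean } } {
  return typeof (schema as { safeParse?: unknown })?.safeParse === "function";
}

isZodLike({ type: "object", properties: {} });                   // false → JSON Schema path
isZodLike({ safeParse: (_: unknown) => ({ success: true }) });   // true  → llm:zod path
```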
Hybrid escalation
hybrid tries html:css first. If the base extraction's confidence falls below ~0.7 or the text is shorter than 100 characters, it falls back to llm. The result carries method: "hybrid:html+llm" and merges base metadata with LLM-extracted fields.
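The escalation rule reduces to a two-clause predicate. A sketch using the thresholds quoted above (~0.7 confidence, 100 characters); treat both numbers as approximate and `shouldEscalate` as a hypothetical name:

```typescript
// Sketch of the hybrid escalation check: fall back to the LLM when the
// CSS pass is low-confidence or the extracted text is too short.
function shouldEscalate(base: { confidence: number; text: string }): boolean {
  return base.confidence < 0.7 || base.text.length < 100;
}

shouldEscalate({ confidence: 0.9, text: "x".repeat(500) }); // false – keep html:css result
shouldEscalate({ confidence: 0.4, text: "x".repeat(500) }); // true  – escalate to llm
```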
ExtractResult shape
Every extractor emits the same shape:
```typescript
interface ExtractResult {
  text: string;                   // always present – clean text
  segments?: Segment[];           // subtitles, PDF pages, regex captures
  data?: Record<string, unknown>; // structured, when available
  pages?: Page[];                 // PDF / OCR per-page
  images?: ImageResult[];         // inline images when includeImages
  confidence: number;             // 0..1 heuristic
  method: string;                 // "html:css" / "transcribe:whisper" / ...
}
```
confidence is a heuristic – html reports 0.9 on well-structured pages, pdf:basic reports 0.7, ocr:tesseract reports Tesseract's own confidence score, transcribe:whisper reports the provider's segment-averaged confidence.
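Because the scale is heuristic and per-extractor, downstream code usually treats `confidence` as a routing signal rather than a probability. An illustrative sketch with an arbitrary cutoff; 0.75 is this example's choice, not a Pluck default:

```typescript
// Sketch: route low-confidence extractions to a review queue.
// The 0.75 threshold is arbitrary, chosen for this example only.
function needsReview(result: { confidence: number; method: string }): boolean {
  return result.confidence < 0.75;
}

needsReview({ confidence: 0.9, method: "html:css" });  // false – trust it
needsReview({ confidence: 0.7, method: "pdf:basic" }); // true  – flag for review
```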
Introspection
Every live PluckInstance exposes the registry:
```typescript
import { createPluck } from "@sizls/pluck";

const pluck = createPluck();

pluck.extractors.list();
// ["transcribe", "deserialize", "pdf", "ocr", "html"]
// ↑ the five auto-registered extractors. `llm` is strategy-dispatched,
// not in this list; invoke it via extract: { strategy: "llm" }.

pluck.extractors.findBest(source);
// "html" – when `source.contentType = "text/html"`

pluck.extractors.findAll(source);
// ["html", "deserialize"] – the full fallback chain, specific-first
```
Register your own via createPluck({ extractors: [...] }) or instance.extractors.register(customExtractor). Custom extractors are prepended, so your canExtract() wins over the built-ins.
See Concepts: Extract → Custom extractors for the defineExtractor() pattern.
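A custom extractor only needs the three members this page describes on the built-ins: a name, a `canExtract()` probe, and an `extract()` body. A minimal sketch shaped like that description; the plain object literal stands in for the real `defineExtractor()` helper, which may enforce more:

```typescript
// Sketch of a custom extractor mirroring the shape this page describes.
// Illustrative only – the real defineExtractor() contract may differ.
const markdownExtractor = {
  name: "markdown",
  canExtract: (source: { contentType: string }) =>
    source.contentType.includes("markdown"),
  async extract(bytes: Uint8Array) {
    // Strip leading heading markers – good enough for a demo.
    const text = new TextDecoder().decode(bytes).replace(/^#+\s*/gm, "");
    return { text, confidence: 0.9, method: "markdown:strip" };
  },
};
```

Registered via instance.extractors.register(markdownExtractor), it is prepended, so its canExtract() is probed before the built-ins and text/markdown never falls through to the html catch-all.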
What's next
- Concepts: Extract – the phase mental model, strategy selection, options.
- Concepts: Shape – the next phase; pin extracted data to a Zod contract.
- Reference: Connectors – the transports that feed bytes into extract.