Extractors Reference

Pluck ships five auto-registered extractors that bid on every NavigateResult via canExtract(), plus a sixth LLM path that the pipeline invokes directly when a caller asks for strategy: "llm" or "hybrid".


How to read this page

The extract phase takes bytes + content type from Navigate and turns them into three things downstream phases care about: clean text, optional temporal segments, and optional structured data.

  • Registry order matters. The registry walks each extractor's canExtract() probe in a deterministic specific-before-generic order. First match wins.
  • The method field on every ExtractResult has the form extractor:strategy (e.g. html:css, pdf:basic, llm:zod). Log it and include it in your traces.
  • Every extractor honours the same ExtractOptions.signal for caller-initiated abort.
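
Because method is a plain extractor:strategy string, it can be split into separate trace attributes. The helper below is a minimal sketch, not part of Pluck: parseMethod, traceAttributes, and the attribute names are hypothetical and chosen only for illustration.

```typescript
// Hypothetical helpers (not Pluck APIs): split "extractor:strategy" for logging.
interface ParsedMethod {
  extractor: string;
  strategy: string;
}

function parseMethod(method: string): ParsedMethod {
  const idx = method.indexOf(":");
  // No colon: treat the whole string as the extractor name.
  if (idx === -1) return { extractor: method, strategy: "" };
  // Split on the first colon only, so "hybrid:html+llm" keeps "html+llm" intact.
  return { extractor: method.slice(0, idx), strategy: method.slice(idx + 1) };
}

// Attribute keys are illustrative; use whatever your tracing setup expects.
function traceAttributes(method: string): Record<string, string> {
  const { extractor, strategy } = parseMethod(method);
  return {
    "extract.method": method,
    "extract.extractor": extractor,
    "extract.strategy": strategy,
  };
}
```

Keeping the raw method string alongside the split parts means dashboards can group by extractor while logs still show the exact value Pluck emitted.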

See Concepts: Extract for the mental model and strategy selection rules.


The auto-registry (5 extractors)

These five extractors are in getDefaultExtractors() and compete for every NavigateResult via canExtract(). Registry order is specific-first – the narrower extractors fire before the HTML catch-all.

| # | Extractor | canExtract matches | Produces | Peer dep |
|---|-----------|--------------------|----------|----------|
| 1 | transcribe | sourceType ∈ { audio, video, youtube } | text + timestamped segments | Whisper (Groq / OpenAI / Deepgram) – configured via llm.apiKey |
| 2 | deserialize | contentType contains json / yaml / xml / csv / tsv | text + parsed data – zero LLM calls | None |
| 3 | pdf | application/pdf | text via built-in PDF stream scanner | unpdf (optional – improves extraction quality) |
| 4 | ocr | image/* | text + scalar confidence | tesseract.js |
| 5 | html | text/html or plain-text content (catch-all) | text with tags stripped and whitespace collapsed | None |

The html extractor is deliberately last, so a narrower extractor fires first whenever its canExtract() matches: when you pluck a PDF, the pdf extractor wins before html ever sees the bytes.
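
The first-match-wins walk can be sketched as below. This is an illustration under stated assumptions, not Pluck's implementation: SourceLike, ExtractorLike, sketchRegistry, and firstMatch are hypothetical names, and the matching rules only paraphrase the "canExtract matches" column of the table.

```typescript
// Hypothetical shapes; the real NavigateResult and extractor types are not shown on this page.
interface SourceLike {
  sourceType?: string;
  contentType?: string;
}

interface ExtractorLike {
  name: string;
  canExtract(source: SourceLike): boolean;
}

const contentTypeOf = (s: SourceLike): string => (s.contentType ?? "").toLowerCase();

// Specific-first order, mirroring rows 1-5 of the table.
const sketchRegistry: ExtractorLike[] = [
  { name: "transcribe", canExtract: (s) => ["audio", "video", "youtube"].includes(s.sourceType ?? "") },
  { name: "deserialize", canExtract: (s) => ["json", "yaml", "xml", "csv", "tsv"].some((k) => contentTypeOf(s).includes(k)) },
  { name: "pdf", canExtract: (s) => contentTypeOf(s).startsWith("application/pdf") },
  { name: "ocr", canExtract: (s) => contentTypeOf(s).startsWith("image/") },
  { name: "html", canExtract: (s) => contentTypeOf(s).startsWith("text/html") || contentTypeOf(s).startsWith("text/plain") },
];

// Walk the probes in order and return the first extractor that claims the source.
function firstMatch(source: SourceLike): string | undefined {
  return sketchRegistry.find((e) => e.canExtract(source))?.name;
}
```

With this ordering, a source whose sourceType is youtube goes to transcribe even if its content type is text/html, because the narrower probe runs first.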

Strategy-dispatched: the LLM extractor

The llm extractor is not in the auto-registry. The pipeline invokes createLlmExtractor() directly when the caller passes strategy: "llm" or "hybrid"; the extractor then sends the prepared content to an LLM and parses a JSON response.

| # | Extractor | How to invoke | Produces | Peer dep |
|---|-----------|---------------|----------|----------|
| 6 | llm | pluck(uri, { extract: { strategy: "llm", ... } }) | text + structured data from an LLM | None (direct fetch to OpenAI / Anthropic / Ollama, or MCP host sampling) |

See Native Zod passthrough below for the typed-schema flow and Concepts: Extract for the full strategy selection rules.


Extraction strategies

The html and llm extractors, and the hybrid path that chains them, accept an ExtractStrategy:

TypeScript
type ExtractStrategy = "auto" | "css" | "regex" | "llm" | "hybrid";

| Strategy | Cost | When to use | method values emitted |
|----------|------|-------------|-----------------------|
| auto (default) | Free | Let the registry pick based on content. | html:strip-tags, html:text, deserialize:*, pdf:basic, ocr:tesseract, transcribe:whisper |
| css | Free | Stable source, fields nameable by selector. | html:css |
| regex | Free | Unstable markup, predictable shape. | html:regex |
| llm | Model tokens | Fuzzy / drift-resistant extraction. | llm:structured, llm:zod (when schema is a Zod type) |
| hybrid | CSS first, LLM fallback | Most fields are deterministic, one or two are fuzzy. | hybrid:html+llm |

Native Zod passthrough

extract.schema accepts either a JSON Schema object or a Zod schema. When a Zod schema is supplied, Pluck auto-derives fields from the top-level shape, validates the LLM response via safeParse, strips unknown keys, and surfaces method: "llm:zod". See Concepts: Extract → Native Zod schemas.
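
The flow described above (derive fields from the top-level shape, validate with safeParse, drop unknown keys) can be sketched as follows. This is not Pluck's code: ObjectSchemaLike, fieldsFromSchema, and parseLlmJson are hypothetical, and the hand-rolled productSchema stands in for a real Zod object so the block runs without dependencies. The assumption is that a Zod object schema, which exposes shape and safeParse, would satisfy the same interface.

```typescript
// Structural stand-in for the two Zod object members this sketch relies on.
interface ParseOutcome<T> {
  success: boolean;
  data?: T;
  error?: unknown;
}

interface ObjectSchemaLike<T> {
  shape: Record<string, unknown>;
  safeParse(input: unknown): ParseOutcome<T>;
}

// Field names come from the top-level keys of the schema's shape.
function fieldsFromSchema<T>(schema: ObjectSchemaLike<T>): string[] {
  return Object.keys(schema.shape);
}

// Parse the raw LLM reply as JSON, then validate it against the schema.
function parseLlmJson<T>(
  schema: ObjectSchemaLike<T>,
  raw: string,
): { method: "llm:zod"; data: T } | { error: unknown } {
  let candidate: unknown;
  try {
    candidate = JSON.parse(raw);
  } catch (error) {
    return { error };
  }
  const outcome = schema.safeParse(candidate);
  if (outcome.success && outcome.data !== undefined) {
    return { method: "llm:zod", data: outcome.data };
  }
  return { error: outcome.error };
}

// Hand-rolled schema for the demo; it returns only known keys, as a stripping schema would.
const productSchema: ObjectSchemaLike<{ title: string; price: number }> = {
  shape: { title: "string", price: "number" },
  safeParse(input) {
    const obj = input as Record<string, unknown> | null;
    if (obj && typeof obj === "object" && typeof obj.title === "string" && typeof obj.price === "number") {
      return { success: true, data: { title: obj.title, price: obj.price } };
    }
    return { success: false, error: new Error("shape mismatch") };
  },
};
```

Returning an error value instead of throwing keeps a malformed LLM reply from aborting the whole extraction; the caller decides whether to retry or fall back.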

Hybrid escalation

hybrid tries html:css first. If the base extraction's confidence falls below ~0.7 or the text is shorter than 100 characters, it falls back to llm. The result carries method: "hybrid:html+llm" and merges base metadata with LLM-extracted fields.
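
The escalation rule reads naturally as a single predicate. The sketch below assumes the cutoff is exactly 0.7 (the text above says "~0.7") and uses hypothetical names: BaseResult and shouldEscalateToLlm are not Pluck APIs.

```typescript
// Hypothetical subset of the base extraction result used by the check.
interface BaseResult {
  text: string;
  confidence: number;
}

const MIN_CONFIDENCE = 0.7; // assumption: the page gives "~0.7", not an exact value
const MIN_TEXT_LENGTH = 100; // characters

// True when the CSS pass looks too weak and the LLM fallback should run.
function shouldEscalateToLlm(base: BaseResult): boolean {
  return base.confidence < MIN_CONFIDENCE || base.text.length < MIN_TEXT_LENGTH;
}
```

Either condition alone triggers the fallback, so a confident but very short extraction still escalates.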


ExtractResult shape

Every extractor emits the same shape:

TypeScript
interface ExtractResult {
  text: string;                         // always present – clean text
  segments?: Segment[];                 // subtitles, PDF pages, regex captures
  data?: Record<string, unknown>;       // structured, when available
  pages?: Page[];                       // PDF / OCR per-page
  images?: ImageResult[];               // inline images when includeImages
  confidence: number;                   // 0..1 heuristic
  method: string;                       // "html:css" / "transcribe:whisper" / ...
}

confidence is a heuristic: html reports 0.9 on well-structured pages, pdf:basic reports 0.7, ocr:tesseract reports Tesseract's own confidence score, and transcribe:whisper reports the provider's segment-averaged confidence.


Introspection

Every live PluckInstance exposes the registry:

TypeScript
import { createPluck } from "@sizls/pluck";

const pluck = createPluck();

pluck.extractors.list();
// ["transcribe", "deserialize", "pdf", "ocr", "html"]
// ↑ the five auto-registered extractors. `llm` is strategy-dispatched,
//   not in this list; invoke it via extract: { strategy: "llm" }.

pluck.extractors.findBest(source);
// "html" – when `source.contentType = "text/html"`

pluck.extractors.findAll(source);
// ["html", "deserialize"] – the full fallback chain, specific-first

Register your own via createPluck({ extractors: [...] }) or instance.extractors.register(customExtractor). Custom extractors are prepended, so your canExtract() wins over the built-ins.
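
Why prepending makes a custom extractor win can be shown with a small stand-in registry. This is a sketch, not Pluck's registry: SketchRegistry, SourceLike, ExtractorLike, and the changelog extractor are hypothetical, and only the prepend-then-first-match behaviour is taken from the text above.

```typescript
// Hypothetical shapes for illustration only.
interface SourceLike {
  contentType?: string;
}

interface ExtractorLike {
  name: string;
  canExtract(source: SourceLike): boolean;
}

class SketchRegistry {
  private extractors: ExtractorLike[];

  constructor(builtIns: ExtractorLike[]) {
    this.extractors = [...builtIns];
  }

  // Custom extractors go to the front, ahead of every built-in.
  register(custom: ExtractorLike): void {
    this.extractors.unshift(custom);
  }

  list(): string[] {
    return this.extractors.map((e) => e.name);
  }

  // First extractor whose probe matches wins.
  findBest(source: SourceLike): string | undefined {
    return this.extractors.find((e) => e.canExtract(source))?.name;
  }
}

const isHtml = (s: SourceLike): boolean => (s.contentType ?? "").startsWith("text/html");

const registry = new SketchRegistry([{ name: "html", canExtract: isHtml }]);
registry.register({ name: "changelog", canExtract: isHtml });
```

Both extractors match text/html here, but findBest returns the custom one because it sits earlier in the list. A custom canExtract() should therefore be as narrow as possible, or it will shadow the built-ins for every source it matches.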

See Concepts: Extract → Custom extractors for the defineExtractor() pattern.

