Extract
The third phase of the Pluck pipeline. Bytes in, structured content out.
The mental model
Connect hands the pipeline raw bytes plus a content type. Navigate prepares those bytes – pass-through by default, Readability / Playwright / agent-driven when asked. Extract's job is to take the prepared result and convert it into three things every downstream phase can work with:
- text – a clean, human-readable string.
- segments – optional temporal or positional chunks (subtitle cues, PDF pages, regex captures).
- data – optional structured key/value output for APIs that return JSON or for LLM-powered extraction against a schema.
Extract is a phase, not a standalone verb. You call pluck(uri, { extract }) and the phase runs after connect and navigate. The phase is also deliberately pluggable – each extractor declares what shapes of NavigateResult it can handle via a canExtract() probe, and the registry picks the first match.
import { pluck } from "@sizls/pluck";
// Default auto strategy – registry picks the best extractor for the content
const result = await pluck("https://example.com/post");
console.log(result.text); // clean text
console.log(result.segments); // per-section chunks where available
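The same default handles structured sources with zero LLM calls – a minimal sketch, assuming a hypothetical JSON endpoint:
const api = await pluck("https://api.example.com/users.json");
console.log(api.method); // "deserialize:json" – contentType matched before html
console.log(api.data); // parsed object, no LLM involved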
Built-in extractors
Pluck ships with five extractors, registered in a specific-before-generic order. The registry picks the first extractor whose canExtract() returns true, so narrow extractors win by coming earlier in the list:
| # | Extractor | Matches when | What it produces |
|---|---|---|---|
| 1 | transcribe | sourceType ∈ { audio, video, youtube } | text + timestamped segments via Whisper (Groq / OpenAI / Deepgram) |
| 2 | deserialize | contentType contains json, yaml, xml, csv, or tsv | text + parsed data – zero LLM calls |
| 3 | pdf | application/pdf | text via built-in PDF stream scanner (install unpdf for better extraction) |
| 4 | ocr | image/* | text + scalar confidence via Tesseract WASM |
| 5 | html | string content with html or text contentType | text with HTML tags stripped and whitespace collapsed |
html is the catch-all – it's last in the list by design, so each of the four narrower extractors gets the chance to fire first when its canExtract() matches. When you pluck a PDF, the pdf extractor wins before html ever sees the bytes.
The method field on every ExtractResult tells you which extractor ran and which sub-strategy it used – formatted as extractor:strategy. Examples: html:strip-tags, html:text, pdf:basic, ocr:tesseract, transcribe:whisper, deserialize:json, deserialize:yaml, deserialize:csv, deserialize:xml, deserialize:tsv, llm:structured, llm:zod, hybrid:html+llm.
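To see the ordering and the method field together – a minimal sketch, assuming a hypothetical PDF URL:
const report = await pluck("https://example.com/annual-report.pdf");
console.log(report.method); // "pdf:basic" – pdf matched before html saw the bytes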
Full extractor list lives at Reference → API.
Strategies
The HTML extractor, and any extractor that uses structured fields, accepts an ExtractStrategy:
type ExtractStrategy = "llm" | "css" | "regex" | "hybrid" | "auto";
- css – selectors. Zero LLM calls. Fastest and free. Best when the source is stable and you can name the fields by selector:

await pluck("https://news.example.com", {
  extract: {
    strategy: "css",
    selectors: ["article h1", "article .byline", "time"],
  },
});

- regex – capture patterns. Also zero LLM calls. Best when the shape is predictable but the markup is not:

await pluck("https://example.com/status", {
  extract: {
    strategy: "regex",
    patterns: [/Order #(\d+)/, /Total: \$([\d.]+)/],
  },
});

- llm – pass a prompt and a schema; Claude or another model returns typed JSON:

await pluck("https://example.com/complex-post", {
  extract: {
    strategy: "llm",
    prompt: "Extract the headline, author, and publish date",
    fields: ["headline", "author", "date"],
    llm: { apiKey: process.env.ANTHROPIC_KEY, model: "claude-sonnet-4-6" },
  },
});

Native Zod schemas. Pass a Zod schema directly and Pluck takes it from there – field names are auto-derived from the top-level shape, the LLM response is validated with safeParse(), and result.data has the inferred type. On mismatch, Pluck throws ExtractionError("LLM_SCHEMA_MISMATCH") with the failing path; on success, result.method is "llm:zod" so traces can tell Zod-validated calls apart from free-form ones.

import { z } from "zod";
import { pluck } from "@sizls/pluck";
const Product = z.object({
  name: z.string(),
  price: z.number(),
  inStock: z.boolean(),
});
const { data } = await pluck("https://example.com/p/widget-pro", {
  extract: {
    strategy: "llm",
    schema: Product,
    llm: { apiKey: process.env.ANTHROPIC_KEY, model: "claude-sonnet-4-6" },
  },
});
// data is validated, unknown fields stripped, and matches z.infer<typeof Product>

- hybrid – try CSS first, fall back to LLM only for fields CSS couldn't fill. Best for drift-resistant scraping where most fields are deterministic and one or two are fuzzy.
- auto (default) – the registry picks. Structured sources (RSS, JSON, YAML) get deserialize; media gets transcribe / ocr / pdf; HTML gets html with a content-appropriate inner strategy.
Side-by-side: same URL, four strategies
Pick the strategy that maps to what you actually know about the source. Here's the same product page extracted four different ways:
import { pluck } from "@sizls/pluck";
import { z } from "zod";
const URL = "https://example.com/products/widget-pro";
// 1. CSS – you know the DOM. Cheapest. Deterministic.
const css = await pluck(URL, {
extract: {
strategy: "css",
selectors: [".product-title", ".price", ".sku"],
},
});
// → result.data: { "product-title": "...", price: "...", sku: "..." }
// method: "html:css"
// 2. Regex – markup is unstable but the shape is predictable.
const regex = await pluck(URL, {
extract: {
strategy: "regex",
patterns: [/SKU:\s*([A-Z0-9-]+)/, /\$([\d,.]+)/],
},
});
// → result.segments: [{ captures: ["WP-001"] }, { captures: ["29.99"] }]
// method: "html:regex"
// 3. LLM with Zod – you want typed output, don't care about the DOM.
const Product = z.object({
title: z.string(),
price: z.number(),
sku: z.string(),
});
const typed = await pluck(URL, {
extract: {
strategy: "llm",
schema: Product,
llm: {
apiKey: process.env.ANTHROPIC_API_KEY,
model: "claude-sonnet-4-6",
},
},
});
// → result.data: { title: "Widget Pro", price: 29.99, sku: "WP-001" }
// method: "llm:zod" (because a Zod schema was passed)
// 4. Hybrid – most of the page is stable CSS, one or two fields are fuzzy.
const hybrid = await pluck(URL, {
extract: {
strategy: "hybrid",
selectors: [".product-title", ".sku"], // deterministic
prompt: "Extract the current sale price in dollars as a number.",
llm: { apiKey: process.env.ANTHROPIC_API_KEY },
},
});
// → hybrid.data merges CSS + LLM outputs
// method: "hybrid:html+llm"
Rule of thumb for picking one:
| Signal | Pick |
|---|---|
| You can name fields by CSS selector. | css |
| Markup is noisy but the text shape is predictable. | regex |
| You know the output shape but not the HTML. | llm + Zod |
| Source is structured JSON / YAML / RSS / CSV. | auto (→ deserialize) |
| Source is audio / video. | auto (→ transcribe) |
| Mix of deterministic + fuzzy fields. | hybrid |
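For the media and structured rows, auto needs no configuration – a minimal sketch, assuming a hypothetical podcast URL:
const episode = await pluck("https://example.com/podcast.mp3");
console.log(episode.method); // "transcribe:whisper"
console.log(episode.segments?.[0]); // timestamped cue from the transcript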
Extract options
Every call through pluck() accepts an extract option block:
interface ExtractOptions {
strategy?: ExtractStrategy; // default "auto"
selectors?: string[]; // css
patterns?: RegExp[]; // regex
prompt?: string; // llm
fields?: string[]; // llm – whitelist
schema?: Record<string, unknown> | z.ZodType; // llm – JSON Schema or Zod
includeImages?: boolean; // ocr result.images[]
ocr?: OcrConfig; // per-call Tesseract tuning
llm?: LlmConfig; // per-call LLM credentials + model
signal?: AbortSignal; // per-call abort
}
Pluck's options are open-ended on purpose. Every extractor reads the options it cares about and ignores the rest – the same option block flows through the pipeline unchanged.
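For example, one block can carry tuning for several extractors at once, and each path reads only its own keys. A sketch – the language field inside OcrConfig is an assumption, not a documented option:
await pluck("https://example.com/mixed", {
  extract: {
    selectors: ["article h1"], // read on the html css path, ignored elsewhere
    ocr: { language: "eng" }, // read only if the ocr extractor runs (field name assumed)
    llm: { apiKey: process.env.ANTHROPIC_KEY }, // read only on llm paths
  },
});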
What you get back
Every extractor produces the same ExtractResult shape:
interface ExtractResult {
text: string; // always present – clean text
segments?: Segment[]; // subtitles, PDF pages, regex captures
data?: Record<string, unknown>; // structured, when available
pages?: Page[]; // PDF / OCR per-page
images?: ImageResult[]; // inline images when includeImages
confidence: number; // 0..1 heuristic
method: string; // "html:css" / "transcribe:whisper" / etc.
}
method is the single most useful field for debugging – it tells you which extractor won and which sub-strategy it used. Log it; put it in your traces.
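A minimal tracing sketch – the 0.5 cutoff is an arbitrary choice, not a library default:
const result = await pluck("https://example.com/post");
console.log(`extract method=${result.method} confidence=${result.confidence}`);
if (result.confidence < 0.5) {
  console.warn("low-confidence extraction – consider a narrower strategy");
}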
Custom extractors
The extractor registry is open. Same pattern as connectors – defineExtractor gives you parameter inference inside canExtract and extract, and createPluck({ extractors: [...] }) registers at construction time:
import { createPluck, defineExtractor } from "@sizls/pluck";
const pricingTable = defineExtractor({
name: "pricing-table",
canExtract: (source) =>
source.contentType.includes("text/html") &&
source.url.includes("/pricing"),
async extract(source, options) {
const html = String(source.content);
const tiers = Array.from(html.matchAll(/<tr class="tier">([\s\S]*?)<\/tr>/g));
return {
text: tiers.map((t) => t[1]).join("\n"),
data: { tiers: tiers.map((t, i) => ({ position: i, raw: t[1] })) },
confidence: 0.9,
method: "pricing-table:regex",
};
},
});
const pluck = createPluck({ extractors: [pricingTable] });
canExtract gets called first for every extraction attempt. Return true when the NavigateResult is something your extractor understands. The registry picks the first match in order – custom extractors run before built-ins, whether you pass them through createPluck({ extractors }) or register post-construction via instance.extractors.register(pricingTable).
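The post-construction path named above, as a minimal sketch:
const instance = createPluck();
instance.extractors.register(pricingTable); // now first in line, ahead of the built-ins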
Want to inspect the registry or answer "who would extract this source?" without running anything?
pluck.extractors.list(); // ["pricing-table", "transcribe", "deserialize", ...]
pluck.extractors.findBest(source); // "pricing-table"
pluck.extractors.findAll(source); // ["pricing-table", "html"] – fallback chain
Drift detection
Extraction is where web pages bite you – a selector breaks, a field disappears, an LLM starts returning a different shape. Pluck's extract phase itself stays silent about drift (it returns whatever it got), but the next phase – Shape – catches every drift moment with an onDrift signal against a Zod schema. The pair is the point: extract pulls, shape validates against your contract.
For contract testing today, extract + shape gives you a "selector changed in production" signal in ~15 lines. The roadmap includes pluck.verify(url, schema) as a CLI daemon – see Ideas backlog.
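A sketch of that ~15-line contract test – the shape option names (schema, onDrift) are assumptions here; the Shape page documents the real contract:
import { z } from "zod";
import { pluck } from "@sizls/pluck";
const Contract = z.object({
  title: z.string(),
  price: z.number(),
});
await pluck("https://example.com/products/widget-pro", {
  extract: { strategy: "css", selectors: [".product-title", ".price"] },
  // `shape`, `schema`, and `onDrift` are assumed names – see Shape
  shape: {
    schema: Contract,
    onDrift: (drift) => console.error("selector changed in production:", drift),
  },
});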
What's next
- Shape – once you have extracted content, pin its shape with Zod.
- Act – signed, reversible mutations on the same URIs.
- Reference: CLI – every command that composes connect → extract → shape.