
Core Concepts

Extract

The third phase of the Pluck pipeline. Bytes in, structured content out.


The mental model

Connect hands the pipeline raw bytes plus a content type. Navigate prepares those bytes – pass-through by default, Readability / Playwright / agent-driven when asked. Extract's job is to take the prepared result and convert it into three things every downstream phase can work with:

  1. text – a clean, human-readable string.
  2. segments – optional temporal or positional chunks (subtitle cues, PDF pages, regex captures).
  3. data – optional structured key/value output for APIs that return JSON or for LLM-powered extraction against a schema.

Extract is a phase, not a standalone verb. You call pluck(uri, { extract }) and the phase runs after connect and navigate. The phase is also deliberately pluggable – each extractor declares which shapes of NavigateResult it can handle via a canExtract() probe, and the registry picks the first match.

TypeScript
import { pluck } from "@sizls/pluck";

// Default auto strategy – registry picks the best extractor for the content
const result = await pluck("https://example.com/post");
console.log(result.text);      // clean text
console.log(result.segments);  // per-section chunks where available

Built-in extractors

Pluck ships with 5 extractors, registered in a specific-before-generic order. The registry picks the first extractor whose canExtract() returns true, so narrow extractors win by being earlier in the list:

| # | Extractor | Matches when | What it produces |
|---|-----------|--------------|------------------|
| 1 | transcribe | sourceType ∈ { audio, video, youtube } | text + timestamped segments via Whisper (Groq / OpenAI / Deepgram) |
| 2 | deserialize | contentType contains json, yaml, xml, csv, or tsv | text + parsed data – zero LLM calls |
| 3 | pdf | application/pdf | text via built-in PDF stream scanner (install unpdf for better extraction) |
| 4 | ocr | image/* | text + scalar confidence via Tesseract WASM |
| 5 | html | string content with html or text contentType | text with HTML tags stripped and whitespace collapsed |

html is the catch-all – it's last in the list by design, so any of the four narrower extractors fire first when their canExtract() matches. When you pluck a PDF, the pdf extractor wins before html ever sees the bytes.
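That first-match priority can be sketched in a few lines. This is an illustrative model of the registry idea, not Pluck's actual internals – the Source and Extractor shapes here are assumptions for the sketch:

```typescript
// Minimal model of a specific-before-generic registry:
// the first extractor whose probe returns true wins.
interface Source {
  contentType: string;
}

interface Extractor {
  name: string;
  canExtract(source: Source): boolean;
}

function findBest(extractors: Extractor[], source: Source): string | undefined {
  return extractors.find((e) => e.canExtract(source))?.name;
}

const registry: Extractor[] = [
  { name: "pdf", canExtract: (s) => s.contentType === "application/pdf" },
  { name: "html", canExtract: () => true }, // catch-all, deliberately last
];

console.log(findBest(registry, { contentType: "application/pdf" })); // "pdf"
console.log(findBest(registry, { contentType: "text/html" }));       // "html"
```

Because the catch-all returns true for everything, ordering is the entire dispatch mechanism – narrow probes must come first.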

The method field on every ExtractResult tells you which extractor ran and which sub-strategy it used – formatted as extractor:strategy. Examples: html:strip-tags, html:text, pdf:basic, ocr:tesseract, transcribe:whisper, deserialize:json, deserialize:yaml, deserialize:csv, deserialize:xml, deserialize:tsv, llm:structured, llm:zod, hybrid:html+llm.

Full extractor list lives at Reference → API.


Strategies

The HTML extractor, and any extractor that uses structured fields, accepts an ExtractStrategy:

TypeScript
type ExtractStrategy = "llm" | "css" | "regex" | "hybrid" | "auto";

  • css – CSS selectors. Zero LLM calls. Fastest and free. Best when the source is stable and you can name the fields by selector:

    TypeScript
    await pluck("https://news.example.com", {
      extract: {
        strategy: "css",
        selectors: ["article h1", "article .byline", "time"],
      },
    });
    
  • regex – capture patterns. Also zero LLM calls. Best when the shape is predictable but the markup is not:

    TypeScript
    await pluck("https://example.com/status", {
      extract: {
        strategy: "regex",
        patterns: [/Order #(\d+)/, /Total: \$([\d.]+)/],
      },
    });
    
  • llm – pass a prompt and a schema; Claude or another model returns typed JSON:

    TypeScript
    await pluck("https://example.com/complex-post", {
      extract: {
        strategy: "llm",
        prompt: "Extract the headline, author, and publish date",
        fields: ["headline", "author", "date"],
        llm: { apiKey: process.env.ANTHROPIC_KEY, model: "claude-sonnet-4-6" },
      },
    });
    

    Native Zod schemas. Pass a Zod schema directly and Pluck takes it from there – field names are auto-derived from the top-level shape, the LLM response is validated with safeParse(), and result.data has the inferred type. On mismatch, Pluck throws ExtractionError("LLM_SCHEMA_MISMATCH") with the failing path; on success, result.method is "llm:zod" so traces can tell Zod-validated calls apart from free-form ones.

    TypeScript
    import { z } from "zod";
    import { pluck } from "@sizls/pluck";
    
    const Product = z.object({
      name: z.string(),
      price: z.number(),
      inStock: z.boolean(),
    });
    
    const { data } = await pluck("https://example.com/p/widget-pro", {
      extract: {
        strategy: "llm",
        schema: Product,
        llm: { apiKey: process.env.ANTHROPIC_KEY, model: "claude-sonnet-4-6" },
      },
    });
    
    // data is validated, unknown fields stripped, and matches z.infer<typeof Product>
    
  • hybrid – try CSS first, fall back to LLM only for fields CSS couldn't fill. Best for drift-resistant scraping where most fields are deterministic and one or two are fuzzy.

  • auto (default) – the registry picks. Sources of structured content (RSS, JSON, YAML) get deserialize; media gets transcribe / ocr / pdf; HTML gets html with a content-appropriate inner strategy.
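The hybrid fallback described above can be pictured as a merge where the LLM is only asked about fields CSS left blank. A hand-rolled sketch, not Pluck's implementation – `hybridMerge` and `llmFill` are hypothetical names standing in for the real phases:

```typescript
// Sketch of the hybrid strategy: deterministic CSS first, LLM only for gaps.
type Fields = Record<string, string | undefined>;

function hybridMerge(
  cssResults: Fields,
  llmFill: (missing: string[]) => Fields,
): Fields {
  const missing = Object.keys(cssResults).filter((k) => cssResults[k] === undefined);
  // Only invoke the (expensive) LLM when CSS actually left gaps.
  const filled = missing.length > 0 ? llmFill(missing) : {};
  return { ...cssResults, ...filled };
}

const merged = hybridMerge(
  { title: "Widget Pro", sku: "WP-001", price: undefined },
  (missing) => Object.fromEntries(missing.map((k) => [k, "29.99"])),
);
console.log(merged); // price filled by the LLM path, title/sku from CSS
```

The design choice to localize LLM calls to the missing fields is what keeps hybrid both cheap and drift-resistant: a selector breaking degrades one field into an LLM call instead of failing the whole extraction.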

Side-by-side: same URL, four strategies

Pick the strategy that maps to what you actually know about the source. Here's the same product page extracted four different ways:

TypeScript
import { pluck } from "@sizls/pluck";
import { z } from "zod";

const URL = "https://example.com/products/widget-pro";

// 1. CSS – you know the DOM. Cheapest. Deterministic.
const css = await pluck(URL, {
  extract: {
    strategy: "css",
    selectors: [".product-title", ".price", ".sku"],
  },
});
// → css.data: { "product-title": "...", price: "...", sku: "..." }
//   method: "html:css"

// 2. Regex – markup is unstable but the shape is predictable.
const regex = await pluck(URL, {
  extract: {
    strategy: "regex",
    patterns: [/SKU:\s*([A-Z0-9-]+)/, /\$([\d,.]+)/],
  },
});
// → regex.segments: [{ captures: ["WP-001"] }, { captures: ["29.99"] }]
//   method: "html:regex"

// 3. LLM with Zod – you want typed output, don't care about the DOM.
const Product = z.object({
  title: z.string(),
  price: z.number(),
  sku: z.string(),
});
const typed = await pluck(URL, {
  extract: {
    strategy: "llm",
    schema: Product,
    llm: {
      apiKey: process.env.ANTHROPIC_API_KEY,
      model: "claude-sonnet-4-6",
    },
  },
});
// → typed.data: { title: "Widget Pro", price: 29.99, sku: "WP-001" }
//   method: "llm:zod"     (because a Zod schema was passed)

// 4. Hybrid – most of the page is stable CSS, one or two fields are fuzzy.
const hybrid = await pluck(URL, {
  extract: {
    strategy: "hybrid",
    selectors: [".product-title", ".sku"], // deterministic
    prompt: "Extract the current sale price in dollars as a number.",
    llm: { apiKey: process.env.ANTHROPIC_API_KEY },
  },
});
// → hybrid.data merges CSS + LLM outputs
//   method: "hybrid:html+llm"

Rule of thumb for picking one:

| Signal | Pick |
|--------|------|
| You can name fields by CSS selector. | css |
| Markup is noisy but the text shape is predictable. | regex |
| You know the output shape but not the HTML. | llm + Zod |
| Source is structured JSON / YAML / RSS / CSV. | auto (→ deserialize) |
| Source is audio / video. | auto (→ transcribe) |
| Mix of deterministic + fuzzy fields. | hybrid |

Extract options

Every call through pluck() accepts an extract option block:

TypeScript
interface ExtractOptions {
  strategy?: ExtractStrategy;   // default "auto"
  selectors?: string[];         // css
  patterns?: RegExp[];          // regex
  prompt?: string;              // llm
  fields?: string[];            // llm – whitelist
  schema?: Record<string, unknown> | z.ZodType; // llm – JSON Schema or Zod
  includeImages?: boolean;      // ocr result.images[]
  ocr?: OcrConfig;              // per-call Tesseract tuning
  llm?: LlmConfig;              // per-call LLM credentials + model
  signal?: AbortSignal;         // per-call abort
}

Pluck's options are open-ended on purpose. Every extractor reads the options it cares about and ignores the rest – the same option block flows through the pipeline unchanged.
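That pass-through contract can be sketched as each extractor destructuring only its own keys and ignoring the rest. Illustrative only – `regexExtract` here is a hypothetical stand-in, not Pluck's source:

```typescript
// The same option block flows to every extractor; each reads only the
// keys it understands.
interface LooseOptions {
  selectors?: string[];
  patterns?: RegExp[];
  [key: string]: unknown;
}

// The regex path only looks at `patterns` – `selectors`, `llm`, and any
// other keys pass through untouched.
function regexExtract(text: string, options: LooseOptions): string[] {
  const { patterns = [] } = options;
  return patterns.flatMap((p) => {
    const m = text.match(p);
    return m ? [m[1] ?? m[0]] : [];
  });
}

const options: LooseOptions = {
  patterns: [/Order #(\d+)/],
  selectors: [".ignored-by-the-regex-path"],
  llm: { model: "claude-sonnet-4-6" },
};
console.log(regexExtract("Order #42 shipped", options)); // ["42"]
```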


What you get back

Every extractor produces the same ExtractResult shape:

TypeScript
interface ExtractResult {
  text: string;                         // always present – clean text
  segments?: Segment[];                 // subtitles, PDF pages, regex captures
  data?: Record<string, unknown>;       // structured, when available
  pages?: Page[];                       // PDF / OCR per-page
  images?: ImageResult[];               // inline images when includeImages
  confidence: number;                   // 0..1 heuristic
  method: string;                       // "html:css" / "transcribe:whisper" / etc.
}

method is the single most useful field for debugging – it tells you which extractor won and which sub-strategy it used. Log it; put it in your traces.
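Since method is always extractor:strategy, splitting it into structured trace attributes is a one-liner. A trivial helper you would write yourself – it is not shipped by Pluck:

```typescript
// Split an ExtractResult.method like "html:strip-tags" into trace fields.
function parseMethod(method: string): { extractor: string; strategy: string } {
  const idx = method.indexOf(":");
  return idx === -1
    ? { extractor: method, strategy: "" }
    : { extractor: method.slice(0, idx), strategy: method.slice(idx + 1) };
}

console.log(parseMethod("hybrid:html+llm")); // { extractor: "hybrid", strategy: "html+llm" }
console.log(parseMethod("llm:zod"));         // { extractor: "llm", strategy: "zod" }
```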


Custom extractors

The extractor registry is open. Same pattern as connectors – defineExtractor gives you parameter inference inside canExtract and extract, and createPluck({ extractors: [...] }) registers at construction time:

TypeScript
import { createPluck, defineExtractor } from "@sizls/pluck";

const pricingTable = defineExtractor({
  name: "pricing-table",
  canExtract: (source) =>
    source.contentType.includes("text/html") &&
    source.url.includes("/pricing"),
  async extract(source, options) {
    const html = String(source.content);
    const tiers = Array.from(html.matchAll(/<tr class="tier">([\s\S]*?)<\/tr>/g));
    return {
      text: tiers.map((t) => t[1]).join("\n"),
      data: { tiers: tiers.map((t, i) => ({ position: i, raw: t[1] })) },
      confidence: 0.9,
      method: "pricing-table:regex",
    };
  },
});

const pluck = createPluck({ extractors: [pricingTable] });

canExtract gets called first for every extraction attempt. Return true when the NavigateResult is something your extractor understands. The registry picks the first match in order – custom extractors run before built-ins, whether you pass them through createPluck({ extractors }) or register post-construction via instance.extractors.register(pricingTable).

Want to inspect the registry or answer "who would extract this source?" without running anything?

TypeScript
pluck.extractors.list();           // ["pricing-table", "transcribe", "deserialize", ...]
pluck.extractors.findBest(source); // "pricing-table"
pluck.extractors.findAll(source);  // ["pricing-table", "html"] – fallback chain

Drift detection

Extraction is where web pages bite you – a selector breaks, a field disappears, an LLM starts returning a different shape. Pluck's extract phase itself stays silent about drift (it returns whatever it got), but the next phase – Shape – catches every drift moment with an onDrift signal against a Zod schema. The pair is the point: extract pulls, shape validates against your contract.

For contract testing today, extract + shape gives you a "selector changed in production" signal in ~15 lines. The roadmap includes pluck.verify(url, schema) as a CLI daemon – see Ideas backlog.
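The shape of that extract + shape signal can be approximated with a plain validator. This sketch hand-rolls the check to stay dependency-free – in Pluck you would hand Shape a real Zod schema, and `detectDrift` is a hypothetical name, not part of the API:

```typescript
// Approximation of the extract → shape drift signal: validate the
// extracted data against a contract and fire a callback on mismatch.
type Extracted = Record<string, unknown>;

function detectDrift(
  data: Extracted,
  requiredFields: string[],
  onDrift: (missing: string[]) => void,
): boolean {
  const missing = requiredFields.filter((f) => data[f] === undefined);
  if (missing.length > 0) {
    onDrift(missing); // e.g. the "selector changed in production" moment
    return true;
  }
  return false;
}

const drifted: string[] = [];
detectDrift({ title: "Widget Pro" }, ["title", "price"], (m) => drifted.push(...m));
console.log(drifted); // ["price"] – the price selector stopped matching
```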


What's next

  • Shape – once you have extracted content, pin its shape with Zod.
  • Act – signed, reversible mutations on the same URIs.
  • Reference: CLI – every command that composes connect → extract → shape.