Core Concepts

Shape

The fourth phase of the Pluck pipeline. Zod in, contract-validated data out. Drift caught automatically.


The mental model

Connect gives you bytes. Extract gives you text and loosely-typed data. Shape's job is to pin that loose data to the exact schema your database, your API, your agent expects – and to fail loudly when the upstream source changes.

The pipeline looks like this:

Connect → Navigate → Extract → Shape → Output

Shape solves three problems that crash production at 2am:

  1. "The LLM added a _reasoning field and our Supabase upsert failed." Strict mode strips every field that's not in the schema.
  2. "The third-party API renamed game.home.score to game.homeTeam.score and we didn't notice for a week." The onDrift callback fires as soon as a required field goes missing or an unexpected new field appears.
  3. "We need total_score = home + away in the database, not the input." compute functions add derived fields deterministically, before validation.

Shape is a phase, not a standalone verb. You run it via the shape() function exported from @sizls/pluck. It doesn't fetch anything; it transforms data you already have.

TypeScript
import { shape } from "@sizls/pluck";
import { z } from "zod";

const gameSchema = z.object({
  home_score: z.number(),
  away_score: z.number(),
  total: z.number(),
});

const result = shape(rawGameData, {
  schema: gameSchema,
  map: {
    "game.homeTeam.score": "home_score",
    "game.awayTeam.score": "away_score",
  },
  compute: {
    total: (data) => Number(data.home_score) + Number(data.away_score),
  },
});

if (result.valid) {
  // result.data: { home_score: 3, away_score: 1, total: 4 }
  await supabase.from("scores").upsert(result.data);
}

Two modes

Shape runs in one of two modes, decided by whether you pass a map:

Mode 1 – Schema only (validate LLM output)

No mapping. Pluck takes the data you pass in, validates it against the schema, strips anything the schema doesn't know about, and returns the clean result:

TypeScript
import { shape } from "@sizls/pluck";
import { z } from "zod";

const productSchema = z.object({
  title: z.string(),
  price: z.number(),
});

// LLM returned { title: "Widget", price: 29.99, _reasoning: "..." }
const result = shape(llmOutput, { schema: productSchema });
// result.data: { title: "Widget", price: 29.99 }
// result.stripped: ["_reasoning"]

This is the common path for LLM-based extraction. The LLM spits out whatever it feels like; Shape makes sure only what you asked for reaches the database.
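Under the hood, schema-only mode is conceptually a key-set filter followed by validation. A minimal standalone sketch of just the stripping step (illustrative only – `stripUnknownKeys` is a hypothetical name, not Pluck's implementation):

```typescript
// Sketch of strict-mode stripping: keep only the keys the schema declares,
// and report everything that was removed.
type StripResult = {
  data: Record<string, unknown>;
  stripped: string[];
};

function stripUnknownKeys(
  input: Record<string, unknown>,
  allowedKeys: readonly string[],
): StripResult {
  const allowed = new Set(allowedKeys);
  const data: Record<string, unknown> = {};
  const stripped: string[] = [];
  for (const [key, value] of Object.entries(input)) {
    if (allowed.has(key)) {
      data[key] = value; // field the schema knows about – keep it
    } else {
      stripped.push(key); // extra field (e.g. an LLM's _reasoning) – drop it
    }
  }
  return { data, stripped };
}

// LLM output with an extra field the schema never asked for:
const { data, stripped } = stripUnknownKeys(
  { title: "Widget", price: 29.99, _reasoning: "..." },
  ["title", "price"],
);
// data → { title: "Widget", price: 29.99 }, stripped → ["_reasoning"]
```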

Mode 2 – Schema + field map (deterministic API mapping)

When the source API has a known shape, you can deterministically rename and transform fields before validation – no LLM involved. This is the fastest and cheapest mode:

TypeScript
const result = shape(apiResponse, {
  schema: z.object({
    id: z.string(),
    title: z.string(),
    minutes: z.number(),
  }),
  map: {
    "data.id": "id",
    "data.attributes.title": "title",
    "data.attributes.duration_ms": {
      to: "minutes",
      transform: (ms) => Math.floor(Number(ms) / 60_000),
    },
  },
});

Dot-path access works for any depth of nesting. The transform hook receives the raw value and can coerce / normalize before validation.
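Dot-path resolution itself is a small, well-understood mechanism. A self-contained sketch of how such a getter can work (illustrative – `getPath` is a hypothetical name, not Pluck's implementation):

```typescript
// Walk the object one key at a time, bailing out with undefined as soon as
// a path segment is missing or the current value isn't an object.
function getPath(obj: unknown, path: string): unknown {
  let current: unknown = obj;
  for (const segment of path.split(".")) {
    if (current === null || typeof current !== "object") return undefined;
    current = (current as Record<string, unknown>)[segment];
  }
  return current;
}

const apiResponse = {
  data: { attributes: { title: "Ep. 1", duration_ms: 183_000 } },
};

const title = getPath(apiResponse, "data.attributes.title");         // "Ep. 1"
const ms = getPath(apiResponse, "data.attributes.duration_ms");      // 183000
const minutes = Math.floor(Number(ms) / 60_000);                     // 3
const missing = getPath(apiResponse, "data.attributes.nope.deeper"); // undefined
```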

defineShape() for reusable configs

When a shape is worth committing to disk (or importing across files), defineShape is the typed authoring helper:

TypeScript
import { defineShape, shape } from "@sizls/pluck";
import { z } from "zod";

export const gameShape = defineShape({
  schema: z.object({
    home_score: z.number(),
    away_score: z.number(),
    total: z.number(),
  }),
  map: {
    "game.homeTeam.score": "home_score",
    "game.awayTeam.score": "away_score",
  },
  compute: {
    total: (d) => Number(d.home_score) + Number(d.away_score),
  },
});

// Later:
const result = shape(apiResponse, gameShape);
// result.data narrows to { home_score: number; away_score: number; total: number }

defineShape is generic-forwarded, so the Zod schema's inferred type flows through to result.data without any manual casting.

Typed pluck<T>(uri, { shape }) at the top level

The same inference works one level up – call pluck() with a Zod-backed shape.schema and result.data is typed as z.infer<typeof schema> without a separate shape() step. No as casts, no non-null assertions.

TypeScript
import { pluck } from "@sizls/pluck";
import { z } from "zod";

const Post = z.object({
  title: z.string(),
  author: z.string(),
  publishedAt: z.string(),
});

const { data, shape } = await pluck("https://blog.example.com/posts/42", {
  shape: { schema: Post },
});

if (shape?.valid) {
  // data is z.infer<typeof Post> – typed end-to-end, no cast
  console.log(data.title, data.author);
}

When shape is omitted, data stays Record<string, unknown> – old callers see no change.


Drift detection

The onDrift callback is the thing no other pipeline library ships. It fires in two cases:

  • Success-drift – validation passed but strict mode stripped fields. You get a list of keys that were present in the input but NOT in the schema. Useful for spotting a new field a vendor added (maybe a PII column you want to catch) before someone uses it in production.
  • Failure-drift – validation failed. Some required field went missing, a type flipped, a new required field appeared upstream. You get the Zod-derived paths and messages.

TypeScript
shape(input, {
  schema: productSchema,
  onDrift: (stripped, errors) => {
    if (errors) {
      // Upstream broke – page oncall.
      slack.post("#oncall", `Shape failed: ${errors.map((e) => e.path).join(", ")}`);
    } else {
      // Upstream added new fields – we're stripping them silently.
      slack.post("#data-drift", `Stripped: ${stripped.join(", ")}`);
    }
  },
});

Prefer a stderr warning over wiring a callback? Set warnOnDrift: true:

[Pluck] Shape stripped 2 fields: [_reasoning, _confidence]. Add to schema or set strict: false.
[Pluck] Shape validation failed with 1 error: [game.total]. Schema drift or upstream API change?
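The branch inside onDrift above – errors mean failure-drift, otherwise success-drift – can be factored into a tiny pure function that's easy to unit-test. A sketch (the `driftReport` helper and channel names are illustrative, not part of Pluck's API):

```typescript
type DriftError = { path: string; message: string };

// Route a drift event to the right channel with a formatted message.
function driftReport(
  stripped: string[],
  errors?: DriftError[],
): { channel: string; text: string } {
  if (errors && errors.length > 0) {
    // Failure-drift: validation broke – page oncall.
    return {
      channel: "#oncall",
      text: `Shape failed: ${errors.map((e) => e.path).join(", ")}`,
    };
  }
  // Success-drift: new fields were silently stripped – log for review.
  return { channel: "#data-drift", text: `Stripped: ${stripped.join(", ")}` };
}
```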

This pair – Extract for loose pulls, Shape for strict contracts – covers what teams have historically stapled together from Zod, MSW, and custom scripts. Pluck makes it one line.


Schema inference from a live API

Writing Zod schemas by hand for a 40-field API payload is tedious. The CLI can do it for you:

Shell
pluck shape --from-api https://api.github.com/repos/vercel/next.js -o github-repo.shape.ts

That emits ready-to-commit TypeScript:

TypeScript
import { z } from "zod";

export const githubRepo = z.object({
  id: z.number(),
  name: z.string(),
  full_name: z.string(),
  owner: z.object({
    login: z.string(),
    id: z.number(),
    avatar_url: z.string().url(),
  }),
  created_at: z.string().datetime(),
  // …
}).partial();

export type GithubRepo = z.infer<typeof githubRepo>;

The CLI currently fetches a single response. For better optionality/nullability detection, sample multiple live responses programmatically with inferZodSchema({ samples: [resp1, resp2, resp3] }). Either way, the generated file is a starting point – commit it, edit it, tighten the constraints.
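Single-sample inference is conceptually simple: map each JSON value to a Zod expression string and recurse into objects. A rough sketch of that idea (illustrative only, not the CLI's actual algorithm):

```typescript
// Emit Zod source text for a single sample value. Objects recurse with
// increasing indentation; arrays infer from their first element.
function inferZodSource(value: unknown, indent = ""): string {
  if (typeof value === "string") return "z.string()";
  if (typeof value === "number") return "z.number()";
  if (typeof value === "boolean") return "z.boolean()";
  if (value === null) return "z.null()";
  if (Array.isArray(value)) {
    const inner = value.length > 0 ? inferZodSource(value[0], indent) : "z.unknown()";
    return `z.array(${inner})`;
  }
  const next = indent + "  ";
  const fields = Object.entries(value as Record<string, unknown>)
    .map(([k, v]) => `${next}${k}: ${inferZodSource(v, next)},`)
    .join("\n");
  return `z.object({\n${fields}\n${indent}})`;
}

console.log(inferZodSource({ id: 1, owner: { login: "vercel" } }));
// z.object({
//   id: z.number(),
//   owner: z.object({
//     login: z.string(),
//   }),
// })
```

One sample can't see optional or nullable fields, which is exactly why multi-sample inference gives better results.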


Social ETL templates

Pluck ships pre-built Zod schemas for common social APIs so the first 90% of the shape work is already done:

TypeScript
import { pluck, shape, spotifyTrack } from "@sizls/pluck";

const raw = await pluck("spotify://track/3n3Ppam7vgaVa1iaRUc9Lp");
const typed = shape(raw.data!, { schema: spotifyTrack });
// typed.data is z.infer<typeof spotifyTrack> – fully typed for upsert

Built-in templates:

  • spotifyTrack – from the Spotify connector
  • twitchClip – from the Twitch connector
  • instagramPost – Instagram oEmbed
  • tiktokPost – TikTok oEmbed
  • vimeoVideo – Vimeo oEmbed
  • twitterTweet – Twitter syndication

Every template is just a plain Zod schema – extend it with .extend({ … }), narrow it with .pick({ … }), whatever you need.


The --diff flag

When you're iterating on a shape config, you want to see what it actually does to your data. formatShapeDiff(result) returns a coloured summary:

Shell
pluck shape --diff ./game.shape.ts ./api-response.json
 Shape diff
   kept     3  home_score · away_score · total
   renamed  2  game.homeTeam.score → home_score · game.awayTeam.score → away_score
   computed 1  total
   stripped 2  _llm · _confidence

Perfect for "why did my upsert shrink from 40 fields to 3?" debugging.
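The same summary can be built from any ShapeResult-like value. A sketch of the counting logic (`summarizeDiff` is a hypothetical helper, not Pluck's formatShapeDiff, and skips the colouring):

```typescript
// Count and list what the shape did: fields kept in the output, renames
// applied by the map, compute-derived fields, and strict-mode strips.
function summarizeDiff(result: {
  data: Record<string, unknown>;
  renamed?: { from: string; to: string }[];
  computed?: string[];
  stripped?: string[];
}): string {
  const renamed = result.renamed ?? [];
  const computed = result.computed ?? [];
  const stripped = result.stripped ?? [];
  return [
    `kept     ${Object.keys(result.data).length}  ${Object.keys(result.data).join(" · ")}`,
    `renamed  ${renamed.length}  ${renamed.map((r) => `${r.from} → ${r.to}`).join(" · ")}`,
    `computed ${computed.length}  ${computed.join(" · ")}`,
    `stripped ${stripped.length}  ${stripped.join(" · ")}`,
  ].join("\n");
}
```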


ShapeResult

Every call returns the same shape:

TypeScript
interface ShapeResult<T> {
  data: T;                                 // Validated, narrowed data
  valid: boolean;
  stripped?: string[];                     // Strict-mode removed keys
  errors?: ShapeError[];                   // Zod errors on failure
  renamed?: { from: string; to: string }[];
  computed?: string[];
  provenance?: Record<string, FieldProvenance>; // Per-field lineage
}

interface FieldProvenance {
  via: "map" | "compute" | "passthrough";
  from?: string; // source dot-path for "map", source key for "passthrough"
}

At runtime, data is undefined when valid is false (the TypeScript type stays T to keep the happy path clean – always branch on result.valid before reading).

Per-field provenance

provenance answers "where did this value come from?" for every field that made it into result.data:

TypeScript
const result = shape(
  { game: { homeTeam: { score: 3 }, awayTeam: { score: 1 } } },
  {
    schema: z.object({
      home_score: z.number(),
      away_score: z.number(),
      total: z.number(),
    }),
    map: {
      "game.homeTeam.score": "home_score",
      "game.awayTeam.score": "away_score",
    },
    compute: { total: (d) => Number(d.home_score) + Number(d.away_score) },
  },
);

result.provenance;
// {
//   home_score: { via: "map", from: "game.homeTeam.score" },
//   away_score: { via: "map", from: "game.awayTeam.score" },
//   total:      { via: "compute" },
// }

Use it to populate traces, to answer "why is this field 3?" when you're debugging a bad upsert, or to feed telemetry that tracks which input paths your pipeline actually reads. Present only when valid is true.
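Conceptually, provenance is just a record written down as each output field is produced. A standalone sketch of that bookkeeping (illustrative; `buildProvenance` is a hypothetical helper, not part of Pluck's API):

```typescript
type FieldProvenance = {
  via: "map" | "compute" | "passthrough";
  from?: string; // source dot-path for "map", source key for "passthrough"
};

// Record where each output field came from: mapped fields carry their
// source dot-path, computed fields have no source, passthrough fields
// point back at the input key of the same name.
function buildProvenance(
  mapped: Record<string, string>, // output key → source dot-path
  computed: string[],             // output keys produced by compute
  passthrough: string[],          // output keys copied through as-is
): Record<string, FieldProvenance> {
  const prov: Record<string, FieldProvenance> = {};
  for (const [to, from] of Object.entries(mapped)) prov[to] = { via: "map", from };
  for (const key of computed) prov[key] = { via: "compute" };
  for (const key of passthrough) prov[key] = { via: "passthrough", from: key };
  return prov;
}
```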


Full runnable example

The smallest end-to-end shape program: a typed pluck<T>({ shape }) call against a Zod schema, an inferred return type, and drift detection on the strict-mode stripped keys.


What's next

  • Act – once data is shaped, take action with a signed receipt.
  • Recipe: Shape Spotify – a full worked example.
  • Sense – the DSP side of the pipeline.
