Crawl

@flow-state-dev/tools — Crawl a website starting from a root URL, following links breadth-first, and return markdown for each page found.

Why this exists

Sometimes you need more than a single page: documentation sites, knowledge bases, competitor analysis, research across an entire domain. tools.crawl handles the mechanics: link discovery, BFS traversal, depth limits, page limits, URL filtering, and rate limiting.

Two providers:

| Provider | How it works | Depth control | Page limits | URL filtering | Env var |
| --- | --- | --- | --- | --- | --- |
| Firecrawl | Managed API (async crawl) | Yes | Yes | Regex patterns | FIRECRAWL_API_KEY |
| Built-in | Local BFS with fetch + Readability | Yes | Yes | Glob patterns | None needed |

The built-in crawler always works. It follows same-origin links using BFS, respects your depth and page limits, and rate-limits to 1 request per second to be polite. For production crawling at scale, Firecrawl handles JS rendering, anti-bot, and parallel crawling.

Basic usage

import { generator } from "@flow-state-dev/core";
import { tools } from "@flow-state-dev/tools";

const docsCrawler = generator({
  name: "docs-crawler",
  model: "anthropic/claude-sonnet-4-6",
  prompt: "Crawl documentation sites and build a knowledge summary.",
  tools: [tools.crawl()],
});

The LLM provides a root URL and optionally specifies how many pages and how deep to crawl. The tool returns markdown content for every page found.

Configuration

tools.crawl({
  // Force a specific provider
  provider: "firecrawl", // "firecrawl" | "builtin"

  // Default limits (LLM can override per call)
  maxPages: 50,
  maxDepth: 3,

  // URL pattern filtering
  includePatterns: ["/docs/**", "/api/**"],
  excludePatterns: ["/docs/changelog/**", "/admin/**"],

  // JS rendering (Firecrawl only)
  waitForJS: true,

  // Explicit API keys
  keys: {
    firecrawl: "fc-...",
  },
})

What the LLM controls vs. what you configure

Some parameters are set at definition time (your code), others at call time (the LLM decides):

| Parameter | Who sets it | Why |
| --- | --- | --- |
| maxPages | Both: config sets default, LLM can override | Scope is a semantic decision. A research agent might want 50 pages, a fact-checker wants 5. |
| maxDepth | Both: config sets default, LLM can override | Same reasoning. |
| includePatterns | You (config only) | Safety boundary. You decide which parts of a site are fair game. |
| excludePatterns | You (config only) | Safety boundary. Keep admin, auth, and irrelevant sections out. |
| waitForJS | You (config only) | Infrastructure concern, not a semantic decision. |
| provider | You (config only) | Infrastructure concern. |
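One way to picture the split: definition-time config supplies defaults, and call-time arguments from the LLM may override only the overridable limits. A minimal sketch of that merge logic (the names `CrawlConfig`, `CrawlCallArgs`, and `resolveLimits` are illustrative, not part of the package):

```typescript
// Illustrative types: the real tool's internals may differ.
interface CrawlConfig {
  maxPages: number;
  maxDepth: number;
  includePatterns: string[]; // config-only: safety boundary
  excludePatterns: string[]; // config-only: safety boundary
}

interface CrawlCallArgs {
  rootUrl: string;
  maxPages?: number; // LLM may override
  maxDepth?: number; // LLM may override
}

// Call-time values win for the overridable limits; pattern filters
// always come from config, so the LLM cannot widen the crawl scope.
function resolveLimits(config: CrawlConfig, args: CrawlCallArgs) {
  return {
    maxPages: args.maxPages ?? config.maxPages,
    maxDepth: args.maxDepth ?? config.maxDepth,
    includePatterns: config.includePatterns,
    excludePatterns: config.excludePatterns,
  };
}
```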

Provider resolution

FIRECRAWL_API_KEY set?  →  Firecrawl (async crawl, JS rendering, parallel)
Otherwise               →  Built-in BFS (sequential, static HTML, rate-limited)

Like tools.fetch(), the crawl tool never throws "no provider available".
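The fallback chain above can be sketched as a small function. This is an assumption about how resolution behaves based on the description, not the package's actual internals:

```typescript
type CrawlProvider = "firecrawl" | "builtin";

// Resolution never fails: the built-in BFS crawler is always available,
// so there is no "no provider available" error path.
function resolveProvider(env: Record<string, string | undefined>): CrawlProvider {
  if (env.FIRECRAWL_API_KEY) return "firecrawl";
  return "builtin";
}
```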

Output shape

{
  rootUrl: "https://docs.example.com",
  pages: [
    {
      url: "https://docs.example.com",
      title: "Documentation Home",
      markdown: "# Docs\n\nWelcome to...",
      metadata: {
        statusCode: 200,
        contentType: "text/html",
        wordCount: 523,
      },
      source: "builtin"
    },
    {
      url: "https://docs.example.com/getting-started",
      title: "Getting Started",
      markdown: "# Getting Started\n\n...",
      metadata: { statusCode: 200, contentType: "text/html", wordCount: 1847 },
      source: "builtin"
    },
    // ... more pages
  ],
  totalPages: 12,
  crawlDepth: 2,
  source: "builtin"
}

Each page in the pages array has the same shape as a tools.fetch() result.
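A common next step is stitching the crawl into a single document, e.g. to feed a summarizer. A sketch using types mirroring the output shape above (the interface names and the `toSingleDocument` helper are illustrative, not exported by the package):

```typescript
// Field names taken from the output example above.
interface CrawlPage {
  url: string;
  title: string;
  markdown: string;
  metadata: { statusCode: number; contentType: string; wordCount: number };
  source: string;
}

interface CrawlResult {
  rootUrl: string;
  pages: CrawlPage[];
  totalPages: number;
  crawlDepth: number;
  source: string;
}

// Concatenate every page's markdown, with an HTML comment noting
// which URL each section came from.
function toSingleDocument(result: CrawlResult): string {
  return result.pages
    .map((p) => `<!-- ${p.url} -->\n${p.markdown}`)
    .join("\n\n---\n\n");
}
```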

How the built-in crawler works

The built-in provider uses breadth-first search:

  1. Start at the root URL
  2. Fetch the page, extract content with Readability + Turndown
  3. Parse all <a href> links from the HTML
  4. Filter to same-origin only (no external links)
  5. Apply include/exclude patterns
  6. Add new URLs to the queue at depth + 1
  7. Continue until maxPages or maxDepth is reached
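The traversal above can be sketched without the network by crawling an in-memory link graph (fetching, Readability extraction, and filtering are stubbed out; this illustrates the BFS loop, not the package's actual implementation):

```typescript
// linkGraph maps each URL to the links found on that page.
function bfsCrawl(
  root: string,
  linkGraph: Map<string, string[]>,
  maxPages: number,
  maxDepth: number,
): string[] {
  const visited = new Set<string>([root]);
  const queue: Array<{ url: string; depth: number }> = [{ url: root, depth: 0 }];
  const crawled: string[] = [];

  while (queue.length > 0 && crawled.length < maxPages) {
    const { url, depth } = queue.shift()!;
    crawled.push(url); // in the real tool: fetch + extract markdown here

    if (depth >= maxDepth) continue; // don't enqueue links beyond maxDepth
    for (const link of linkGraph.get(url) ?? []) {
      if (!visited.has(link)) {
        visited.add(link); // mark early so the same URL is queued once
        queue.push({ url: link, depth: depth + 1 });
      }
    }
  }
  return crawled;
}
```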

Key behaviors:

  • Same-origin only — the crawler won't follow links to external sites
  • 1-second delay between requests to avoid hammering servers
  • URL normalization — trailing slashes and hash fragments are stripped to prevent duplicate visits
  • Graceful failure — if a single page fails (404, timeout, non-HTML), the crawler skips it and continues
  • Non-HTML skipped — PDFs, images, and other non-HTML responses are ignored
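The normalization and same-origin rules are simple to sketch with the standard `URL` API. These helpers are an assumption about the behavior described above, not the package's code:

```typescript
// Strip hash fragments and trailing slashes so the same page
// is not visited twice under two spellings of its URL.
function normalizeUrl(raw: string): string {
  const u = new URL(raw);
  u.hash = "";
  let s = u.toString();
  if (s.endsWith("/") && u.pathname !== "/") s = s.slice(0, -1);
  return s;
}

// Same-origin check: scheme + host + port must all match the root.
function isSameOrigin(link: string, root: string): boolean {
  return new URL(link).origin === new URL(root).origin;
}
```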

URL pattern matching

Patterns use glob syntax matched against the URL path:

tools.crawl({
  includePatterns: ["/docs/**"], // Only crawl /docs/ and below
  excludePatterns: ["/docs/v1/**"], // But skip the old v1 docs
})

| Pattern | Matches | Doesn't match |
| --- | --- | --- |
| /docs/** | /docs/intro, /docs/api/ref | /blog/post |
| /docs/* | /docs/intro | /docs/api/ref (too deep) |
| /blog/2026-* | /blog/2026-01-post | /blog/2025-12-post |

If includePatterns is empty (the default), all same-origin pages are included. If excludePatterns is empty, nothing is excluded.
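The two-wildcard behavior in the table (`*` within one path segment, `**` across segments) can be approximated with a small glob-to-regex converter. A sketch under that assumption; the package may use a full glob library with more features:

```typescript
// Convert a glob pattern to a RegExp: * matches within one segment,
// ** matches across segments. Other regex metacharacters are escaped.
function globToRegex(pattern: string): RegExp {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters
    .replace(/\*\*/g, "\u0000")            // placeholder for **
    .replace(/\*/g, "[^/]*")               // * stays within one segment
    .replace(/\u0000/g, ".*");             // ** crosses segments
  return new RegExp(`^${escaped}$`);
}

// Empty include list means "include everything"; exclude always wins.
function pathAllowed(path: string, include: string[], exclude: string[]): boolean {
  const included = include.length === 0 || include.some((p) => globToRegex(p).test(path));
  const excluded = exclude.some((p) => globToRegex(p).test(path));
  return included && !excluded;
}
```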

Direct provider constructors

import { firecrawlCrawl, builtinCrawl } from "@flow-state-dev/tools";

// Always use Firecrawl (throws if no API key)
const firecrawlOnly = firecrawlCrawl({ keys: { firecrawl: "fc-..." } });

// Always use built-in BFS
const builtinOnly = builtinCrawl();

Full example: search, fetch, crawl

All three tools compose naturally. The LLM picks the right tool for the task.

import { generator } from "@flow-state-dev/core";
import { tools } from "@flow-state-dev/tools";

const researcher = generator({
  name: "researcher",
  model: "anthropic/claude-sonnet-4-6",
  prompt: `You are a research assistant.
Use search to find relevant sources.
Use fetch to read individual pages in full.
Use crawl when you need to understand an entire site or documentation set.`,
  tools: [
    tools.search(),
    tools.fetch(),
    tools.crawl({ maxPages: 30, maxDepth: 2 }),
  ],
});

Limitations of the built-in crawler

The built-in crawler is for development and prototyping. It works well for static HTML sites (documentation, blogs, wikis). It won't handle:

  • JavaScript-rendered pages — returns whatever the server sends as static HTML
  • Anti-bot protection — sites behind Cloudflare or similar will block it
  • Authentication — no cookie or session handling
  • robots.txt — not respected in Phase 1 (the 1-second rate limit provides basic politeness)

For production crawling, use Firecrawl. It handles all of the above.

Error handling

| Scenario | Behavior |
| --- | --- |
| Root URL fails | Throws an error |
| Individual page fails | Skips it, continues crawling other pages |
| Non-HTML content | Skips it (only text/html is processed) |
| External links | Ignored (same-origin only) |
| Infinite calendar/pagination | maxPages limit prevents runaway crawls |
| Firecrawl timeout | Firecrawl SDK handles polling internally |

Next steps