Crawl

@flow-state-dev/tools — Crawl a website starting from a root URL, following links breadth-first, and return markdown for each page found.

Why this exists

Sometimes you need more than a single page: documentation sites, knowledge bases, competitor analysis, research across an entire domain. tools.crawl handles the mechanics: link discovery, BFS traversal, depth limits, page limits, URL filtering, and rate limiting.

Two providers:

| Provider | How it works | Depth control | Page limits | URL filtering | Env var |
| --- | --- | --- | --- | --- | --- |
| Firecrawl | Managed API (async crawl) | Yes | Yes | Regex patterns | FIRECRAWL_API_KEY |
| Built-in | Local BFS with fetch + Readability | Yes | Yes | Glob patterns | None needed |

The built-in crawler always works. It follows same-origin links using BFS, respects your depth and page limits, and rate-limits to 1 request per second to be polite. For production crawling at scale, Firecrawl handles JS rendering, anti-bot, and parallel crawling.

Basic usage

import { generator } from "@flow-state-dev/core";
import { tools } from "@flow-state-dev/tools";

const docsCrawler = generator({
  name: "docs-crawler",
  model: "anthropic/claude-sonnet-4-6",
  prompt: "Crawl documentation sites and build a knowledge summary.",
  tools: [tools.crawl()],
});

The LLM provides a root URL and optionally specifies how many pages and how deep to crawl. The tool returns markdown content for every page found.

Configuration

tools.crawl({
  // Force a specific provider
  provider: "firecrawl", // "firecrawl" | "builtin"

  // Default limits (LLM can override per call)
  maxPages: 50,
  maxDepth: 3,

  // URL pattern filtering
  includePatterns: ["/docs/**", "/api/**"],
  excludePatterns: ["/docs/changelog/**", "/admin/**"],

  // JS rendering (Firecrawl only)
  waitForJS: true,

  // Explicit API keys
  keys: {
    firecrawl: "fc-...",
  },
})

What the LLM controls vs. what you configure

Some parameters are set at definition time (your code), others at call time (the LLM decides):

| Parameter | Who sets it | Why |
| --- | --- | --- |
| maxPages | Both: config sets default, LLM can override | Scope is a semantic decision. A research agent might want 50 pages, a fact-checker wants 5. |
| maxDepth | Both: config sets default, LLM can override | Same reasoning. |
| includePatterns | You (config only) | Safety boundary. You decide which parts of a site are fair game. |
| excludePatterns | You (config only) | Safety boundary. Keep admin, auth, and irrelevant sections out. |
| waitForJS | You (config only) | Infrastructure concern, not a semantic decision. |
| provider | You (config only) | Infrastructure concern. |
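One way to picture the split: definition-time config supplies defaults, and call-time arguments from the LLM may override only the overridable limits. A minimal sketch of that merge logic (the names `CrawlConfig`, `CrawlCallArgs`, and `resolveLimits` are illustrative, not part of the package):

```typescript
// Illustrative types: the real tool's internals may differ.
interface CrawlConfig {
  maxPages: number;
  maxDepth: number;
  includePatterns: string[]; // config-only: safety boundary
  excludePatterns: string[]; // config-only: safety boundary
}

interface CrawlCallArgs {
  rootUrl: string;
  maxPages?: number; // LLM may override
  maxDepth?: number; // LLM may override
}

// Call-time values win for the overridable limits; pattern filters
// always come from config, so the LLM cannot widen the crawl scope.
function resolveLimits(config: CrawlConfig, args: CrawlCallArgs) {
  return {
    maxPages: args.maxPages ?? config.maxPages,
    maxDepth: args.maxDepth ?? config.maxDepth,
    includePatterns: config.includePatterns,
    excludePatterns: config.excludePatterns,
  };
}
```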

Provider resolution

FIRECRAWL_API_KEY set?  →  Firecrawl (async crawl, JS rendering, parallel)
Otherwise               →  Built-in BFS (sequential, static HTML, rate-limited)

Like tools.fetch(), the crawl tool never throws "no provider available".
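The fallback chain above can be sketched as a small function. This is an assumption about how resolution behaves based on the description, not the package's actual internals:

```typescript
type CrawlProvider = "firecrawl" | "builtin";

// Resolution never fails: the built-in BFS crawler is always available,
// so there is no "no provider available" error path.
function resolveProvider(env: Record<string, string | undefined>): CrawlProvider {
  if (env.FIRECRAWL_API_KEY) return "firecrawl";
  return "builtin";
}
```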

Output shape

{
  rootUrl: "https://docs.example.com",
  pages: [
    {
      url: "https://docs.example.com",
      title: "Documentation Home",
      markdown: "# Docs\n\nWelcome to...",
      metadata: {
        statusCode: 200,
        contentType: "text/html",
        wordCount: 523,
      },
      source: "builtin"
    },
    {
      url: "https://docs.example.com/getting-started",
      title: "Getting Started",
      markdown: "# Getting Started\n\n...",
      metadata: { statusCode: 200, contentType: "text/html", wordCount: 1847 },
      source: "builtin"
    },
    // ... more pages
  ],
  totalPages: 12,
  crawlDepth: 2,
  source: "builtin"
}

Each page in the pages array has the same shape as a tools.fetch() result.
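A common next step is stitching the crawl into a single document, e.g. to feed a summarizer. A sketch using types mirroring the output shape above (the interface names and the `toSingleDocument` helper are illustrative, not exported by the package):

```typescript
// Field names taken from the output example above.
interface CrawlPage {
  url: string;
  title: string;
  markdown: string;
  metadata: { statusCode: number; contentType: string; wordCount: number };
  source: string;
}

interface CrawlResult {
  rootUrl: string;
  pages: CrawlPage[];
  totalPages: number;
  crawlDepth: number;
  source: string;
}

// Concatenate every page's markdown, with an HTML comment noting
// which URL each section came from.
function toSingleDocument(result: CrawlResult): string {
  return result.pages
    .map((p) => `<!-- ${p.url} -->\n${p.markdown}`)
    .join("\n\n---\n\n");
}
```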

How the built-in crawler works

The built-in provider uses breadth-first search:

  1. Start at the root URL
  2. Fetch the page, extract content with Readability + Turndown
  3. Parse all <a href> links from the HTML
  4. Filter to same-origin only (no external links)
  5. Apply include/exclude patterns
  6. Add new URLs to the queue at depth + 1
  7. Continue until maxPages or maxDepth is reached
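The traversal above can be sketched without the network by crawling an in-memory link graph (fetching, Readability extraction, and filtering are stubbed out; this illustrates the BFS loop, not the package's actual implementation):

```typescript
// linkGraph maps each URL to the links found on that page.
function bfsCrawl(
  root: string,
  linkGraph: Map<string, string[]>,
  maxPages: number,
  maxDepth: number,
): string[] {
  const visited = new Set<string>([root]);
  const queue: Array<{ url: string; depth: number }> = [{ url: root, depth: 0 }];
  const crawled: string[] = [];

  while (queue.length > 0 && crawled.length < maxPages) {
    const { url, depth } = queue.shift()!;
    crawled.push(url); // in the real tool: fetch + extract markdown here

    if (depth >= maxDepth) continue; // don't enqueue links beyond maxDepth
    for (const link of linkGraph.get(url) ?? []) {
      if (!visited.has(link)) {
        visited.add(link); // mark early so the same URL is queued once
        queue.push({ url: link, depth: depth + 1 });
      }
    }
  }
  return crawled;
}
```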

Key behaviors:

  • Same-origin only — the crawler won't follow links to external sites
  • 1-second delay between requests to avoid hammering servers
  • URL normalization — trailing slashes and hash fragments are stripped to prevent duplicate visits
  • Graceful failure — if a single page fails (404, timeout, non-HTML), the crawler skips it and continues
  • Non-HTML skipped — PDFs, images, and other non-HTML responses are ignored
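The normalization and same-origin rules are simple to sketch with the standard `URL` API. These helpers are an assumption about the behavior described above, not the package's code:

```typescript
// Strip hash fragments and trailing slashes so the same page
// is not visited twice under two spellings of its URL.
function normalizeUrl(raw: string): string {
  const u = new URL(raw);
  u.hash = "";
  let s = u.toString();
  if (s.endsWith("/") && u.pathname !== "/") s = s.slice(0, -1);
  return s;
}

// Same-origin check: scheme + host + port must all match the root.
function isSameOrigin(link: string, root: string): boolean {
  return new URL(link).origin === new URL(root).origin;
}
```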

URL pattern matching

Patterns use glob syntax matched against the URL path:

tools.crawl({
  includePatterns: ["/docs/**"], // Only crawl /docs/ and below
  excludePatterns: ["/docs/v1/**"], // But skip the old v1 docs
})

| Pattern | Matches | Doesn't match |
| --- | --- | --- |
| /docs/** | /docs/intro, /docs/api/ref | /blog/post |
| /docs/* | /docs/intro | /docs/api/ref (too deep) |
| /blog/2026-* | /blog/2026-01-post | /blog/2025-12-post |

If includePatterns is empty (the default), all same-origin pages are included. If excludePatterns is empty, nothing is excluded.
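The two-wildcard behavior in the table (`*` within one path segment, `**` across segments) can be approximated with a small glob-to-regex converter. A sketch under that assumption; the package may use a full glob library with more features:

```typescript
// Convert a glob pattern to a RegExp: * matches within one segment,
// ** matches across segments. Other regex metacharacters are escaped.
function globToRegex(pattern: string): RegExp {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters
    .replace(/\*\*/g, "\u0000")            // placeholder for **
    .replace(/\*/g, "[^/]*")               // * stays within one segment
    .replace(/\u0000/g, ".*");             // ** crosses segments
  return new RegExp(`^${escaped}$`);
}

// Empty include list means "include everything"; exclude always wins.
function pathAllowed(path: string, include: string[], exclude: string[]): boolean {
  const included = include.length === 0 || include.some((p) => globToRegex(p).test(path));
  const excluded = exclude.some((p) => globToRegex(p).test(path));
  return included && !excluded;
}
```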

Direct provider constructors

import { firecrawlCrawl, builtinCrawl } from "@flow-state-dev/tools";

// Always use Firecrawl (throws if no API key)
const firecrawlOnly = firecrawlCrawl({ keys: { firecrawl: "fc-..." } });

// Always use built-in BFS
const builtinOnly = builtinCrawl();

Full example: search, fetch, crawl

All three tools compose naturally. The LLM picks the right tool for the task.

import { generator } from "@flow-state-dev/core";
import { tools } from "@flow-state-dev/tools";

const researcher = generator({
  name: "researcher",
  model: "anthropic/claude-sonnet-4-6",
  prompt: `You are a research assistant.
Use search to find relevant sources.
Use fetch to read individual pages in full.
Use crawl when you need to understand an entire site or documentation set.`,
  tools: [
    tools.search(),
    tools.fetch(),
    tools.crawl({ maxPages: 30, maxDepth: 2 }),
  ],
});

Limitations of the built-in crawler

The built-in crawler is for development and prototyping. It works well for static HTML sites (documentation, blogs, wikis). It won't handle:

  • JavaScript-rendered pages — returns whatever the server sends as static HTML
  • Anti-bot protection — sites behind Cloudflare or similar will block it
  • Authentication — no cookie or session handling
  • robots.txt — not respected in Phase 1 (the 1-second rate limit provides basic politeness)

For production crawling, use Firecrawl. It handles all of the above.

Error handling

| Scenario | Behavior |
| --- | --- |
| Root URL fails | Throws an error |
| Individual page fails | Skips it, continues crawling other pages |
| Non-HTML content | Skips it (only text/html is processed) |
| External links | Ignored (same-origin only) |
| Infinite calendar/pagination | maxPages limit prevents runaway crawls |
| Firecrawl timeout | Firecrawl SDK handles polling internally |

Next steps