openclaw/docs/refactor/firecrawl-extension.md

7.9 KiB

summary read_when title
Design for an opt-in Firecrawl extension that adds search/scrape value without hardwiring Firecrawl into core defaults
Designing Firecrawl integration work
Evaluating web_search/web_fetch plugin seams
Deciding whether Firecrawl belongs in core or as an extension
Firecrawl Extension Design

Firecrawl Extension Design

Goal

Ship Firecrawl as an opt-in extension that adds:

  • explicit Firecrawl tools for agents,
  • optional Firecrawl-backed web_search integration,
  • self-hosted support,
  • stronger security defaults than the current core fallback path,

without pushing Firecrawl into the default setup/onboarding path.

Why this shape

Recent Firecrawl issues/PRs cluster into three buckets:

  1. Release/schema drift
    • Several releases rejected tools.web.fetch.firecrawl even though docs and runtime code supported it.
  2. Security hardening
    • Current fetchFirecrawlContent() still posts to the Firecrawl endpoint with raw fetch(), while the main web-fetch path uses the SSRF guard.
  3. Product pressure
    • Users want Firecrawl-native search/scrape flows, especially for self-hosted/private setups.
    • Maintainers explicitly rejected wiring Firecrawl deeply into core defaults, setup flow, and browser behavior.

That combination argues for an extension, not more Firecrawl-specific logic in the default core path.

Design principles

  • Opt-in, vendor-scoped: no auto-enable, no setup hijack, no default tool-profile widening.
  • Extension owns Firecrawl-specific config: prefer plugin config over growing tools.web.* again.
  • Useful on day one: works even if core web_search / web_fetch seams stay unchanged.
  • Security-first: endpoint fetches use the same guarded networking posture as other web tools.
  • Self-hosted-friendly: config + env fallback, explicit base URL, no hosted-only assumptions.

Proposed extension

Plugin id: firecrawl

MVP capabilities

Register explicit tools:

  • firecrawl_search
  • firecrawl_scrape

Optional later:

  • firecrawl_crawl
  • firecrawl_map

Do not add Firecrawl browser automation in the first version. That was the part of PR #32543 that pulled Firecrawl too far into core behavior and raised the most maintainership concern.

Config shape

Use plugin-scoped config:

{
  plugins: {
    entries: {
      firecrawl: {
        enabled: true,
        config: {
          apiKey: "FIRECRAWL_API_KEY",
          baseUrl: "https://api.firecrawl.dev",
          timeoutSeconds: 60,
          maxAgeMs: 172800000,
          proxy: "auto",
          storeInCache: true,
          onlyMainContent: true,
          search: {
            enabled: true,
            defaultLimit: 5,
            sources: ["web"],
            categories: [],
            scrapeResults: false,
          },
          scrape: {
            formats: ["markdown"],
            fallbackForWebFetchLikeUse: false,
          },
        },
      },
    },
  },
}

Credential resolution

Precedence:

  1. plugins.entries.firecrawl.config.apiKey
  2. FIRECRAWL_API_KEY

Base URL precedence:

  1. plugins.entries.firecrawl.config.baseUrl
  2. FIRECRAWL_BASE_URL
  3. https://api.firecrawl.dev

Compatibility bridge

For the first release, the extension may also read existing core config at tools.web.fetch.firecrawl.* as a fallback source so existing users do not need to migrate immediately.

Write path stays plugin-local. Do not keep expanding core Firecrawl config surfaces.

Tool design

Inputs:

  • query
  • limit
  • sources
  • categories
  • scrapeResults
  • timeoutSeconds

Behavior:

  • Calls Firecrawl v2/search
  • Returns normalized OpenClaw-friendly result objects:
    • title
    • url
    • snippet
    • source
    • optional content
  • Wraps result content as untrusted external content
  • Cache key includes query + relevant provider params

Why explicit tool first:

  • Works today without changing tools.web.search.provider
  • Avoids current schema/loader constraints
  • Gives users Firecrawl value immediately

firecrawl_scrape

Inputs:

  • url
  • formats
  • onlyMainContent
  • maxAgeMs
  • proxy
  • storeInCache
  • timeoutSeconds

Behavior:

  • Calls Firecrawl v2/scrape
  • Returns markdown/text plus metadata:
    • title
    • finalUrl
    • status
    • warning
  • Wraps extracted content the same way web_fetch does
  • Shares cache semantics with web tool expectations where practical

Why explicit scrape tool:

  • Sidesteps the unresolved Readability -> Firecrawl -> basic HTML cleanup ordering bug in core web_fetch
  • Gives users a deterministic “always use Firecrawl” path for JS-heavy/bot-protected sites

What the extension should not do

  • No auto-adding browser, web_search, or web_fetch to tools.alsoAllow
  • No default onboarding step in openclaw setup
  • No Firecrawl-specific browser session lifecycle in core
  • No change to built-in web_fetch fallback semantics in the extension MVP

Phase plan

Phase 1: extension-only, no core schema changes

Implement:

  • extensions/firecrawl/
  • plugin config schema
  • firecrawl_search
  • firecrawl_scrape
  • tests for config resolution, endpoint selection, caching, error handling, and SSRF guard usage

This phase is enough to ship real user value.

Phase 2: optional web_search provider integration

Support tools.web.search.provider = "firecrawl" only after fixing two core constraints:

  1. src/plugins/web-search-providers.ts must load configured/installed web-search-provider plugins instead of a hardcoded bundled list.
  2. src/config/types.tools.ts and src/config/zod-schema.agent-runtime.ts must stop hardcoding the provider enum in a way that blocks plugin-registered ids.

Recommended shape:

  • keep built-in providers documented,
  • allow any registered plugin provider id at runtime,
  • validate provider-specific config via the provider plugin or a generic provider bag.

Phase 3: optional web_fetch provider seam

Do this only if maintainers want vendor-specific fetch backends to participate in web_fetch.

Needed core addition:

  • registerWebFetchProvider or equivalent fetch-backend seam

Without that seam, the extension should keep firecrawl_scrape as an explicit tool rather than trying to patch built-in web_fetch.

Security requirements

The extension must treat Firecrawl as a trusted operator-configured endpoint, but still harden transport:

  • Use SSRF-guarded fetch for the Firecrawl endpoint call, not raw fetch()
  • Preserve self-hosted/private-network compatibility using the same trusted-web-tools endpoint policy used elsewhere
  • Never log the API key
  • Keep endpoint/base URL resolution explicit and predictable
  • Treat Firecrawl-returned content as untrusted external content

This mirrors the intent behind the SSRF hardening PRs without assuming Firecrawl is a hostile multi-tenant surface.

Why not a skill

The repo already closed a Firecrawl skill PR in favor of ClawHub distribution. That is fine for optional user-installed prompt workflows, but it does not solve:

  • deterministic tool availability,
  • provider-grade config/credential handling,
  • self-hosted endpoint support,
  • caching,
  • stable typed outputs,
  • security review on network behavior.

This belongs as an extension, not a prompt-only skill.

Success criteria

  • Users can install/enable one extension and get reliable Firecrawl search/scrape without touching core defaults.
  • Self-hosted Firecrawl works with config/env fallback.
  • Extension endpoint fetches use guarded networking.
  • No new Firecrawl-specific core onboarding/default behavior.
  • Core can later adopt plugin-native web_search / web_fetch seams without redesigning the extension.
  1. Build firecrawl_scrape
  2. Build firecrawl_search
  3. Add docs and examples
  4. If desired, generalize web_search provider loading so the extension can back web_search
  5. Only then consider a true web_fetch provider seam