| summary | read_when | title |
| --- | --- | --- |
| Design for an opt-in Firecrawl extension that adds search/scrape value without hardwiring Firecrawl into core defaults | | Firecrawl Extension Design |

# Firecrawl Extension Design
## Goal

Ship Firecrawl as an opt-in extension that adds:

- explicit Firecrawl tools for agents,
- optional Firecrawl-backed `web_search` integration,
- self-hosted support,
- stronger security defaults than the current core fallback path,

without pushing Firecrawl into the default setup/onboarding path.
## Why this shape

Recent Firecrawl issues/PRs cluster into three buckets:

- Release/schema drift
  - Several releases rejected `tools.web.fetch.firecrawl` even though docs and runtime code supported it.
- Security hardening
  - The current `fetchFirecrawlContent()` still posts to the Firecrawl endpoint with raw `fetch()`, while the main web-fetch path uses the SSRF guard.
- Product pressure
  - Users want Firecrawl-native search/scrape flows, especially for self-hosted/private setups.
  - Maintainers explicitly rejected wiring Firecrawl deeply into core defaults, setup flow, and browser behavior.

That combination argues for an extension, not more Firecrawl-specific logic in the default core path.
## Design principles

- Opt-in, vendor-scoped: no auto-enable, no setup hijack, no default tool-profile widening.
- Extension owns Firecrawl-specific config: prefer plugin config over growing `tools.web.*` again.
- Useful on day one: works even if the core `web_search`/`web_fetch` seams stay unchanged.
- Security-first: endpoint fetches use the same guarded networking posture as other web tools.
- Self-hosted-friendly: config + env fallback, explicit base URL, no hosted-only assumptions.
## Proposed extension

Plugin id: `firecrawl`

### MVP capabilities

Register explicit tools:

- `firecrawl_search`
- `firecrawl_scrape`

Optional later:

- `firecrawl_crawl`
- `firecrawl_map`

Do not add Firecrawl browser automation in the first version. That was the part of PR #32543 that pulled Firecrawl too far into core behavior and raised the most maintainership concern.
## Config shape

Use plugin-scoped config:

```json5
{
  plugins: {
    entries: {
      firecrawl: {
        enabled: true,
        config: {
          apiKey: "FIRECRAWL_API_KEY",
          baseUrl: "https://api.firecrawl.dev",
          timeoutSeconds: 60,
          maxAgeMs: 172800000,
          proxy: "auto",
          storeInCache: true,
          onlyMainContent: true,
          search: {
            enabled: true,
            defaultLimit: 5,
            sources: ["web"],
            categories: [],
            scrapeResults: false,
          },
          scrape: {
            formats: ["markdown"],
            fallbackForWebFetchLikeUse: false,
          },
        },
      },
    },
  },
}
```
## Credential resolution

API key precedence:

1. `plugins.entries.firecrawl.config.apiKey`
2. `FIRECRAWL_API_KEY`

Base URL precedence:

1. `plugins.entries.firecrawl.config.baseUrl`
2. `FIRECRAWL_BASE_URL`
3. `https://api.firecrawl.dev`
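The precedence above can be sketched as a pair of small resolver functions. This is an illustrative sketch, not the extension's real API; the names `resolveApiKey` and `resolveBaseUrl` are hypothetical.

```typescript
// Hypothetical config slice for the firecrawl plugin entry.
interface FirecrawlPluginConfig {
  apiKey?: string;
  baseUrl?: string;
}

const DEFAULT_BASE_URL = "https://api.firecrawl.dev";

// 1. plugin config, 2. FIRECRAWL_API_KEY env var, else undefined (extension disabled-for-hosted).
function resolveApiKey(
  config: FirecrawlPluginConfig,
  env: Record<string, string | undefined>,
): string | undefined {
  return config.apiKey ?? env.FIRECRAWL_API_KEY;
}

// 1. plugin config, 2. FIRECRAWL_BASE_URL env var, 3. hosted default.
function resolveBaseUrl(
  config: FirecrawlPluginConfig,
  env: Record<string, string | undefined>,
): string {
  return config.baseUrl ?? env.FIRECRAWL_BASE_URL ?? DEFAULT_BASE_URL;
}
```

Keeping the chain short and explicit makes endpoint resolution predictable, which the security requirements below also call for.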
## Compatibility bridge

For the first release, the extension may also read the existing core config at `tools.web.fetch.firecrawl.*` as a fallback source, so existing users do not need to migrate immediately.

The write path stays plugin-local. Do not keep expanding core Firecrawl config surfaces.
## Tool design

### firecrawl_search

Inputs:

- `query`
- `limit`
- `sources`
- `categories`
- `scrapeResults`
- `timeoutSeconds`

Behavior:

- Calls Firecrawl `v2/search`.
- Returns normalized OpenClaw-friendly result objects: `title`, `url`, `snippet`, `source`, and optional `content`.
- Wraps result content as untrusted external content.
- Cache key includes the query plus relevant provider params.

Why explicit tool first:

- Works today without changing `tools.web.search.provider`.
- Avoids current schema/loader constraints.
- Gives users Firecrawl value immediately.
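The normalization and cache-key behavior might look roughly like this. The `FirecrawlSearchItem` shape and both function names are assumptions for illustration, not the real Firecrawl `v2/search` response schema.

```typescript
// Assumed shape of one provider result; the real response schema may differ.
interface FirecrawlSearchItem {
  title?: string;
  url: string;
  description?: string;
}

// Normalized, OpenClaw-friendly result object described in the doc.
interface NormalizedSearchResult {
  title: string;
  url: string;
  snippet: string;
  source: string;
  content?: string; // only populated when scrapeResults is enabled
}

function normalizeSearchResults(
  items: FirecrawlSearchItem[],
  source = "web",
): NormalizedSearchResult[] {
  return items.map((item) => ({
    title: item.title ?? item.url, // fall back to the URL when no title
    url: item.url,
    snippet: item.description ?? "",
    source,
  }));
}

// Deterministic cache key: the query plus every provider param that changes results.
function searchCacheKey(
  query: string,
  params: { limit: number; sources: string[]; categories: string[] },
): string {
  return JSON.stringify({ query, ...params });
}
```

Including `limit`, `sources`, and `categories` in the key prevents a cached 5-result response from being served to a 10-result request.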
### firecrawl_scrape

Inputs:

- `url`
- `formats`
- `onlyMainContent`
- `maxAgeMs`
- `proxy`
- `storeInCache`
- `timeoutSeconds`

Behavior:

- Calls Firecrawl `v2/scrape`.
- Returns markdown/text plus metadata: `title`, `finalUrl`, `status`, `warning`.
- Wraps extracted content the same way `web_fetch` does.
- Shares cache semantics with web tool expectations where practical.

Why explicit scrape tool:

- Sidesteps the unresolved `Readability -> Firecrawl -> basic HTML cleanup` ordering bug in core `web_fetch`.
- Gives users a deterministic "always use Firecrawl" path for JS-heavy/bot-protected sites.
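The "wraps extracted content" step could be sketched like this. The envelope markers and `wrapUntrusted` name are hypothetical; a real implementation would reuse whatever untrusted-content wrapper core `web_fetch` already applies.

```typescript
// Metadata surface the doc says firecrawl_scrape returns alongside content.
interface ScrapeResult {
  markdown: string;
  title?: string;
  finalUrl: string;
  status: number;
  warning?: string;
}

// Hypothetical envelope so agents treat scraped text as data, not instructions.
function wrapUntrusted(result: ScrapeResult): string {
  return [
    "<<<UNTRUSTED_EXTERNAL_CONTENT>>>", // illustrative marker, not a real core constant
    `source: ${result.finalUrl}`,
    result.markdown,
    "<<<END_UNTRUSTED_EXTERNAL_CONTENT>>>",
  ].join("\n");
}
```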
## What the extension should not do

- No auto-adding `browser`, `web_search`, or `web_fetch` to `tools.alsoAllow`.
- No default onboarding step in `openclaw setup`.
- No Firecrawl-specific browser session lifecycle in core.
- No change to built-in `web_fetch` fallback semantics in the extension MVP.
## Phase plan

### Phase 1: extension-only, no core schema changes

Implement:

- `extensions/firecrawl/`
- plugin config schema
- `firecrawl_search`
- `firecrawl_scrape`
- tests for config resolution, endpoint selection, caching, error handling, and SSRF guard usage

This phase is enough to ship real user value.
### Phase 2: optional web_search provider integration

Support `tools.web.search.provider = "firecrawl"` only after fixing two core constraints:

- `src/plugins/web-search-providers.ts` must load configured/installed web-search-provider plugins instead of a hardcoded bundled list.
- `src/config/types.tools.ts` and `src/config/zod-schema.agent-runtime.ts` must stop hardcoding the provider enum in a way that blocks plugin-registered ids.

Recommended shape:

- keep built-in providers documented,
- allow any registered plugin provider id at runtime,
- validate provider-specific config via the provider plugin or a generic provider bag.
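A runtime registry along these lines would satisfy the "allow any registered plugin provider id" recommendation. This is a sketch under assumed names; `WebSearchProviderRegistry` is not existing core code.

```typescript
// A provider is just an async search function in this sketch.
type SearchProviderFn = (query: string) => Promise<unknown>;

class WebSearchProviderRegistry {
  private providers = new Map<string, SearchProviderFn>();

  // Plugins call this at load time; duplicate ids are a configuration error.
  register(id: string, fn: SearchProviderFn): void {
    if (this.providers.has(id)) {
      throw new Error(`web_search provider already registered: ${id}`);
    }
    this.providers.set(id, fn);
  }

  // Config validation asks the registry instead of a hardcoded enum.
  has(id: string): boolean {
    return this.providers.has(id);
  }

  resolve(id: string): SearchProviderFn {
    const fn = this.providers.get(id);
    if (!fn) throw new Error(`unknown web_search provider: ${id}`);
    return fn;
  }
}
```

Validating `tools.web.search.provider` against `registry.has(id)` at runtime is what lets the schema stop enumerating vendor ids.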
### Phase 3: optional web_fetch provider seam

Do this only if maintainers want vendor-specific fetch backends to participate in `web_fetch`.

Needed core addition:

- `registerWebFetchProvider` or an equivalent fetch-backend seam

Without that seam, the extension should keep `firecrawl_scrape` as an explicit tool rather than trying to patch the built-in `web_fetch`.
## Security requirements

The extension must treat Firecrawl as a trusted operator-configured endpoint, but still harden the transport:

- Use SSRF-guarded fetch for the Firecrawl endpoint call, not raw `fetch()`.
- Preserve self-hosted/private-network compatibility using the same trusted-web-tools endpoint policy used elsewhere.
- Never log the API key.
- Keep endpoint/base URL resolution explicit and predictable.
- Treat Firecrawl-returned content as untrusted external content.

This mirrors the intent behind the SSRF hardening PRs without assuming Firecrawl is a hostile multi-tenant surface.
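A minimal illustration of that guarded posture, assuming a simple literal-IP check: a real SSRF guard also resolves DNS and pins the resolved address, and the names here (`assertGuardedEndpoint`, the `allowPrivate` flag standing in for the trusted-web-tools endpoint policy) are hypothetical.

```typescript
// Literal private/loopback ranges; does not cover DNS names resolving to them.
function isPrivateHost(hostname: string): boolean {
  if (hostname === "localhost" || hostname === "[::1]") return true;
  const m = hostname.match(/^(\d+)\.(\d+)\.(\d+)\.(\d+)$/);
  if (!m) return false;
  const [a, b] = [Number(m[1]), Number(m[2])];
  return (
    a === 127 || a === 10 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) ||
    (a === 169 && b === 254) // link-local
  );
}

// Validate the Firecrawl endpoint before any network call. allowPrivate is the
// operator's opt-in for self-hosted instances on private networks.
function assertGuardedEndpoint(rawUrl: string, allowPrivate: boolean): URL {
  const url = new URL(rawUrl);
  if (url.protocol !== "https:" && url.protocol !== "http:") {
    throw new Error(`unsupported scheme: ${url.protocol}`);
  }
  if (!allowPrivate && isPrivateHost(url.hostname)) {
    throw new Error(`blocked private endpoint: ${url.hostname}`);
  }
  return url;
}
```

The `allowPrivate` opt-in is what keeps self-hosted Firecrawl (often on a LAN address) working while still blocking accidental loopback/internal targets by default.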
## Why not a skill
The repo already closed a Firecrawl skill PR in favor of ClawHub distribution. That is fine for optional user-installed prompt workflows, but it does not solve:
- deterministic tool availability,
- provider-grade config/credential handling,
- self-hosted endpoint support,
- caching,
- stable typed outputs,
- security review on network behavior.
This belongs as an extension, not a prompt-only skill.
## Success criteria

- Users can install/enable one extension and get reliable Firecrawl search/scrape without touching core defaults.
- Self-hosted Firecrawl works with config/env fallback.
- Extension endpoint fetches use guarded networking.
- No new Firecrawl-specific core onboarding/default behavior.
- Core can later adopt plugin-native `web_search`/`web_fetch` seams without redesigning the extension.
## Recommended implementation order

1. Build `firecrawl_scrape`.
2. Build `firecrawl_search`.
3. Add docs and examples.
4. If desired, generalize `web_search` provider loading so the extension can back `web_search`.
5. Only then consider a true `web_fetch` provider seam.