mirror of https://github.com/openclaw/openclaw.git
261 lines
7.9 KiB
Markdown
261 lines
7.9 KiB
Markdown
---
|
|
summary: "Design for an opt-in Firecrawl extension that adds search/scrape value without hardwiring Firecrawl into core defaults"
|
|
read_when:
|
|
- Designing Firecrawl integration work
|
|
- Evaluating web_search/web_fetch plugin seams
|
|
- Deciding whether Firecrawl belongs in core or as an extension
|
|
title: "Firecrawl Extension Design"
|
|
---
|
|
|
|
# Firecrawl Extension Design
|
|
|
|
## Goal
|
|
|
|
Ship Firecrawl as an **opt-in extension** that adds:
|
|
|
|
- explicit Firecrawl tools for agents,
|
|
- optional Firecrawl-backed `web_search` integration,
|
|
- self-hosted support,
|
|
- stronger security defaults than the current core fallback path,
|
|
|
|
without pushing Firecrawl into the default setup/onboarding path.
|
|
|
|
## Why this shape
|
|
|
|
Recent Firecrawl issues/PRs cluster into three buckets:
|
|
|
|
1. **Release/schema drift**
|
|
- Several releases rejected `tools.web.fetch.firecrawl` even though docs and runtime code supported it.
|
|
2. **Security hardening**
|
|
- Current `fetchFirecrawlContent()` still posts to the Firecrawl endpoint with raw `fetch()`, while the main web-fetch path uses the SSRF guard.
|
|
3. **Product pressure**
|
|
- Users want Firecrawl-native search/scrape flows, especially for self-hosted/private setups.
|
|
- Maintainers explicitly rejected wiring Firecrawl deeply into core defaults, setup flow, and browser behavior.
|
|
|
|
That combination argues for an extension, not more Firecrawl-specific logic in the default core path.
|
|
|
|
## Design principles
|
|
|
|
- **Opt-in, vendor-scoped**: no auto-enable, no setup hijack, no default tool-profile widening.
|
|
- **Extension owns Firecrawl-specific config**: prefer plugin config over growing `tools.web.*` again.
|
|
- **Useful on day one**: works even if core `web_search` / `web_fetch` seams stay unchanged.
|
|
- **Security-first**: endpoint fetches use the same guarded networking posture as other web tools.
|
|
- **Self-hosted-friendly**: config + env fallback, explicit base URL, no hosted-only assumptions.
|
|
|
|
## Proposed extension
|
|
|
|
Plugin id: `firecrawl`
|
|
|
|
### MVP capabilities
|
|
|
|
Register explicit tools:
|
|
|
|
- `firecrawl_search`
|
|
- `firecrawl_scrape`
|
|
|
|
Optional later:
|
|
|
|
- `firecrawl_crawl`
|
|
- `firecrawl_map`
|
|
|
|
Do **not** add Firecrawl browser automation in the first version. That was the part of PR #32543 that pulled Firecrawl too far into core behavior and raised the most maintainership concern.
|
|
|
|
## Config shape
|
|
|
|
Use plugin-scoped config:
|
|
|
|
```json5
|
|
{
|
|
plugins: {
|
|
entries: {
|
|
firecrawl: {
|
|
enabled: true,
|
|
config: {
|
|
apiKey: "FIRECRAWL_API_KEY",
|
|
baseUrl: "https://api.firecrawl.dev",
|
|
timeoutSeconds: 60,
|
|
maxAgeMs: 172800000,
|
|
proxy: "auto",
|
|
storeInCache: true,
|
|
onlyMainContent: true,
|
|
search: {
|
|
enabled: true,
|
|
defaultLimit: 5,
|
|
sources: ["web"],
|
|
categories: [],
|
|
scrapeResults: false,
|
|
},
|
|
scrape: {
|
|
formats: ["markdown"],
|
|
fallbackForWebFetchLikeUse: false,
|
|
},
|
|
},
|
|
},
|
|
},
|
|
},
|
|
}
|
|
```
|
|
|
|
### Credential resolution
|
|
|
|
Precedence:
|
|
|
|
1. `plugins.entries.firecrawl.config.apiKey`
|
|
2. `FIRECRAWL_API_KEY`
|
|
|
|
Base URL precedence:
|
|
|
|
1. `plugins.entries.firecrawl.config.baseUrl`
|
|
2. `FIRECRAWL_BASE_URL`
|
|
3. `https://api.firecrawl.dev`
|
|
|
|
### Compatibility bridge
|
|
|
|
For the first release, the extension may also **read** existing core config at `tools.web.fetch.firecrawl.*` as a fallback source so existing users do not need to migrate immediately.
|
|
|
|
Write path stays plugin-local. Do not keep expanding core Firecrawl config surfaces.
|
|
|
|
## Tool design
|
|
|
|
### `firecrawl_search`
|
|
|
|
Inputs:
|
|
|
|
- `query`
|
|
- `limit`
|
|
- `sources`
|
|
- `categories`
|
|
- `scrapeResults`
|
|
- `timeoutSeconds`
|
|
|
|
Behavior:
|
|
|
|
- Calls Firecrawl `v2/search`
|
|
- Returns normalized OpenClaw-friendly result objects:
|
|
- `title`
|
|
- `url`
|
|
- `snippet`
|
|
- `source`
|
|
- optional `content`
|
|
- Wraps result content as untrusted external content
|
|
- Cache key includes query + relevant provider params
|
|
|
|
Why explicit tool first:
|
|
|
|
- Works today without changing `tools.web.search.provider`
|
|
- Avoids current schema/loader constraints
|
|
- Gives users Firecrawl value immediately
|
|
|
|
### `firecrawl_scrape`
|
|
|
|
Inputs:
|
|
|
|
- `url`
|
|
- `formats`
|
|
- `onlyMainContent`
|
|
- `maxAgeMs`
|
|
- `proxy`
|
|
- `storeInCache`
|
|
- `timeoutSeconds`
|
|
|
|
Behavior:
|
|
|
|
- Calls Firecrawl `v2/scrape`
|
|
- Returns markdown/text plus metadata:
|
|
- `title`
|
|
- `finalUrl`
|
|
- `status`
|
|
- `warning`
|
|
- Wraps extracted content the same way `web_fetch` does
|
|
- Shares cache semantics with web tool expectations where practical
|
|
|
|
Why explicit scrape tool:
|
|
|
|
- Sidesteps the unresolved `Readability -> Firecrawl -> basic HTML cleanup` ordering bug in core `web_fetch`
|
|
- Gives users a deterministic “always use Firecrawl” path for JS-heavy/bot-protected sites
|
|
|
|
## What the extension should not do
|
|
|
|
- No auto-adding `browser`, `web_search`, or `web_fetch` to `tools.alsoAllow`
|
|
- No default onboarding step in `openclaw setup`
|
|
- No Firecrawl-specific browser session lifecycle in core
|
|
- No change to built-in `web_fetch` fallback semantics in the extension MVP
|
|
|
|
## Phase plan
|
|
|
|
### Phase 1: extension-only, no core schema changes
|
|
|
|
Implement:
|
|
|
|
- `extensions/firecrawl/`
|
|
- plugin config schema
|
|
- `firecrawl_search`
|
|
- `firecrawl_scrape`
|
|
- tests for config resolution, endpoint selection, caching, error handling, and SSRF guard usage
|
|
|
|
This phase is enough to ship real user value.
|
|
|
|
### Phase 2: optional `web_search` provider integration
|
|
|
|
Support `tools.web.search.provider = "firecrawl"` only after fixing two core constraints:
|
|
|
|
1. `src/plugins/web-search-providers.ts` must load configured/installed web-search-provider plugins instead of a hardcoded bundled list.
|
|
2. `src/config/types.tools.ts` and `src/config/zod-schema.agent-runtime.ts` must stop hardcoding the provider enum in a way that blocks plugin-registered ids.
|
|
|
|
Recommended shape:
|
|
|
|
- keep built-in providers documented,
|
|
- allow any registered plugin provider id at runtime,
|
|
- validate provider-specific config via the provider plugin or a generic provider bag.
|
|
|
|
### Phase 3: optional `web_fetch` provider seam
|
|
|
|
Do this only if maintainers want vendor-specific fetch backends to participate in `web_fetch`.
|
|
|
|
Needed core addition:
|
|
|
|
- `registerWebFetchProvider` or equivalent fetch-backend seam
|
|
|
|
Without that seam, the extension should keep `firecrawl_scrape` as an explicit tool rather than trying to patch built-in `web_fetch`.
|
|
|
|
## Security requirements
|
|
|
|
The extension must treat Firecrawl as a **trusted operator-configured endpoint**, but still harden transport:
|
|
|
|
- Use SSRF-guarded fetch for the Firecrawl endpoint call, not raw `fetch()`
|
|
- Preserve self-hosted/private-network compatibility using the same trusted-web-tools endpoint policy used elsewhere
|
|
- Never log the API key
|
|
- Keep endpoint/base URL resolution explicit and predictable
|
|
- Treat Firecrawl-returned content as untrusted external content
|
|
|
|
This mirrors the intent behind the SSRF hardening PRs without assuming Firecrawl is a hostile multi-tenant surface.
|
|
|
|
## Why not a skill
|
|
|
|
The repo already closed a Firecrawl skill PR in favor of ClawHub distribution. That is fine for optional user-installed prompt workflows, but it does not solve:
|
|
|
|
- deterministic tool availability,
|
|
- provider-grade config/credential handling,
|
|
- self-hosted endpoint support,
|
|
- caching,
|
|
- stable typed outputs,
|
|
- security review on network behavior.
|
|
|
|
This belongs as an extension, not a prompt-only skill.
|
|
|
|
## Success criteria
|
|
|
|
- Users can install/enable one extension and get reliable Firecrawl search/scrape without touching core defaults.
|
|
- Self-hosted Firecrawl works with config/env fallback.
|
|
- Extension endpoint fetches use guarded networking.
|
|
- No new Firecrawl-specific core onboarding/default behavior.
|
|
- Core can later adopt plugin-native `web_search` / `web_fetch` seams without redesigning the extension.
|
|
|
|
## Recommended implementation order
|
|
|
|
1. Build `firecrawl_scrape`
|
|
2. Build `firecrawl_search`
|
|
3. Add docs and examples
|
|
4. If desired, generalize `web_search` provider loading so the extension can back `web_search`
|
|
5. Only then consider a true `web_fetch` provider seam
|