From URL to AI-discoverable in six steps
Submit a URL. The pipeline crawls your site, chunks every page at semantic boundaries, enriches content with Claude, stores 384-dimensional embeddings in Qdrant, and exports 11 foundation files. AI agents find you because the data is structured for them.
Crawl
We read your entire site. AI agents get structured data, not raw HTML.
Breadth-first (BFS) crawling respects robots.txt, renders JavaScript, and deduplicates pages. Sitemap-first discovery with priority scoring indexes your highest-value pages first. Incremental recrawling detects changes via content-hash diffing and reprocesses only what changed.
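Content-hash diffing can be sketched in a few lines. This is an illustration only, not the pipeline's actual code: the function names, the whitespace normalization, and the choice of SHA-256 are all assumptions.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash page text after collapsing whitespace (assumed normalization)."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def pages_to_reprocess(previous: dict[str, str], crawled: dict[str, str]) -> list[str]:
    """Return URLs that are new or whose content changed since the last crawl.

    previous maps URL -> hash stored from the last crawl (hypothetical store).
    crawled maps URL -> freshly fetched page text.
    """
    changed = []
    for url, text in crawled.items():
        # A missing URL yields None, which never equals a hash, so new pages count too.
        if previous.get(url) != content_hash(text):
            changed.append(url)
    return changed
```

Pages whose hash matches the stored value are skipped, so a recrawl of a mostly unchanged site only pays for the pages that moved.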
Crawl4AI
Chunk
Content split at semantic boundaries, not character limits.
Semantic splitting breaks pages into 120-to-180-word chunks that respect sentence and paragraph boundaries. No mid-sentence cuts. No orphaned paragraphs. Each chunk carries enough context to stand alone when an AI agent retrieves it from the vector database.
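To make the idea concrete, here is a minimal sketch of sentence-aware chunking in plain Python. It is not the LlamaIndex splitter the pipeline uses: the regex sentence splitter and the greedy packing strategy are assumptions, and it handles sentence boundaries only, not paragraphs.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_text(text: str, max_words: int = 180) -> list[str]:
    """Pack whole sentences into chunks of at most max_words words.

    A sentence is never cut in half. A single sentence longer than
    max_words becomes its own oversized chunk, and the final chunk
    may be shorter than the others.
    """
    chunks, current, count = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        # Close the current chunk if this sentence would push it over the cap.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The 180-word cap mirrors the upper bound quoted above. A production splitter would also enforce the 120-word floor and keep paragraphs together.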
LlamaIndex
Annotate
Every entity gets a Wikidata Q-ID.
spaCy's en_core_web_md model identifies people, companies, products, and concepts in each chunk. Each entity is linked to a Wikidata Q-ID, the same identifier system behind Wikipedia. AI agents use these IDs to connect your content to the global knowledge graph.
spaCy + Wikidata
Enrich
Claude reads each page and tags it with 7 citeability signals.
Claude Haiku 4.5 reads each page (up to 8,000 characters) and extracts factual claims, generates Q&A pairs, classifies target audience, maps buyer journey stage, and detects content type. Each chunk gets a citeability score from 0 to 100, computed from 7 deterministic signals. Raw text becomes structured, citable knowledge.
Claude API
Store
Searchable by meaning, not keywords.
384-dimensional vector embeddings (all-MiniLM-L6-v2) are stored in isolated per-client Qdrant collections. Search for "eco-friendly packaging" and find your content about "sustainable shipping materials" even though the words don't match. Cross-encoder reranking (ms-marco-MiniLM-L-6-v2) surfaces the best results first.
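Searching by meaning comes down to comparing vectors. The sketch below ranks documents by cosine similarity using made-up 3-dimensional vectors. It does not call the embedding model or Qdrant, the document IDs are hypothetical, and cosine is assumed here as the distance metric.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_vec: list[float], docs: dict[str, list[float]]) -> list[str]:
    """Return document IDs ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in docs.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# Toy vectors standing in for real 384-dimensional embeddings.
docs = {
    "sustainable-shipping-materials": [0.9, 0.1, 0.0],
    "pricing-page": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stand-in for the "eco-friendly packaging" embedding
ranking = rank_by_similarity(query, docs)
```

In a two-stage setup like the one described, this first pass produces candidates and the cross-encoder then rescores the top few against the query text.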
Qdrant
Export
Every file format AI agents look for, regenerated on each crawl.
MCP server, NDJSON dumps, JSON feeds, robots.txt, llms.txt, llms-full.txt, llms-lite.txt, .well-known/vectors.json, .well-known/ai.json, ai-content-index.json, Schema.org knowledge graph, and sitemap.xml. All regenerated after every crawl cycle. No manual file creation.
11+ formats
Beyond the pipeline
Everything you need to run AI content infrastructure in production: security, analytics, integrations, and team controls.
Security That Earns Trust
- JWT + API key dual authentication on every request
- TOTP two-factor authentication with encrypted backup codes
- Per-client data isolation in Qdrant (separate vector collections, no shared data)
- SSRF validation blocks internal network URLs on all user-submitted inputs
- HMAC-SHA256 signed webhooks with delivery verification
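The SSRF check in the list above could look roughly like this. It is a sketch under assumptions, not the product's validator: the function name and the exact blocking rules are illustrative. It rejects any URL whose host resolves to a non-public address.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_safe_url(url: str) -> bool:
    """Return True only if the URL is http(s) and every resolved address is public."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False  # an unresolvable host is treated as unsafe
    for info in infos:
        # Drop any IPv6 zone suffix (e.g. "%eth0") before parsing the address.
        addr = ipaddress.ip_address(info[4][0].split("%")[0])
        # is_global is False for loopback, private, link-local, and reserved ranges.
        if not addr.is_global:
            return False
    return True
```

A check like this is only a first line of defense. The fetcher should connect to the same address that was validated, otherwise a DNS answer that changes between check and fetch can slip through.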
Analytics That Show What AI Agents Want
- See the exact queries AI agents run against your content
- Content gap detection: find queries where your site has no good answer
- Pipeline quality scorecard with citeability scores after each crawl
- Revenue, usage, and job dashboards in the admin panel
Integrations Built on Open Standards
- MCP server at /mcp: AI agents query your content over the Model Context Protocol (10 read-only tools)
- Webhooks fire on job lifecycle events (started, completed, failed) with HMAC verification
- REST API with OpenAPI 3.0 spec, interactive docs at /docs and /redoc
- .well-known discovery endpoints for automatic AI agent registration
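On the receiving end, verifying an HMAC-SHA256 signed webhook takes only Python's standard library. This is a minimal sketch: the function names are hypothetical, and it assumes the signature arrives as a hex string computed over the raw request body. Check the API docs for the actual header name and encoding.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Compute the hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, signature: str) -> bool:
    """Compare the received signature to the expected one in constant time."""
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected, signature)
```

Verify against the raw bytes you received, before any JSON parsing, since re-serializing the payload can change the bytes and break the signature.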
Built for Developers
- Self-service API key management: create, rotate, revoke from the dashboard
- Custom enrichment prompts per site to control how Claude tags your metadata
- Semantic search + cross-encoder reranking available via REST API
- Query expansion runs locally with zero external API cost
Notifications You Control
- In-app notifications for every job status change
- Email alerts when crawls fail, complete, or hit page limits
- Per-category notification preferences (toggle what matters, mute what doesn't)
- Webhook delivery logs with retry status for automated pipelines
Manage at Scale
- Organization and team management with invite links and role-based access (owner, admin, member)
- Immutable audit logs for every security-sensitive action
- Bulk site import via CSV or JSON for agencies managing multiple clients
- Feature flags and user impersonation for admin troubleshooting
See how AI-ready your site is
Free AI Readiness Scan checks 12 signals. No credit card. No signup. Results in 60 seconds.
Scan your site free