How does VectorEO transform website content for AI?

Through a six-stage pipeline: crawl raw HTML, chunk at natural boundaries, annotate with entity linking, enrich with AI-extracted claims and metadata, store as vector embeddings, and export in multiple formats.

What formats does VectorEO export data in?

MCP server, NDJSON dumps, JSON feeds, robots.txt, llms.txt, and .well-known/vectors.json.

How does VectorEO handle web crawling?

Breadth-first (BFS) crawling respects robots.txt, renders JavaScript, and automatically deduplicates content.

What key features does VectorEO offer?

Security, analytics, integrations, developer-focused tools, and enterprise management capabilities.

How can I assess if my site is AI-ready?

VectorEO offers a free AI Readiness Scan with no credit card or signup required.

From URL to AI-discoverable in six steps

Submit a URL. The pipeline crawls your site, chunks every page at semantic boundaries, enriches content with Claude, stores 384-dimensional embeddings in Qdrant, and exports 11 foundation files. AI agents find you because the data is structured for them.

https://yoursite.com

Read your site: Every page, every link

Structure it: Clean, sized sections

Tag entities: People, places, products

Add context: Audience, intent, claims

Make it searchable: By meaning, not keywords

Serve to AI agents: ChatGPT, Claude, Perplexity

AI-Ready Content

Step 1

Crawl

We read your entire site. AI agents get structured data, not raw HTML.

BFS crawling follows robots.txt, renders JavaScript, and deduplicates pages. Sitemap-first discovery with priority scoring indexes your highest-value pages first. Incremental recrawling detects content changes via content-hash diffing and only reprocesses what changed.
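
As a rough illustration of the crawl loop described above, here is a minimal Python sketch of breadth-first traversal with content-hash deduplication and change detection. It is not VectorEO's implementation and does not use Crawl4AI. The names `fetch`, `bfs_crawl`, and the in-memory `SITE` map are hypothetical stand-ins, and robots.txt checks, JavaScript rendering, and sitemap priority scoring are left out.

```python
import hashlib
from collections import deque


def content_hash(text):
    """Stable fingerprint of a page body."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def bfs_crawl(start, fetch, previous_hashes=None):
    """Breadth-first crawl. fetch(url) must return (text, links).

    Returns (hashes, changed): hashes maps each unique page to its
    fingerprint; changed lists pages that are new or differ from
    previous_hashes (the result of an earlier crawl).
    """
    previous_hashes = previous_hashes or {}
    seen_urls = {start}
    seen_content = set()
    queue = deque([start])
    hashes, changed = {}, []
    while queue:
        url = queue.popleft()
        text, links = fetch(url)
        digest = content_hash(text)
        if digest not in seen_content:  # skip pages with duplicate content
            seen_content.add(digest)
            hashes[url] = digest
            if previous_hashes.get(url) != digest:
                changed.append(url)
        for link in links:
            if link not in seen_urls:
                seen_urls.add(link)
                queue.append(link)
    return hashes, changed


# Hypothetical three-page site; "/b" duplicates the content of "/a".
SITE = {
    "/": ("Home", ["/a", "/b"]),
    "/a": ("About", ["/", "/b"]),
    "/b": ("About", []),
}

hashes, changed = bfs_crawl("/", SITE.__getitem__)
_, changed_again = bfs_crawl("/", SITE.__getitem__, previous_hashes=hashes)
```

Passing the hashes from one run as `previous_hashes` to the next means the second run reports only pages whose content changed, which is the idea behind incremental recrawling.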

Crawl4AI

Step 2

Chunk

Content split at semantic boundaries, not character limits.

Semantic splitting breaks pages into 120-to-180-word chunks that respect sentence and paragraph boundaries. No mid-sentence cuts. No orphaned paragraphs. Each chunk carries enough context to stand alone when an AI agent retrieves it from the vector database.
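
To make the boundary rule concrete, here is a minimal Python sketch that packs whole sentences into chunks capped at 180 words. It is an assumption-laden simplification: it respects sentence boundaries only, does no semantic analysis, and does not use LlamaIndex. The function name `chunk_text` and the regex sentence splitter are illustrative choices, not VectorEO's code.

```python
import re


def chunk_text(text, max_words=180):
    """Greedily pack whole sentences into chunks of at most max_words words.

    A sentence is never cut in the middle. The final chunk may be
    shorter than the others; a production splitter would rebalance it.
    """
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


# Sample input: 100 sentences of seven words each.
sample = " ".join(
    f"Sentence number {i} has exactly seven words." for i in range(100)
)
chunks = chunk_text(sample)
```

With sentences of ordinary length, greedy packing against a 180-word cap tends to produce chunks in the upper part of the 120-to-180-word range, because a chunk closes only when the next sentence would overflow it.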

LlamaIndex

Step 3

Annotate

Every entity gets a Wikidata Q-ID.

spaCy's en_core_web_md model identifies people, companies, products, and concepts in each chunk. Each entity is linked to a Wikidata Q-ID, the same identifier system behind Wikipedia. AI agents use these IDs to connect your content to the global knowledge graph.

spaCy + Wikidata

Step 4

Enrich

Claude reads each page, and every chunk gets a citeability score built from 7 signals.

Claude Haiku 4.5 reads each page (up to 8,000 characters) and extracts factual claims, generates Q&A pairs, classifies target audience, maps buyer journey stage, and detects content type. Each chunk gets a citeability score from 0 to 100, computed from 7 deterministic signals. Raw text becomes structured, citable knowledge.

Claude API

Step 5

Store

Searchable by meaning, not keywords.

384-dimensional vector embeddings (all-MiniLM-L6-v2) stored in isolated per-client Qdrant collections. Search for "eco-friendly packaging" and find your content about "sustainable shipping materials" even though the words don't match. Cross-encoder reranking (ms-marco-MiniLM-L-6-v2) surfaces the best results first.
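
The sketch below shows, under simplifying assumptions, how matching by meaning works once text is embedded: documents are ranked by cosine similarity to the query vector. The three-dimensional vectors are hand-made toy values standing in for real 384-dimensional embeddings, and `cosine`, `search`, and `INDEX` are hypothetical names. No embedding model, Qdrant client, or reranker is involved here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, index, top_k=3):
    """Return the top_k (score, doc_id) pairs, best match first."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    return sorted(scored, reverse=True)[:top_k]


# Toy vectors (assumed values, not real model output).
INDEX = {
    "sustainable shipping materials": [0.9, 0.1, 0.0],
    "quarterly pricing update": [0.0, 0.2, 0.9],
}
QUERY = [0.8, 0.2, 0.1]  # stands in for "eco-friendly packaging"

results = search(QUERY, INDEX, top_k=1)
```

The query shares no words with the top result. The match comes entirely from the vectors pointing in a similar direction, which is what an embedding model is trained to produce for related phrases.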

Qdrant

Step 6

Export

Every file format AI agents look for, regenerated on each crawl.

MCP server, NDJSON dumps, JSON feeds, robots.txt, llms.txt, llms-full.txt, llms-lite.txt, .well-known/vectors.json, .well-known/ai.json, ai-content-index.json, Schema.org knowledge graph, and sitemap.xml. All regenerated after every crawl cycle. No manual file creation.
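
For readers unfamiliar with NDJSON, the following Python sketch shows the format's one rule: one JSON object per line. The record fields (`id`, `url`, `text`) are hypothetical and are not VectorEO's actual export schema.

```python
import json


def to_ndjson(records):
    """Serialize records as newline-delimited JSON, one object per line."""
    # json.dumps escapes newlines inside strings, so each record stays on one line.
    return "".join(json.dumps(r, sort_keys=True) + "\n" for r in records)


def from_ndjson(payload):
    """Parse newline-delimited JSON back into a list of objects."""
    return [json.loads(line) for line in payload.split("\n") if line.strip()]


# Hypothetical chunk records.
records = [
    {"id": 1, "url": "https://yoursite.com/", "text": "First chunk.\nSecond line."},
    {"id": 2, "url": "https://yoursite.com/about", "text": "Another chunk."},
]

payload = to_ndjson(records)
restored = from_ndjson(payload)
```

Because every line is a complete JSON document, a consumer can stream the file and process chunks one at a time without loading the whole dump into memory.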

11+ formats

Beyond the pipeline

Everything you need to run AI content infrastructure in production: security, analytics, integrations, and team controls.

Security That Earns Trust

JWT + API key dual authentication on every request
TOTP two-factor authentication with encrypted backup codes
Per-client data isolation in Qdrant (separate vector collections, no shared data)
SSRF validation blocks internal network URLs on all user-submitted inputs
HMAC-SHA256 signed webhooks with delivery verification
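
As an illustration of the SSRF check listed above, here is a minimal Python sketch that rejects URLs whose host resolves to a non-public address. It is one common approach, not VectorEO's code. The function name `is_safe_url` and the injectable `resolve` parameter are assumptions made for the example.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_safe_url(url, resolve=socket.getaddrinfo):
    """Return True only if url is http(s) and every resolved address is public."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = resolve(parsed.hostname, parsed.port or 443)
    except (OSError, ValueError):  # DNS failure or malformed port
        return False
    if not infos:
        return False
    for info in infos:
        host = info[4][0].split("%")[0]  # drop any IPv6 scope suffix
        # is_global is False for loopback, private, link-local, and reserved ranges.
        if not ipaddress.ip_address(host).is_global:
            return False
    return True
```

A check like this only covers the moment of validation. A real fetcher would also need to connect to the address that was validated and re-check redirects, since DNS answers can change between the check and the request.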

Analytics That Show What AI Agents Want

See the exact queries AI agents run against your content
Content gap detection: find queries where your site has no good answer
Pipeline quality scorecard with citeability scores after each crawl
Revenue, usage, and job dashboards in the admin panel

Integrations Built on Open Standards

MCP server at /mcp: AI agents query your content over the Model Context Protocol (10 read-only tools)
Webhooks fire on job lifecycle events (started, completed, failed) with HMAC verification
REST API with OpenAPI 3.0 spec, interactive docs at /docs and /redoc
.well-known discovery endpoints for automatic AI agent registration
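
On the receiving end of a webhook, HMAC verification can be sketched in a few lines of Python. This is a generic HMAC-SHA256 example using the standard library. The hex encoding of the signature and the way the secret and signature reach your handler are assumptions, so check the webhook documentation for the actual header name and format.

```python
import hashlib
import hmac


def sign(secret, body):
    """Hex-encoded HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret, body, signature_hex):
    """Constant-time check that signature_hex matches the body."""
    expected = sign(secret, body).encode("ascii")
    return hmac.compare_digest(expected, signature_hex.encode("utf-8"))


# Hypothetical secret and payload.
secret = b"whsec_example"
body = b'{"event": "completed", "job_id": 42}'
signature = sign(secret, body)

ok = verify(secret, body, signature)
tampered = verify(secret, body + b" ", signature)
```

Verify against the raw bytes of the request body, before any JSON parsing. Re-serializing the payload can change whitespace or key order and make a valid signature fail.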

Built for Developers

Self-service API key management: create, rotate, revoke from the dashboard
Custom enrichment prompts per site to control how Claude tags your metadata
Semantic search + cross-encoder reranking available via REST API
Query expansion runs locally with zero external API cost

Notifications You Control

In-app notifications for every job status change
Email alerts when crawls fail, complete, or hit page limits
Per-category notification preferences (toggle what matters, mute what doesn't)
Webhook delivery logs with retry status for automated pipelines

Manage at Scale

Organization and team management with invite links and role-based access (owner, admin, member)
Immutable audit logs for every security-sensitive action
Bulk site import via CSV or JSON for agencies managing multiple clients
Feature flags and user impersonation for admin troubleshooting

See how AI-ready your site is

Free AI Readiness Scan checks 12 signals. No credit card. No signup. Results in 60 seconds.

Scan your site free