Citation-grade knowledge infrastructure for AI agents

EchoKB turns messy legal, regulatory, and public sources into agent-ready Markdown, search, retrieval, and citations. We maintain the source pipelines so your agent doesn't hallucinate citations or improvise claims.

Markdown-first · Citation-ready · High-recall BM25
GET /v1/search?q=...&limit=50
{
  "results": [...],
  "citations": [{
    "doc_id": "hcj-1234/22",
    "url": "https://...",
    "snippet": "...",
    "paragraph": 14
  }]
}
Every claim traces back to a source document.
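
As a rough illustration, the request and response above could be handled from Python as follows. This is a minimal sketch, not an official client: the base URL is a placeholder assumption, and the helper names `build_search_url` and `extract_citations` are hypothetical. Only the `/v1/search` path, the `q` and `limit` parameters, and the citation field names come from the sample.

```python
import json
from urllib.parse import urlencode

# Placeholder assumption: the real API host is not stated in the sample.
BASE_URL = "https://api.example.com"


def build_search_url(query: str, limit: int = 50, base_url: str = BASE_URL) -> str:
    """Build the GET /v1/search URL with a URL-encoded query string."""
    return f"{base_url}/v1/search?{urlencode({'q': query, 'limit': limit})}"


def extract_citations(response_body: str) -> list[dict]:
    """Parse a JSON response and keep the citation fields shown in the sample."""
    payload = json.loads(response_body)
    fields = ("doc_id", "url", "snippet", "paragraph")
    return [
        {field: citation.get(field) for field in fields}
        for citation in payload.get("citations", [])
    ]
```

Building the URL separately from sending the request keeps the sketch independent of any particular HTTP library.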

AI agents fail when their knowledge layer is brittle

Six things that break agent products in production — and what EchoKB owns so your team doesn't have to.

Source websites change without warning

Your scraper worked yesterday. Today the markup shifted, pagination broke, and the agent is silently missing documents.

PDFs and DOCX files are inconsistent

Different fonts, layouts, scan quality, and column structures. Generic parsers leave you with garbage text.

OCR misses Hebrew, legal, and multi-column layouts

Off-the-shelf OCR was trained for English receipts, not court rulings or regulatory filings in RTL scripts.

Citations are hard to verify under scrutiny

Lawyers and auditors won't accept hallucinated references. Every claim needs a traceable source, paragraph, and date.

One-off crawlers rot the moment they ship

Custom scripts per source pile up. Nobody owns them. They quietly fail until a customer complains.

Engineering time goes to scraping, not the agent

Your team should be improving the agent's reasoning and UX — not debugging selectors and PDF parsers.

Map → Crawl → Extract → Index → Serve

One pipeline per source family, maintained end-to-end. You bring the agent; we bring the knowledge supply chain.

1. Map

We discover where the knowledge actually lives — index pages, archives, paginated lists — and how to keep finding new documents as the source publishes them.

Source mappers, sitemaps, RSS, custom discovery

2. Crawl

Fetch and download documents at scale, respecting rate limits and robots.txt, retrying transient failures, and keeping the corpus fresh.

Scheduled & webhook-triggered refresh
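
To make "retrying transient failures" concrete, here is a minimal sketch of retry with exponential backoff. It is an illustration under stated assumptions, not EchoKB's crawler: `TransientError`, `fetch_with_retry`, and the injected `fetch` callable are all hypothetical names.

```python
import time
from typing import Callable


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 5xx responses)."""


def fetch_with_retry(
    fetch: Callable[[str], bytes],
    url: str,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch(url), retrying transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except TransientError:
            # Give up after the final attempt; otherwise wait 1x, 2x, 4x, ... base_delay.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter lets the backoff schedule be checked in tests without actually waiting.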

3. Extract

OCR scanned PDFs, parse DOCX, normalize HTML — all converted to clean Markdown the agent can actually read.

OCR, layout-aware PDF parsing, Markdown normalization

4. Index

BM25 full-text search with morphology-aware Hebrew stemming and metadata filters. Tuned for high recall, not just top-5.

BM25, custom stemmers, metadata filters
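
For readers unfamiliar with BM25, the scoring idea can be sketched in a few lines of Python. This is the textbook Okapi BM25 formula with a simple whitespace tokenizer, offered as an illustration only: it leaves out the Hebrew stemming and metadata filters described above, and the function name `bm25_rank` and the parameter defaults are assumptions.

```python
import math
from collections import Counter


def bm25_rank(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[tuple[int, float]]:
    """Return (doc_index, score) pairs sorted by BM25 score, highest first."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs if n_docs else 0.0
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for idx, tokens in enumerate(tokenized):
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rarer terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization dampens the advantage of long documents.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append((idx, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

A high-recall setup would return a long slice of this ranking to the agent rather than cutting it off after the first few hits.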

5. Serve

Agent-friendly tools: search, retrieve_document, get_citation, list_sources, check_freshness. Drop them into your agent runtime.

REST API, OpenAPI spec, agent-tool definitions
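
The tools listed above might be declared to an agent runtime roughly like this. The five tool names come from the list above; everything else (the `make_tool` helper, the descriptions, and every parameter name and type) is a hypothetical illustration in a generic JSON-Schema style, not the published tool definitions.

```python
def make_tool(name: str, description: str, properties: dict, required: list[str]) -> dict:
    """Build one tool definition in a generic JSON-Schema style."""
    return {
        "name": name,
        "description": description,
        "parameters": {"type": "object", "properties": properties, "required": required},
    }


# Parameter names and descriptions below are assumptions for illustration only.
TOOLS = [
    make_tool("search", "Full-text search returning candidates with citations.",
              {"q": {"type": "string"}, "limit": {"type": "integer"}}, ["q"]),
    make_tool("retrieve_document", "Fetch one document as Markdown.",
              {"doc_id": {"type": "string"}}, ["doc_id"]),
    make_tool("get_citation", "Return citation details for a document.",
              {"doc_id": {"type": "string"}, "paragraph": {"type": "integer"}}, ["doc_id"]),
    make_tool("list_sources", "List the available source families.", {}, []),
    make_tool("check_freshness", "Report when a source was last refreshed.",
              {"source": {"type": "string"}}, ["source"]),
]
```

Most agent runtimes accept something close to this shape, so the list can usually be adapted with a thin wrapper.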

Built for agents, not human readers

Search, retrieval, and citation primitives shaped around how agents actually consume knowledge — not a search box for humans.

High-recall BM25 search

Return 50–200 candidates so your agent can rerank and reason over them, rather than a top-5 list designed for human eyes.

Markdown retrieval

Clean Markdown your agent can actually read. No HTML soup, no PDF artifacts, no broken layout.

Citation-ready snippets

Every result ships with source URL, document ID, paragraph reference, and date — ready to surface in your agent's answers.
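
One way an agent might surface such a result is as a Markdown link. The sketch below assumes a citation dictionary with `doc_id`, `url`, and `paragraph` keys, as in the sample response; the `date` key and the function name `format_citation` are hypothetical.

```python
def format_citation(citation: dict) -> str:
    """Render a citation as a Markdown link: [doc_id, para. N, date](url)."""
    parts = [citation["doc_id"]]
    if citation.get("paragraph") is not None:
        parts.append(f"para. {citation['paragraph']}")
    # "date" is an assumed key name; it is skipped when absent.
    if citation.get("date"):
        parts.append(str(citation["date"]))
    label = ", ".join(parts)
    return f"[{label}]({citation['url']})"
```

Keeping the paragraph reference in the link text lets a reviewer jump straight to the cited passage.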

Source provenance

Every claim traces back to an original document. Lawyers and auditors can verify; users can trust.

Metadata filters

Narrow by jurisdiction, date, court, source family, document type. Your agent picks the right corpus before it searches.

Maintained pipelines

When the source website changes, we fix it. You don't get a 3am Slack ping about a broken selector.

Built from real agent infrastructure

PolyLM

Origin product

EchoKB is the citation layer extracted from PolyLM, our cited Hebrew legal search product. It already powers retrieval across verdicts, laws, and regulatory decisions — proven on a domain where citations get scrutinized by lawyers.

  • Hebrew-aware morphology and stemming, tuned on legal corpora
  • Court rulings, primary legislation, regulator publications
  • Citation-grade answers that hold up in front of practitioners

Detailed case study coming soon

Bring us one source. We'll make it agent-ready.

Tell us what your agent needs to search and cite. We'll scope a Mapper Pilot — typically 3–4 weeks from messy source to a citation-safe API.