EchoKB turns messy legal, regulatory, and public sources into agent-ready Markdown, search, retrieval, and citations. We maintain the source pipelines so your agent doesn't hallucinate citations or improvise claims.
Six things that break agent products in production — and what EchoKB owns so your team doesn't have to.
Your scraper worked yesterday. Today the markup shifted, pagination broke, and the agent is silently missing documents.
Every source uses different fonts, layouts, scan quality, and column structures. Generic parsers leave you with garbage text.
Off-the-shelf OCR was trained for English receipts, not court rulings or regulatory filings in RTL scripts.
Lawyers and auditors won't accept hallucinated references. Every claim needs a traceable source, paragraph, and date.
Custom scripts per source pile up. Nobody owns them. They quietly fail until a customer complains.
Your team should be improving the agent's reasoning and UX — not debugging selectors and PDF parsers.
One pipeline per source family, maintained end-to-end. You bring the agent; we bring the knowledge supply chain.
We discover where the knowledge actually lives — index pages, archives, paginated lists — and how to keep finding new documents as the source publishes them.
Source mappers, sitemaps, RSS, custom discovery
Fetch and download documents at scale, respecting rate limits and robots, retrying transient failures, staying fresh.
Scheduled & webhook-triggered refresh
OCR scanned PDFs, parse DOCX, normalize HTML, all converted to clean Markdown the agent can actually read.
OCR, layout-aware PDF parsing, Markdown normalization
BM25 full-text search with morphology-aware Hebrew stemming and metadata filters. Tuned for high recall, not just top-5.
BM25, custom stemmers, metadata filters
Agent-friendly tools: search, retrieve_document, get_citation, list_sources, check_freshness. Drop them into your agent runtime.
REST API, OpenAPI spec, agent-tool definitions
Search, retrieval, and citation primitives shaped around how agents actually consume knowledge, not a search box for humans.
Return 50–200 candidates so your agent can rerank and reason — not just top-5 designed for human eyes.
Clean Markdown your agent can actually read. No HTML soup, no PDF artifacts, no broken layout.
Every result ships with source URL, document ID, paragraph reference, and date — ready to surface in your agent's answers.
Every claim traces back to an original document. Lawyers and auditors can verify; users can trust.
Narrow by jurisdiction, date, court, source family, document type. Your agent picks the right corpus before it searches.
When the source website changes, we fix it. You don't get a 3am Slack ping about a broken selector.
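To make the retrieval ideas above concrete (metadata filters, a wide BM25 candidate set, citation fields on every result), here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not EchoKB's implementation or API: the `Passage` record, the `bm25_candidates` function, the jurisdiction codes, and the example.org URLs are all hypothetical, and the whitespace tokenizer stands in for a real morphology-aware stemmer.

```python
import math
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Passage:
    # Hypothetical record shape; field names are illustrative, not EchoKB's schema.
    document_id: str
    source_url: str
    paragraph: int
    published: date
    jurisdiction: str
    text: str


def tokenize(text: str) -> list[str]:
    # Naive tokenizer: lowercase, strip basic punctuation. No stemming.
    tokens = (t.strip(".,;:()").lower() for t in text.split())
    return [t for t in tokens if t]


def bm25_candidates(query, passages, *, jurisdiction=None, limit=200, k1=1.5, b=0.75):
    """Filter by metadata, score with Okapi BM25, return up to `limit` cited results."""
    # Narrow the corpus before searching.
    pool = [p for p in passages if jurisdiction is None or p.jurisdiction == jurisdiction]
    if not pool:
        return []
    docs = [tokenize(p.text) for p in pool]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: in how many passages each term appears.
    df = Counter(term for d in docs for term in set(d))
    query_terms = tokenize(query)

    scored = []
    for passage, tokens in zip(pool, docs):
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scored.append((score, passage))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Every candidate carries the fields needed to cite it.
    return [
        {
            "score": score,
            "document_id": p.document_id,
            "source_url": p.source_url,
            "paragraph": p.paragraph,
            "date": p.published.isoformat(),
        }
        for score, p in scored[:limit]
    ]


# Placeholder corpus; URLs and jurisdiction codes are made up for the example.
corpus = [
    Passage("doc-1", "https://example.org/doc-1", 3, date(2021, 5, 4), "IL",
            "The court held that the lease was void."),
    Passage("doc-2", "https://example.org/doc-2", 1, date(2019, 1, 15), "IL",
            "The regulator published new filing rules."),
    Passage("doc-3", "https://example.org/doc-3", 7, date(2020, 9, 30), "US",
            "The court dismissed the lease dispute."),
]

filtered = bm25_candidates("court lease", corpus, jurisdiction="IL")
unfiltered = bm25_candidates("court lease", corpus)
```

The default `limit` of 200 reflects the high-recall idea: the agent receives a wide candidate list to rerank, and each entry already includes the source URL, document ID, paragraph reference, and date it would need to cite.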
EchoKB is the citation layer extracted from PolyLM, our citation-backed Hebrew legal search product. It already powers retrieval across verdicts, laws, and regulatory decisions, and it is proven in a domain where lawyers scrutinize every citation.
Tell us what your agent needs to search and cite. We'll scope a Mapper Pilot — typically 3–4 weeks from messy source to a citation-safe API.