Tested 3 Knowledge Base Setups So You Don’t Have To – Vapi

In “Tested 3 Knowledge Base Setups So You Don’t Have To – Vapi” you get a hands-on walkthrough of three ways to connect your AI assistant to a knowledge base, covering company FAQs, pricing details, and external product information. Henryk Brzozowski runs two rounds of calls so you can see which approach delivers the most accurate answers with the fewest hallucinations.

You’ll find side-by-side comparisons of an internal upload, an external Make.com call, and Vapi’s new query tool, along with the prompt setups, test process, timestamps for each result, and clear takeaways to help you pick the simplest, most reliable setup for your projects.

Overview of the three knowledge base setups tested

High-level description of each setup tested

You tested three KB integration approaches: an internal upload, where documents are ingested directly into your assistant’s environment; an external Make.com call, where the assistant requests KB answers via a webhook or API orchestrated by Make.com; and Vapi’s query tool, which connects to knowledge sources and handles retrieval and normalization before returning results to the LLM. Each approach represents a distinct architectural pattern: localized ingestion and retrieval, externalized orchestration, and a managed query service with built-in tooling.

Why these setups are representative of common approaches

These setups mirror the common choices you’ll make in real projects: you either store and index content within your own stack, call out to external automation platforms, or adopt a vendor-managed query layer like Vapi that abstracts retrieval. They cover tradeoffs between control, simplicity, latency, and maintainability, and therefore are representative for teams deciding where to put complexity and trust.

Primary goals of the tests and expected tradeoffs

Your primary goals were to measure factual accuracy, hallucination rate, latency, and citation precision across setups. You expected internal upload to yield better citation fidelity but require more maintenance, Make.com to be flexible but potentially slower and flaky under network constraints, and Vapi to offer convenience, normalization, and predictable behavior at the cost of some vendor lock-in. The tests aimed to quantify those expectations.

How the video context and audience shaped experiment design

Because the experiment was presented in a short video aimed at builders and product teams, you prioritized clarity and reproducibility. Test cases reflected typical user queries—FAQs, pricing, and third-party docs—so that viewers could map results to their own use cases. You also designed rounds to be repeatable and to illustrate practical tweaks that a developer or product manager can apply quickly.

Tools, environment, and baseline components

Models and LLM providers used during testing

You used mainstream LLMs available at test time (open-source and API-based options were considered) to simulate real production choices. The goal was to keep the model layer consistent across setups to focus analysis on KB integration differences rather than model variability. This ensured that accuracy differences were due to retrieval and prompt engineering, not the underlying generative model.

Orchestration and automation tools including Make.com and Vapi

Make.com served as the external orchestrator that accepted webhooks, performed transformations, and queried external KB endpoints. Vapi was used for its new query tool that abstracts retrieval from multiple sources. You also used lightweight scripts and automation to run repeated calls and capture logs so you could compare latency, response formatting, and source citations across runs.

Knowledge base source types and formats used

The KB corpus included company FAQs, structured pricing tables, and third-party product documentation in a mix of PDFs, HTML pages, and markdown files. This variety tested both text extraction fidelity and retrieval relevance for different document formats, simulating the heterogeneous data you typically have to support.

Versioning, API keys, and environment configuration notes

You kept API keys and model versions pinned to ensure reproducibility across rounds, and documented environment variables and configuration files. Versioning for indexes and embeddings was tracked so you could roll back to prior setups. This disciplined configuration prevented accidental drift in results between round one and round two.
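
As a small illustration of that discipline, the sketch below records pinned components into a manifest file; the model identifiers, index version, and environment values are hypothetical placeholders, not the ones used in the video.

```python
# Hypothetical reproducibility manifest: pin model, embedding, and index
# versions so round one and round two run against identical components.
import json

manifest = {
    "llm_model": "example-llm-2024-06",        # placeholder model identifier
    "embedding_model": "example-embed-v2",     # placeholder embedding model
    "vector_index_version": "kb-index-0.3.1",  # version of the built index
    "env": {"REGION": "eu-central-1", "TIMEOUT_SECONDS": 30},
}

with open("run_manifest.json", "w", encoding="utf-8") as f:
    json.dump(manifest, f, indent=2)
```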

Test hardware, network conditions, and reproducibility checklist

Tests ran on a stable cloud instance with consistent network bandwidth and latency baselines to avoid noisy measurements. You recorded the machine type, region, and approximate network metrics in a reproducibility checklist so someone else could reasonably reproduce performance and latency figures. You also captured logs, request traces, and timestamps for each call.

Setup A internal upload: architecture and flow

How internal upload works end-to-end

With internal upload, you ingest KB files directly into your application: documents are parsed, chunked, embedded, and stored in your vector index. When a user asks a question, you perform a similarity search within that index, retrieve top passages, and construct a prompt that combines the retrieved snippets and the user query before sending it to the LLM for completion.

Data ingestion steps and file formats supported

Ingestion involved parsing PDFs, scraping or converting HTML, and accepting markdown and plain text. Files were chunked with sliding windows to preserve context, normalized to remove boilerplate, and then embedded. Metadata like document titles and source URLs were stored alongside embeddings to support precise citations.
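
A minimal sketch of that sliding-window chunking with per-chunk citation metadata; the window size, overlap, and example document are illustrative defaults, not the exact values from the tests.

```python
# Split a document into overlapping word-window chunks and attach source
# metadata so each retrieved passage can be cited precisely.
from typing import Dict, List

def chunk_document(text: str, title: str, url: str,
                   window: int = 200, overlap: int = 50) -> List[Dict]:
    words = text.split()
    step = max(window - overlap, 1)
    chunks = []
    for start in range(0, len(words), step):
        piece = " ".join(words[start:start + window])
        if not piece:
            break
        chunks.append({
            "text": piece,
            "title": title,        # stored alongside the embedding for citations
            "source_url": url,
            "position": start,     # word offset, useful for locating the span
        })
        if start + window >= len(words):
            break
    return chunks

# Example usage with a toy FAQ entry:
faq = "Our starter plan costs 19 USD per month. " * 30
print(len(chunk_document(faq, "Pricing FAQ", "https://example.com/faq")))
```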

Indexing, embedding, and retrieval mechanics

You used an embedding model to turn chunks into vectors and stored them in a vector store with approximate nearest neighbor search. Search returned the top N passages by similarity score; relevance tuning adjusted chunk size and overlap. The retrieval step included simple scoring thresholds and optional reranking to prioritize authoritative documents like official FAQs.
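
A sketch of the threshold-and-rerank step, assuming similarity scores already returned by the vector store; the minimum score, the small authority boost for FAQ documents, and the example hits are all illustrative.

```python
# Filter ANN results by a similarity threshold, then lightly boost
# authoritative sources (e.g. official FAQs) before taking the top N.
from typing import Dict, List

def rerank(hits: List[Dict], top_n: int = 3,
           min_score: float = 0.75, authority_boost: float = 0.05) -> List[Dict]:
    kept = [h for h in hits if h["score"] >= min_score]   # drop weak matches
    for h in kept:
        if h.get("doc_type") == "faq":                    # favour official docs
            h["score"] = min(h["score"] + authority_boost, 1.0)
    return sorted(kept, key=lambda h: h["score"], reverse=True)[:top_n]

hits = [
    {"text": "Starter plan: 19 USD/month", "score": 0.82, "doc_type": "faq"},
    {"text": "Blog post about pricing",    "score": 0.84, "doc_type": "blog"},
    {"text": "Unrelated changelog entry",  "score": 0.60, "doc_type": "docs"},
]
print([h["text"] for h in rerank(hits)])
# ['Starter plan: 19 USD/month', 'Blog post about pricing']
```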

Typical prompt flow and where the KB is referenced

The prompt assembled a short system instruction, the user’s question, and the retrieved KB snippets annotated with source metadata. You instructed the LLM to answer only using the provided snippets and to cite sources verbatim. This direct inclusion keeps grounding tight and reduces the model’s tendency to hallucinate beyond what the KB supports.
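
A sketch of that prompt assembly, with snippets tagged by title and URL so the model can cite them; the wording of the system instruction is a paraphrase of the described behaviour, not the exact template from the video.

```python
# Assemble a grounded prompt: system instruction, tagged snippets, question.
from typing import Dict, List

SYSTEM = ("Answer using ONLY the sources below. Cite the source tag for every "
          "factual claim. If the sources do not contain the answer, say so.")

def build_prompt(question: str, snippets: List[Dict]) -> str:
    tagged = "\n".join(
        f"[S{i + 1} | {s['title']} | {s['source_url']}]\n{s['text']}"
        for i, s in enumerate(snippets)
    )
    return f"{SYSTEM}\n\nSources:\n{tagged}\n\nQuestion: {question}\nAnswer:"

snippets = [{"title": "Pricing FAQ", "source_url": "https://example.com/faq",
             "text": "The starter plan costs 19 USD per month."}]
print(build_prompt("How much is the starter plan?", snippets))
```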

Pros and cons for small to medium datasets

For small to medium datasets, internal upload gives you control, low external latency, and easier provenance for citations. However, you must maintain ingestion pipelines, update embeddings when content changes, and provision storage and compute for the index. It’s a good fit when you need predictable behavior and can afford the maintenance overhead.

Setup B external Make.com call: architecture and flow

How an external Make.com webhook or API call integrates with the assistant

In this approach the assistant calls a Make.com webhook with the user question and context. Make.com handles the retrieval logic, calling external APIs or databases and returning an enriched answer or raw content back to the assistant. The assistant then formats or post-processes the Make.com output before returning it to the user.
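
A sketch of the assistant-side call, assuming a hypothetical webhook URL and a simple JSON contract (question and context in, either a ready answer or raw passages out); match the field names to whatever your Make.com scenario actually returns.

```python
# Send the user question to a Make.com webhook and post-process the reply.
# The URL and payload/response fields are hypothetical placeholders.
import requests

WEBHOOK_URL = "https://hook.eu1.make.com/your-webhook-id"  # placeholder

def ask_kb_via_make(question: str, context: dict, timeout: float = 10.0) -> str:
    payload = {"question": question, "context": context}
    resp = requests.post(WEBHOOK_URL, json=payload, timeout=timeout)
    resp.raise_for_status()
    data = resp.json()
    # Expect either a ready answer or raw passages to ground the LLM with.
    if "answer" in data:
        return data["answer"]
    passages = data.get("passages", [])
    return "\n".join(p.get("text", "") for p in passages)
```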

Data retrieval patterns and network round trips

Because Make.com acts as a middleman, each request typically involves multiple network hops: assistant → Make.com → KB or external API → Make.com → assistant. This yields more round trips and potential latency, especially for multi-step retrieval or enrichment workflows that call several endpoints.

Handling of rate limits, retries, and timeouts

You implemented retry logic, exponential backoff, and request throttling inside Make.com where possible, and at the assistant layer you detected timeouts and returned graceful fallback messages. Make.com provided some built-in throttling, but you still needed to plan for API rate limits from third-party sources and to design idempotent operations for reliable retries.
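
A sketch of that assistant-layer behaviour, reusing the ask_kb_via_make helper sketched earlier; the retry count, backoff schedule, and fallback wording are illustrative.

```python
# Retry transient webhook failures with exponential backoff, then fall back
# to a graceful message instead of letting the call error out.
import time
import requests

def ask_with_retries(question: str, context: dict,
                     attempts: int = 3, base_delay: float = 1.0) -> str:
    for attempt in range(attempts):
        try:
            return ask_kb_via_make(question, context, timeout=10.0)
        except (requests.Timeout, requests.ConnectionError):
            if attempt == attempts - 1:
                break
            time.sleep(base_delay * (2 ** attempt))      # 1s, 2s, 4s, ...
        except requests.HTTPError as err:
            if err.response is not None and err.response.status_code == 429:
                time.sleep(base_delay * (2 ** attempt))  # rate limited: back off
                continue
            raise                                        # other HTTP errors: surface them
    return ("I couldn't reach the knowledge base just now. "
            "Please try again in a moment or contact support.")
```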

Using Make.com to transform or enrich KB responses

Make.com excels at transformation: you used it to fetch raw documents, extract structured fields like pricing tiers, normalize date formats, and combine results from multiple sources before returning a consolidated payload. This allowed the assistant to receive cleaner, ready-to-use context and reduced the amount of prompt engineering required to parse heterogeneous inputs.
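
To make that contract concrete, here is the kind of consolidated payload such a scenario might hand back to the assistant, expressed as Python for readability; the field names and values are hypothetical.

```python
# Example of a consolidated, normalized payload the assistant might receive
# after Make.com has fetched, extracted, and merged several sources.
consolidated = {
    "question": "What does the Pro plan cost?",
    "pricing_tiers": [
        {"name": "Starter", "price_usd_per_month": 19},
        {"name": "Pro",     "price_usd_per_month": 49},
    ],
    "faq_excerpts": [
        {"text": "Annual billing gives two months free.",
         "source_url": "https://example.com/faq"},
    ],
    "fetched_at": "2024-01-01T12:00:00Z",  # normalized ISO-8601 timestamp
}

# The assistant can drop this straight into the prompt as structured context.
```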

Pros and cons for highly dynamic or externalized data

Make.com is attractive when your KB is highly dynamic or lives in third-party systems because it centralizes integration logic and can react quickly to upstream changes. The downsides are added latency, network reliability dependencies, and the need to maintain automation scenarios inside Make.com. It’s ideal when you want externalized control and transformation without reingesting everything into your local index.

Setup C Vapi query tool: architecture and flow

How Vapi’s query tool connects to knowledge sources and LLMs

Vapi’s query tool acts as a managed retrieval and normalization layer. You configured connections to your document sources, set retrieval policies, and then invoked Vapi from the assistant to run queries. Vapi returned normalized passages and metadata ready to be included in prompts or used directly in answer generation.
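
The video does not spell out the exact API surface, so the sketch below uses a generic, hypothetical HTTP endpoint and field names purely to illustrate the call pattern (send a query, get back normalized passages with metadata); consult Vapi's documentation for the real interface.

```python
# Hypothetical call pattern for a managed query layer: send the question,
# receive normalized passages plus metadata ready for prompt assembly.
# Endpoint, headers, and field names are illustrative, NOT Vapi's actual API.
import requests

def query_knowledge(question: str, api_key: str,
                    base_url: str = "https://api.example.com/v1/query") -> list:
    resp = requests.post(
        base_url,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"query": question, "max_snippets": 3},
        timeout=10.0,
    )
    resp.raise_for_status()
    # Assumed response shape: [{"text": ..., "title": ..., "source_url": ...}, ...]
    return resp.json().get("results", [])
```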

Built-in retrieval, caching, and result normalization features

Vapi provided built-in retrieval drivers for common document sources, automatic caching of recent queries to reduce latency, and normalization that standardized formats and flattened nested content. This reduced your need to implement custom extraction logic and helped create consistent, citation-ready snippets.

How prompts are assembled and tool-specific controls

The tool returned content with metadata that you could use to assemble prompts: you specified the number of snippets, maximum token lengths, and whether Vapi should prefilter for authority before returning results. These controls let you trade off comprehensiveness for brevity and guided how the LLM should treat the returned material.

When to choose Vapi for enterprise vs small projects

You should choose Vapi when you want a low-maintenance, scalable retrieval layer with features like caching and normalization—particularly useful for enterprises with many data sources or strict SLAs. For small projects, Vapi can be beneficial if you prefer not to build ingestion pipelines, but it may be overkill if your corpus is tiny and you prefer full local control.

Potential limitations and extension points

Limitations include dependence on a third-party service, potential costs, and constraints in customizing retrieval internals. Extension points exist via webhooks, pre/post-processing hooks, and the ability to augment Vapi’s returned snippets with your own business logic or additional verification steps.

Prompt engineering and guidance used across setups

Prompt templates and examples for factual Q&A

You used standardized prompt templates that included a brief system role, an instruction to answer only from provided sources, the retrieved snippets with source tags, and a user question. Example instructions forced the model to state “I don’t know” when the answer wasn’t supported, and to list exact source lines for any factual claim.
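
A sketch of what such a template might look like; the wording is a paraphrase of the behaviour described above, not the exact template used in the tests.

```python
# Standardized factual-Q&A template: grounding rule, refusal rule, and a
# citation requirement, with placeholders for the snippets and question.
FACTUAL_QA_TEMPLATE = """\
You are a support assistant. Answer the question using ONLY the sources below.

Rules:
- If the sources do not contain the answer, reply exactly: "I don't know."
- For every factual claim, quote the exact source line that supports it,
  prefixed with its source tag (e.g. [S1]).
- Do not add information that is not in the sources.

Sources:
{snippets}

Question: {question}
Answer:"""

prompt = FACTUAL_QA_TEMPLATE.format(
    snippets="[S1 | Pricing FAQ] The starter plan costs 19 USD per month.",
    question="How much is the starter plan?",
)
```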

Strategies to reduce hallucination risk

To reduce hallucinations you constrained the model with explicit instructions to refuse to answer outside the retrieved content, used conservative retrieval thresholds, and added verification prompts that asked the model to point to the snippet that supports each claim. You also used token limits to prevent the model from inventing long unsupported explanations.

Context window management and how KB snippets are included

You managed the context window by summarizing or truncating less relevant snippets and including only the top-ranked passages. You prioritized source diversity and completeness while ensuring the prompt stayed within the model’s token budget. For longer queries, you used a short-chain approach: retrieve, summarize, then ask the model with the condensed context.
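
A sketch of a simple token-budget guard for that step, using a rough words-to-tokens heuristic instead of a real tokenizer; swap in your model's tokenizer and budget for accurate counts.

```python
# Keep only as many top-ranked snippets as fit the prompt's token budget,
# truncating the last one if needed. Uses a crude ~0.75 words-per-token guess.
from typing import Dict, List

def approx_tokens(text: str) -> int:
    return int(len(text.split()) / 0.75) + 1   # rough heuristic, not a tokenizer

def fit_to_budget(snippets: List[Dict], budget_tokens: int = 1500) -> List[Dict]:
    kept, used = [], 0
    for s in snippets:                          # assumed already ranked best-first
        cost = approx_tokens(s["text"])
        if used + cost <= budget_tokens:
            kept.append(s)
            used += cost
        else:
            remaining_words = max(int((budget_tokens - used) * 0.75), 0)
            if remaining_words > 20:            # only keep a truncation that is useful
                truncated = " ".join(s["text"].split()[:remaining_words])
                kept.append({**s, "text": truncated + " ..."})
            break
    return kept
```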

Fallback prompts and verification prompts used in tests

Fallback prompts asked the model to provide a short explanation of why it could not answer if retrieval failed, offering contact instructions or a suggestion to escalate. Verification prompts required the model to list which snippet supported each answer line and to mark any claim without a direct citation as uncertain.
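
Illustrative wordings for both prompt types, paraphrased from the behaviour described above rather than copied from the video.

```python
# Fallback prompt (retrieval returned nothing usable) and verification prompt
# (ask the model to map each answer line to a supporting snippet).
FALLBACK_PROMPT = (
    "No relevant sources were found for this question. Briefly explain that you "
    "cannot answer it from the knowledge base, and suggest contacting support "
    "or escalating to a human agent."
)

VERIFICATION_PROMPT = (
    "For each line of your answer, list the source tag (e.g. [S2]) of the snippet "
    "that supports it. If a line has no direct supporting snippet, mark it as "
    "UNCERTAIN instead of citing a source."
)
```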

How to tune prompts based on retrieval quality

If retrieval returned noisy or tangential snippets you tightened retrieval parameters, increased chunk overlap, and asked the model to ignore low-confidence passages. When retrieval was strong, you shifted to more concise prompts focusing on answer synthesis. The tuning loop involved adjusting both retrieval thresholds and prompt instructions iteratively.

Test methodology, dataset, and evaluation criteria

Composition of the test dataset: FAQs, pricing data, third-party docs

The test dataset included internal FAQs, pricing tables with structured tiers, and third-party product documentation to mimic realistic variance. This mix tested both the semantic retrieval of general knowledge and the precise extraction of structured facts like numbers and policy details.

Design of test queries including ambiguous and complex questions

Queries ranged from straightforward factual questions to ambiguous and multi-part prompts that required synthesis across documents. You included trick questions that could lure models into plausible-sounding but incorrect answers to expose hallucination tendencies.

Metrics used: accuracy, hallucination rate, precision of citations, latency

Evaluation metrics included answer accuracy (binary and graded), hallucination rate (claims without supporting citations), citation precision (how directly a cited snippet supported the claim), and latency from user question to final answer. These metrics gave a balanced view of correctness, explainability, and performance.
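
A sketch of how those metrics might be computed from labeled results; the record structure and example labels are hypothetical stand-ins for the logs and review labels described in this section.

```python
# Compute accuracy, hallucination rate, citation precision, and median latency
# from a list of labeled test results.
from statistics import median
from typing import Dict, List

def score_run(results: List[Dict]) -> Dict[str, float]:
    n = len(results)
    claims = [c for r in results for c in r["claims"]]
    cited = [c for c in claims if c["cited"]]
    return {
        "accuracy": sum(r["correct"] for r in results) / n,
        "hallucination_rate": sum(not c["cited"] for c in claims) / len(claims),
        "citation_precision": sum(c["supports_claim"] for c in cited) / len(cited),
        "median_latency_s": median(r["latency_s"] for r in results),
    }

results = [
    {"correct": True, "latency_s": 1.2,
     "claims": [{"cited": True, "supports_claim": True}]},
    {"correct": False, "latency_s": 2.9,
     "claims": [{"cited": True, "supports_claim": False},
                {"cited": False, "supports_claim": False}]},
]
print(score_run(results))
```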

Manual vs automated labeling process and inter-rater checks

You used a mix of automated checks (matching returned claims against ground-truth snippets) and manual labeling for nuanced judgments like partial correctness. Multiple reviewers cross-checked samples to compute inter-rater agreement and to calibrate ambiguous cases.
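
As one way to run that check, a minimal Cohen's kappa for two reviewers labeling the same sample; the labels below are made up for illustration.

```python
# Cohen's kappa for two raters over the same items: observed agreement
# corrected for the agreement expected by chance.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

a = ["correct", "partial", "wrong", "correct", "correct", "partial"]
b = ["correct", "partial", "wrong", "partial", "correct", "partial"]
print(round(cohens_kappa(a, b), 2))  # ≈ 0.74 for this toy sample
```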

Number of rounds, consistency checks, and statistical confidence

You ran two main rounds to test baseline behavior and effects of tuning, replaying the same query set to measure consistency. You captured enough runs per query to compute simple confidence bounds on metrics and to flag unstable behaviors that depended on random seed or network conditions.

Round one results: observations and key examples

Qualitative observations for each setup

In round one, internal upload produced consistent citations and fewer hallucinations but required careful chunking. Make.com delivered flexible, often context-rich results when the orchestration was right, but latency and occasional formatting inconsistencies were noticeable. Vapi showed strong normalization and citation clarity out of the box, with competitive latency thanks to caching.

Representative successful answers and where they came from

Successful answers for pricing tables often came from internal upload when the embedding matched the exact table chunk. Make.com excelled when it aggregated multiple sources for a composite answer, such as combining FAQ text with live API responses. Vapi produced crisp, citation-rich summaries of third-party docs thanks to its normalization.

Representative hallucinations and how they manifested

Hallucinations typically manifested as confidently stated numbers or policy statements that weren’t present in the snippets. These were more common when retrieval returned marginally relevant passages or when the prompt allowed the model to “fill in” missing pieces. Make.com occasionally returned enriched text that introduced inferred claims during transformations.

Latency and throughput observations during the first round

Internal upload had the lowest median latency because it avoided external hops, though peak latency rose during heavy index queries. Make.com’s median latency was higher due to network round trips and orchestration steps. Vapi’s latency was competitive, with caching smoothing out repeat queries and lower variance.

Lessons learned and early adjustments before round two

You learned that stricter retrieval thresholds, more conservative prompt instructions, and better chunk metadata reduced hallucinations. For Make.com you added timeouts and better transformation rules. For Vapi you adjusted snippet counts and caching policies. These early fixes informed round two.

Round two results: observations and improvements

Changes applied between rounds and why

Between rounds you tightened prompt instructions, increased the minimum similarity threshold for retrieval, added verification prompts, and tuned Make.com transformations to avoid implicit inference. These changes were designed to reduce unsupported claims and to measure the setups’ ability to improve with conservative configurations.

How each setup responded to tuned prompts or additional context

Internal upload showed immediate improvement in citation precision because the stricter retrieval cut off noisy snippets. Make.com improved when you constrained transformations and returned raw passages instead of enriched summaries. Vapi responded well to stricter snippet limits and its normalized outputs made verification prompts more straightforward.

Improvement or regression in hallucination rates

Hallucination rates dropped across all setups, with the largest relative improvement for internal upload and Vapi. Make.com improved but still had residual hallucinations when transformation logic introduced inferred content. Overall, tightening the end-to-end pipeline reduced false claims significantly.

Edge case behavior observed with updated tests

Edge cases included long multi-part queries where context-window limitations forced truncation and partial answers; internal upload sometimes returned fragmented citations, Make.com occasionally timed out on complex aggregations, and Vapi sometimes over-normalized nuanced third-party language, smoothing out important qualifiers.

Final artifacts and test logs captured for reproducibility

You captured final logs, configuration manifests, prompt templates, and versioned indexes so others could reproduce the rounds. Test artifacts included sample queries, expected answers, and the exact responses from each setup along with timestamps and environment notes.

Conclusion

Summary of what the three tests revealed about tradeoffs

The tests showed clear tradeoffs: internal upload gives you the best control and provenance for small-to-medium corpora; Make.com gives integration flexibility and powerful transformation capabilities at the cost of latency and potential inference-related hallucinations; Vapi offers a balanced, lower-maintenance path with strong normalization and caching but introduces a dependency on a managed service.

Key decision points for teams choosing a KB integration path

Your decision should hinge on control vs convenience, dataset size, update frequency, and tolerance for external dependencies. If you need precise citations and full ownership, prefer internal upload. If you need orchestration across many external services and transformations, Make.com is compelling. If you want a managed retrieval layer with normalization and caching, Vapi is a strong choice.

Practical next steps to replicate the tests and adapt results

To replicate, prepare a heterogeneous KB, pin model and API versions, document environment variables, and run two rounds: baseline and tuned. Use the prompt templates and verification strategies you tested, collect logs, and iterate on retrieval thresholds. Start small and scale as you validate accuracy and latency tradeoffs in your environment.

Final considerations about maintenance, UX, and long-term accuracy

Think about maintenance burden—indexes need refreshing, transformation logic needs updating, and managed services evolve. UX matters: present citations clearly, handle “I don’t know” outcomes gracefully, and surface confidence. For long-term accuracy, build monitoring that tracks hallucination trends and automate re-ingestion or retraining of retrieval layers to keep your assistant trustworthy as content changes.

If you want to implement Chat and Voice Agents into your business to reduce missed calls, book more appointments, save time, and make more revenue, book a discovery call here: https://brand.eliteaienterprises.com/widget/bookings/elite-ai-30-min-demo-call