Building a Secure AI Chatbot for Internal Search: Lessons from NIST’s NCCoE

TL;DR:
- NIST’s NCCoE RAG chatbot enables internal search across its published cybersecurity guidance.
- The effort is documented in NIST IR 8579 (initial public draft), which covers technical decisions, observed limitations, and risk-informed safeguards.
NIST’s National Cybersecurity Center of Excellence (NCCoE) recently shared its experience of building the NCCoE Chatbot to steer researchers through a large library of publications. The technical report offers useful guidance for organisations planning similar AI projects in sensitive environments.
Why build an internal chatbot?
Researchers needed to pinpoint specific instructions buried inside thousands of pages of technical guidance. Traditional keyword search produced too many results and left users to trawl through PDFs. A locally deployed Retrieval-Augmented Generation (RAG) chatbot now lets staff pose natural-language questions and receive concise answers with page references.
The technical approach
RAG architecture
RAG pairs a language model with a search layer so the model only generates answers from retrieved text. The pipeline runs as follows:
- Document processing – Each publication is split into page-level JSON records.
- Query matching – At run-time the system finds the most relevant chunks.
- Response generation – The LLM drafts an answer using just those chunks.
- Citation tracking – Page numbers are returned so users can inspect the source.
For readers who want the details (skip if not): every page is turned into a JSON object that carries the file name, URL and page number. Pages are subdivided into 512-token chunks, embedded with all-mpnet-base-v2 (768-dimension vectors) and stored in a Chroma index. At query time the retriever pulls the top three chunks before prompting the Llama model. These figures were chosen to fit the 64 GB Tesla V100 GPUs while still giving the model enough context.
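The ingest-and-retrieve flow described above can be sketched in a few dozen lines. This is a minimal, self-contained illustration, not the NCCoE code: a toy hash-based embedding stands in for all-mpnet-base-v2, a plain Python list stands in for the Chroma index, and the whitespace tokeniser is a stand-in for a real one. The chunk size (512) and top-k (3) match the figures in the report.

```python
import math
import re

CHUNK_TOKENS = 512   # chunk size reported by the NCCoE team
TOP_K = 3            # chunks retrieved per query

def chunk_page(text, page_meta, chunk_tokens=CHUNK_TOKENS):
    """Split one page into fixed-size chunks, carrying page metadata along."""
    tokens = text.split()  # whitespace tokeniser stands in for a real one
    chunks = []
    for start in range(0, len(tokens), chunk_tokens):
        chunks.append({
            "text": " ".join(tokens[start:start + chunk_tokens]),
            **page_meta,  # file name, URL, page number travel with each chunk
        })
    return chunks

def embed(text, dim=768):
    """Toy bag-of-words hash embedding; a stand-in for all-mpnet-base-v2."""
    vec = [0.0] * dim
    for word in re.findall(r"\w+", text.lower()):
        vec[hash(word) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def top_k(query, index, k=TOP_K):
    """Return the k chunks whose embeddings are closest to the query (cosine)."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, c["vec"])), c) for c in index]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [c for _, c in scored[:k]]

# Build a tiny two-page index and query it.
pages = [
    ("Active scans are executed in nine steps.", {"file": "sp1800-30c.pdf", "page": 39}),
    ("Telehealth systems require network segmentation.", {"file": "sp1800-30c.pdf", "page": 12}),
]
index = []
for text, meta in pages:
    for chunk in chunk_page(text, meta):
        index.append({**chunk, "vec": embed(chunk["text"])})

hits = top_k("How do I execute active scans?", index)
```

Because each chunk keeps its file name and page number, the retrieved hits already carry the citation data the chatbot returns to users.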
Hardware and software stack
The chatbot runs on Tesla V100 GPUs (64 GB each) that were already available in-house. It is hosted on Ubuntu 20.04 inside a Python virtual environment. LlamaIndex provides the RAG scaffolding and Chroma stores embeddings.
Security considerations
Specific attention was given to common threats such as prompt injection, hallucinations (fabricated or unsupported output), sensitive data exposure, and unauthorised access.
Hallucination mitigation
After the RAG pipeline drafts a reply, the same Llama model is called again to confirm that every statement is supported by the retrieved text. If validation fails, the response is withheld.
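One way to structure that second pass is a yes/no validation call. This is a sketch, not the report's actual prompt or code: `ask_llm` is a hypothetical stand-in for a call to the local Llama model, and the stub below simply approves drafts that echo the source wording.

```python
def validate_response(draft, retrieved_chunks, ask_llm):
    """Ask the model whether every statement in `draft` is supported by the
    retrieved text; withhold the draft if validation fails."""
    context = "\n\n".join(retrieved_chunks)
    verdict = ask_llm(
        "Answer YES or NO: is every statement in the response below fully "
        f"supported by the source text?\n\nSources:\n{context}\n\nResponse:\n{draft}"
    )
    if verdict.strip().upper().startswith("YES"):
        return draft
    return None  # withheld: the caller shows a refusal instead

# Stub model that approves only drafts repeating the source's "nine steps".
def stub_llm(prompt):
    return "YES" if "nine steps" in prompt.split("Response:")[-1] else "NO"

source = ["Active scans run in nine steps."]
supported = validate_response("The scan has nine steps.", source, stub_llm)
unsupported = validate_response("The scan needs twelve steps.", source, stub_llm)
```

The key design point is that the validator sees only the retrieved chunks, so any claim the draft cannot trace back to them fails the check.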
Prompt injection and input validation
Incoming and outgoing text is sanitised, models are downloaded directly from Meta (securing the LLM supply chain), and only authenticated VPN users can reach the service. The team notes that sophisticated prompt-injection attacks remain possible and plans stronger guardrails.
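The report does not publish its filters, but basic input hygiene might look like the sketch below: strip control characters, cap length, and reject obvious injection phrasing. The limit and patterns here are illustrative assumptions; real guardrails need far broader coverage, as the team itself acknowledges.

```python
import re

MAX_QUERY_CHARS = 2000  # illustrative limit, not taken from the report

# Illustrative patterns only; real deployments need much broader coverage.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"system prompt", re.I),
]

def sanitise_query(raw):
    """Strip control characters, cap length, and reject obvious injection
    phrasing. Returns the cleaned query or raises ValueError."""
    cleaned = "".join(ch for ch in raw if ch.isprintable() or ch in "\n\t")
    cleaned = cleaned[:MAX_QUERY_CHARS].strip()
    if not cleaned:
        raise ValueError("empty query after sanitisation")
    for pattern in INJECTION_PATTERNS:
        if pattern.search(cleaned):
            raise ValueError("query rejected by injection filter")
    return cleaned
```

Pattern lists like this are easy to evade, which is exactly why the team treats them as one layer alongside local hosting and VPN-gated access rather than a complete defence.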
Data privacy and local deployment
No query leaves the NCCoE network, which removes the risk of external model training on sensitive data. Local hosting, however, demands rigorous patching and monitoring.
Evolving attack surface
Planned conversation memory will let the chatbot handle follow-up questions but could enable cross-session context attacks or unintended retention of sensitive text. Secure logging is also on the roadmap to improve visibility.
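Since the feature is only planned, the report gives no design, but one way to limit the cross-session and retention risks it names is to key memory by an unguessable session id and expire entries. The class below is a hypothetical sketch of that idea, not the NCCoE roadmap.

```python
import secrets
import time

class SessionMemory:
    """Per-session conversation store: memory is keyed by an unguessable
    session id and expires, so sensitive text is not retained indefinitely."""

    def __init__(self, ttl_seconds=1800):
        self._ttl = ttl_seconds
        self._store = {}  # session id -> (last_used, [turns])

    def new_session(self):
        sid = secrets.token_urlsafe(16)  # unguessable id blocks cross-session reads
        self._store[sid] = (time.monotonic(), [])
        return sid

    def append(self, sid, turn):
        _, turns = self._store[sid]  # KeyError for unknown or expired sessions
        turns.append(turn)
        self._store[sid] = (time.monotonic(), turns)

    def history(self, sid):
        self._expire()
        return list(self._store[sid][1])

    def _expire(self):
        """Drop sessions idle longer than the TTL."""
        now = time.monotonic()
        for sid in [s for s, (t, _) in self._store.items() if now - t > self._ttl]:
            del self._store[sid]

mem = SessionMemory()
sid = mem.new_session()
mem.append(sid, ("user", "What is SP 1800-30C about?"))
```

The TTL addresses unintended retention, while random ids ensure one session cannot enumerate or read another's context.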
Real-world performance
During testing the team asked: “What are the steps involved in executing active scans as described in NIST SP 1800-30C: Securing Telehealth Remote Patient Monitoring Ecosystem?”
The chatbot answered with a nine-step checklist that matched the original wording and included a hyperlink plus a page 39 citation. Meta AI and OpenAI's ChatGPT replied with abstract advice, omitted page numbers, and left out several tool-specific steps.
The authors highlight this example to show that RAG can deliver task-ready content in seconds and gives users immediate proof of accuracy.
Lessons for implementation
- Define the use case clearly – NCCoE focused on retrieval within its own publications.
- Match the model to the hardware – The team tried Llama 3.1 405B but dropped to the 70B variant after GPU tests.
- Design for your infrastructure – Hardware limits drive choices such as chunk size and top-k.
- Measure quality early – A 100-question test set and the RAGAS framework track answer quality.
- Think about user trust – Page-level citations make it easy to verify any response.
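A test harness like the one behind "measure quality early" can start very simply. The sketch below scores answers by keyword overlap as a stand-in for the RAGAS metrics the team actually used; `toy_bot`, the questions, and the 0.8 threshold are all invented for illustration.

```python
def keyword_recall(answer, expected_keywords):
    """Fraction of expected keywords that appear in the answer (case-insensitive)."""
    answer_lower = answer.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in answer_lower)
    return hits / len(expected_keywords)

def run_eval(chatbot, test_set, threshold=0.8):
    """Score each question and count how many answers meet the threshold."""
    results = []
    for question, keywords in test_set:
        score = keyword_recall(chatbot(question), keywords)
        results.append((question, score, score >= threshold))
    passed = sum(1 for _, _, ok in results if ok)
    return passed, results

# Toy chatbot and a two-question test set for illustration.
test_set = [
    ("What are the active scan steps?", ["nine", "page 39"]),
    ("Which GPU hosts the model?", ["V100"]),
]
toy_bot = lambda q: "Nine steps, see page 39." if "scan" in q else "A Tesla V100."
passed, results = run_eval(toy_bot, test_set)
```

Running a fixed question set like this before and after every pipeline change (new chunk size, new model, new prompt) makes quality regressions visible immediately.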
Future directions
The next release aims to add conversation memory, richer security logging, better multi-user performance and more guardrails against new attacks.
The bottom line
NIST shows that a secure, accurate RAG chatbot is achievable with careful engineering and frank acknowledgement of remaining risks. The public draft of NIST IR 8579 offers a valuable reference for any security team planning to bring AI search tools behind the firewall.