How Enterprises Can Overcome Security Challenges with LLMs
According to recent surveys, 81% of enterprises want to leverage AI tools like ChatGPT. After all, LLMs have offered a tantalizing taste of increased productivity. Yet only 21% of enterprises have LLMs deployed in production. Deploying LLMs in the enterprise presents unique challenges around data privacy, hallucinations, and resource constraints. In this post, we walk through how enterprises can overcome these obstacles and evaluate vendors.
For everyday users, data privacy doesn’t factor into the decision to use ChatGPT or Bard. The risks are far outweighed by the usefulness of the tool. For enterprises, this is a vastly different calculation.
To make LLMs truly useful in a business setting, the models need access to private and extremely valuable data about the business’ services, customers, processes, and more (depending on the use case). That data either needs to be encoded in the LLM’s weights via training, or passed in during inference via in-context learning.
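To make the second path concrete, here's a minimal sketch of in-context learning, assuming a hypothetical `call_hosted_llm` function standing in for whichever vendor API you use. The key point: everything in the prompt, including the internal excerpt, leaves your environment with the request.

```python
# Minimal sketch of in-context learning with a hosted model.
# `call_hosted_llm` is a placeholder for whatever vendor API you call.

INTERNAL_PROCEDURE = """
(excerpt copied from an internal customer-service runbook)
"""

def draft_onboarding_doc(call_hosted_llm) -> str:
    prompt = (
        "Write an onboarding doc for a new hire.\n"
        "Base it on these internal procedures:\n"
        f"{INTERNAL_PROCEDURE}"
    )
    # Everything in `prompt`, including the internal excerpt,
    # is now visible to the provider serving the model.
    return call_hosted_llm(prompt)
```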
With customer data, the concern is obvious - violating MSAs and privacy can have irreparable reputational and financial consequences. Here’s a simplified example of why this is dangerous even when customer data isn’t involved.
An employee at hotel chain X uses ChatGPT to help write an onboarding doc for a new hire explaining the chain's unique customer-service procedures. They copy-paste content from an internal document into the prompt. That data passes into OpenAI's training pipeline for the underlying model, GPT-4.
When an employee at hotel chain Y asks ChatGPT for ways to improve customer service, they get an answer synthesized from chain X's documents! Chain X's competitive advantage has leaked, and that could have long-term impacts on its business.
In more sophisticated use cases relying on semantic search and vector embeddings, the danger is 10x’d. In many of these solutions, entire documents are run through a provider’s LLMs for embedding, and then again when reranking search results.
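Here's a hedged sketch of where documents leave your control in a pipeline like that; `embed_api` and `rerank_api` are placeholders for third-party endpoints, not any specific vendor's SDK. Full document text crosses the wire at index time and again at query time.

```python
import math
from typing import Callable, List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-9)

def build_index(docs: List[str], embed_api: Callable) -> List[Tuple[str, List[float]]]:
    # Exposure point 1: every document is sent, in full, to the embedding provider.
    return [(doc, embed_api(doc)) for doc in docs]

def search(query: str, index, embed_api: Callable, rerank_api: Callable, top_k: int = 20):
    q_vec = embed_api(query)  # the query text is also sent out
    # The similarity search itself is local vector math...
    candidates = sorted(index, key=lambda d: cosine(q_vec, d[1]), reverse=True)[:top_k]
    # ...but exposure point 2: the candidate documents go out again for reranking.
    return rerank_api(query, [doc for doc, _ in candidates])
```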
If any of the models seeing your data are run by the large providers - OpenAI, Google, Cohere, Anthropic, or others - you may be at risk.
The unique concern here is that these models are massive and shared among millions of users and thousands of organizations. The data you pass in is being used not just for your benefit, but sometimes for the benefit of everyone.
We recommend choosing solutions built on open-source models tailored to your use case. It's much easier to ensure your privacy when the models aren't massively general-purpose and aren't serving the general public. Make sure your MSAs include provisions that prevent your sensitive data from leaking.
While it's more difficult to build products on your own models (compared to calling a public API), you should expect that the solutions you procure come from teams capable of training, hosting, and deploying these models for your use case. You'll typically need to ensure that the generation model, embedding model, and reranking model ALL meet your privacy obligations.
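As one example of what this can look like for the embedding step, here's a minimal sketch using an open-source model served entirely on your own hardware (the model name is illustrative; pick one that fits your accuracy and latency needs):

```python
# Sketch: keep embeddings in-house with an open-source model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # weights downloaded once, runs locally

def embed_documents(docs: list[str]):
    # Documents never leave your environment; vectors are computed on your own hardware.
    return model.encode(docs, normalize_embeddings=True)
```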
If you’re asking an LLM to do work for you, you need to know that it’s doing it accurately. LLMs have very little understanding of what they don’t know, and that leads to responses that seem right, but are wrong in the details.
If your team is using ChatGPT to help with tasks, be aware that its output needs to be carefully proofread and fact-checked.
Inaccurate information coming from LLMs, even if just for internal use, can have massive impacts on productivity and finances.
Employee X asks ChatGPT to write a document for a new hire explaining how to enable a feature in their software. The LLM synthesizes information from thousands of onboarding docs in its training data and writes a convincing but inaccurate walkthrough. With little proofreading, X sends this to the new hire, who later references this document when updating support documentation.
We recommend using in-context learning for the majority of tasks that rely on external data. When LLMs are passed data directly in the prompt, they are less likely to hallucinate; this has been demonstrated repeatedly in the literature. It also gives you clear visibility into the source data that answers are based on.
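A minimal sketch of this pattern, with `retrieve` and `call_llm` as placeholders for your search layer and model:

```python
def answer_with_context(question: str, retrieve, call_llm) -> str:
    # Pull the relevant source data first, then pass it directly in the prompt.
    passages = retrieve(question, top_k=5)
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    prompt = (
        "Answer the question using ONLY the sources below. "
        "If the sources do not contain the answer, say you don't know.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```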
You should also demand citations - when an LLM relies on a piece of source data, that source should be cited. Note that citations should not be generated by the LLM in its response (models can fabricate them); they should come from the retrieval layer and be displayed to users directly.
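One way to do this, sketched below with an illustrative passage shape: keep document metadata attached to each retrieved passage and surface it as the citation list, so what you show users reflects what was actually retrieved rather than what the model claims.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    text: str
    doc_id: str
    title: str

def answer_with_citations(question: str, retrieve, call_llm):
    passages = retrieve(question, top_k=5)
    context = "\n\n".join(p.text for p in passages)
    answer = call_llm(f"Sources:\n{context}\n\nQuestion: {question}")
    # Citations come from the retrieval results you control, not from the model's text.
    citations = [{"doc_id": p.doc_id, "title": p.title} for p in passages]
    return answer, citations
```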
Hallucinations can be mitigated through a variety of strategies, including better indexing and embedding models, better queries, confidence thresholds, and strong reranking models. If your provider can’t explain the techniques they are using to you, don’t buy the product!
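As a sketch of just one of those techniques, a confidence threshold on retrieval scores lets the system decline to answer rather than letting the model improvise; the threshold value below is illustrative and should be tuned on your own data.

```python
MIN_SCORE = 0.75  # illustrative; tune against your own evaluation set

def guarded_answer(question: str, retrieve_scored, call_llm) -> str:
    hits = retrieve_scored(question, top_k=10)  # returns (passage, score) pairs
    confident = [p for p, score in hits if score >= MIN_SCORE]
    if not confident:
        # Refuse rather than hallucinate when nothing scores high enough.
        return "I couldn't find reliable internal sources for this question."
    context = "\n\n".join(confident[:5])
    return call_llm(f"Answer only from these sources:\n{context}\n\nQuestion: {question}")
```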
A human should always be involved as a reviewer of LLM-generated outputs. We don't recommend giving LLMs the ability to send emails or write content without a qualified human reviewing it first.
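A simple sketch of that review gate (function and field names are illustrative): the LLM only produces drafts, and nothing goes out until a human approves it.

```python
def draft_email(call_llm, request: str) -> dict:
    # The model can only create a draft; it has no access to the send path.
    return {"body": call_llm(f"Draft an email: {request}"), "status": "pending_review"}

def approve_and_send(draft: dict, reviewer: str, send_email) -> None:
    # Only an explicit human approval moves a draft out the door.
    draft["status"] = "approved"
    draft["reviewed_by"] = reviewer
    send_email(draft["body"])
```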
LLMs are amazing - but difficult to build and deploy. The best models require machine learning and data science experience to build, and that talent is hard to find. Even if you're able to find the data and train an effective model, hosting it can be extremely challenging. For 30-40 billion parameter models, you'll need 8 A100 GPUs, or a rare H100. That isn't cheap, and it's difficult to get working reliably even once you've secured your compute.
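Rough, weights-only arithmetic shows why hosting is hard: at fp16 precision, a 40-billion-parameter model needs on the order of 80 GB for its weights alone, before accounting for the KV cache and activations.

```python
params = 40e9                 # 40B parameters
fp16_gb = params * 2 / 1e9    # 2 bytes per weight -> ~80 GB
int4_gb = params * 0.5 / 1e9  # 4-bit quantization -> ~20 GB
print(f"fp16 weights: ~{fp16_gb:.0f} GB, 4-bit weights: ~{int4_gb:.0f} GB")
```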
I won't dive into all of the individual battles you'd need to fight and win to deploy your own open-source LLM in this post, but I will note that smaller models can be trained for individual tasks without the hardware risk. Their performance will often surpass that of a larger, more general model on those specific tasks.
We recommend fine-tuning a commercially licensed, open-source LLM. This remains the most effective strategy for most teams. Based on our internal evaluations and publicly available data, the Llama 2, Falcon-40B, MPT-30B, and Flan-UL2 models work best on general reasoning tasks. You will need to quantize the model so it can run on available GPUs, and build a training and testing pipeline that lets you improve the model over time. You'll also need mechanisms for human feedback so you can understand how users feel about the model's responses.
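To make that concrete, here's a condensed sketch of the loading step: a quantized open-source base model plus LoRA-style parameter-efficient fine-tuning (the LoRA choice, model name, and hyperparameters are my assumptions for illustration, not something this post prescribes).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Llama-2-13b-hf"  # any commercially licensed base model

# Quantize to 4-bit so the model fits on the GPUs you can actually get.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)

# Train small adapter weights instead of the full model.
model = prepare_model_for_kbit_training(model)
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# From here: plug into your training/testing pipeline, and log human feedback on
# responses so the fine-tuning data improves over time.
```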