Creating personalized, fine-tuned large language models (LLMs) tailored to each user or organization is an ambitious goal for any applied AI platform. For early-stage teams working within budget constraints, however, this level of customization can seem unattainable: fine-tuning is expensive, especially for the most capable models (65B+ parameters).
Determined to find a solution, I began asking myself: how can I build an effective, personalized LLM that goes beyond Retrieval-Augmented Generation (RAG) context alone? I wanted it to integrate not only specific knowledge but also the expertise, processes, and unique requirements of each user.
Here’s the journey we’re taking to achieve this.
One option I considered was training a separate LLM for each user. For larger models, such as those with 65B+ parameters, this approach would be entirely unscalable.
This led me to experiment with smaller models, specifically micro-LLMs in the 1B–3B parameter range, such as the Llama 3B and 1B variants. The 1B model was quickly disqualified; it simply didn’t meet our intelligence needs, no matter how much context we fed it through RAG or fine-tuning.
The 3B model, on the other hand, showed some promise. It performed decently and could plausibly serve in a multi-agent orchestration system, perhaps with an OpenAI-inspired swarm of agents. However, further testing revealed that even the 3B model fell short on complex prompts, especially when multiple data sources had to be integrated to produce user-facing recommendations.
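For context, the orchestration idea looked roughly like the sketch below: a simple pipeline hands each sub-task to a specialist micro-agent, each backed by the same small model under a different system prompt. The agent roles here are hypothetical, and `call_llm` is a placeholder for whatever 3B inference endpoint you use:

```python
# Hypothetical micro-agent pipeline; the roles and prompts are illustrative.
AGENTS = {
    "extract": "You pull structured facts out of raw user data.",
    "summarize": "You condense retrieved documents into key points.",
    "recommend": "You turn facts and summaries into user-facing recommendations.",
}

def call_llm(system_prompt: str, user_prompt: str) -> str:
    # Placeholder: wire this to your small-model endpoint (e.g., a hosted 3B model).
    raise NotImplementedError

def run_pipeline(raw_data: str) -> str:
    facts = call_llm(AGENTS["extract"], raw_data)
    summary = call_llm(AGENTS["summarize"], facts)
    return call_llm(AGENTS["recommend"], f"Facts:\n{facts}\n\nSummary:\n{summary}")
```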
Also, scaling this approach at our current phase would demand dedicated DevOps and LLM engineers, along with a microservices-like architecture, adding further cost and complexity to the solution.
With micro-LLMs proving unsustainable, it was time to reconsider whether RAG alone with a large model like GPT-4o could provide sufficient personalization. Many applied AI platforms already rely on RAG systems that can be enhanced over time. However, with rapid advancements in the field, further improvements could be needed monthly—if not weekly. (Expect more on RAG in a future post!)
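For a sense of what that RAG-only baseline looks like, here’s a minimal sketch: retrieve the relevant documents, then hand them to the large model as context. The `retrieve` helper below is a toy stand-in for a real vector-store lookup, and the prompt format is purely illustrative:

```python
from openai import OpenAI

client = OpenAI()

# Toy corpus and retriever; a real system would use embeddings + a vector store.
DOCS = [
    "Acme's refund window is 30 days.",
    "Support hours are 9am-5pm ET.",
    "Enterprise plans include SSO and audit logs.",
]

def retrieve(query: str, top_k: int = 3) -> list[str]:
    # Naive keyword-overlap scoring, purely for illustration.
    scored = sorted(DOCS, key=lambda d: -sum(w in d.lower() for w in query.lower().split()))
    return scored[:top_k]

def rag_answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content
```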
While RAG alone was an option, I continued exploring—and then I came across LoRA.
LoRA, or Low-Rank Adaptation, offers an innovative approach to LLM fine-tuning. In simple terms, LoRA freezes the base model’s weights and trains a small set of low-rank matrices that attach to selected layers. Because this separate “LoRA layer” is stored apart from the base model, it can be swapped out easily, enabling customization per user or company without retraining the entire LLM. It also keeps training data compartmentalized, with no leakage between one customer’s LoRA layer and another’s.
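To make this concrete, here’s a minimal sketch of attaching and saving a per-user adapter with Hugging Face’s PEFT library. The model name, rank, and target modules below are illustrative assumptions, not our production settings:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Load the frozen base model (model name is illustrative).
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")

# LoRA injects small low-rank matrices into the chosen projection layers;
# r and lora_alpha are common starting points, not tuned values.
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the base weights

# ... run a standard fine-tuning loop on this user's data ...

# Only the adapter weights are saved: megabytes per user, not a full model copy.
model.save_pretrained("adapters/user_123")
```

Because only the adapter is trained and stored, one base model can back many users, each with their own swappable layer.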
The preliminary results for this LoRA-enhanced architecture have been promising. It provides an efficient way to deliver personalized LLMs without the extensive infrastructure or costs typically associated with traditional fine-tuning.
Here’s an overview of the solution architecture for user- and company-level customization, which combines LoRA, RAG, and a primary LLM to deliver highly personalized outputs while safeguarding data privacy:
These components working together produce a personalized and secure output tailored to each user’s context.
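As a hedged sketch of that flow, here’s roughly what a per-request pipeline could look like with PEFT. The adapter paths, prompt format, and `retrieve` callable are hypothetical stand-ins, not our actual service code:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# One shared base model serves every user (model name is illustrative).
BASE_ID = "meta-llama/Llama-3.2-3B"
tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID)

# Attach one user's LoRA layer; more can be hot-loaded alongside it.
model = PeftModel.from_pretrained(base, "adapters/user_123", adapter_name="user_123")
model.load_adapter("adapters/user_456", adapter_name="user_456")

def answer(user_id: str, query: str, retrieve) -> str:
    """Route a request through the caller's own LoRA layer plus RAG context."""
    model.set_adapter(user_id)  # swap in this user's fine-tuned weights
    context = "\n".join(retrieve(query))  # retrieve() is a placeholder vector-store lookup
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=200)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```

Swapping adapters is a cheap in-memory operation, which is what makes per-user customization viable on a single deployed base model.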
While LoRA offers us a stable approach, we’re also keeping an eye on LLaMA-Adapter, an alternative introduced in 2023 that shows promise and could offer even better performance. However, it’s still too early to consider it for production use, and for now, LoRA remains our chosen approach.
With this architecture, we’re excited about the possibilities ahead as we continue to explore new methods to enhance personalization and user experience.