Creating personalized, fine-tuned large language models (LLMs) tailored to each user or organization is an ambitious goal for any applied AI platform. For early-stage teams working within budget constraints, however, this level of customization can seem unattainable: fine-tuning is expensive, especially for the most capable models (65B+ parameters).
Determined to find a solution, I began asking myself: how can I build an effective, personalized LLM that goes beyond Retrieval-Augmented Generation (RAG) context alone? I wanted it to integrate not only specific knowledge but also the expertise, processes, and unique requirements of each user.
Here’s the journey we’re taking to achieve this.
One option I considered was training a separate LLM for each user. For larger models, such as those with 65B+ parameters, this approach would be entirely unscalable.
This led me to experiment with smaller models, specifically micro-LLMs in the 1B–3B parameter range, such as the Llama 3B and 1B variants. The 1B model was quickly disqualified; it simply didn’t meet our intelligence needs, no matter how much context we fed it through RAG or fine-tuning.
The 3B model, on the other hand, showed some promise. It performed decently and could plausibly serve in a multi-agent orchestration system, perhaps with an OpenAI-inspired swarm of agents. However, further testing revealed that even the 3B model fell short on complex prompts, especially when multiple data sources had to be integrated to produce user-facing recommendations.
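For context, the orchestration idea looked roughly like the sketch below: a simple pipeline hands each sub-task to a specialist micro-agent, each backed by the same small model under a different system prompt. The agent roles here are hypothetical, and `call_llm` is a placeholder for whatever 3B inference endpoint you use:

```python
# Hypothetical micro-agent pipeline; the roles and prompts are illustrative.
AGENTS = {
    "extract": "You pull structured facts out of raw user data.",
    "summarize": "You condense retrieved documents into key points.",
    "recommend": "You turn facts and summaries into user-facing recommendations.",
}

def call_llm(system_prompt: str, user_prompt: str) -> str:
    # Placeholder: wire this to your small-model endpoint (e.g., a hosted 3B model).
    raise NotImplementedError

def run_pipeline(raw_data: str) -> str:
    facts = call_llm(AGENTS["extract"], raw_data)
    summary = call_llm(AGENTS["summarize"], facts)
    return call_llm(AGENTS["recommend"], f"Facts:\n{facts}\n\nSummary:\n{summary}")
```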
Also, scaling this approach at our current phase would demand dedicated DevOps and LLM engineers, along with a microservices-like architecture, adding further cost and complexity to the solution.
With micro-LLMs proving unsustainable, it was time to reconsider whether RAG alone with a large model like GPT-4o could provide sufficient personalization. Many applied AI platforms already rely on RAG systems that can be enhanced over time. However, with rapid advancements in the field, further improvements could be needed monthly—if not weekly. (Expect more on RAG in a future post!)
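For a sense of what that RAG-only baseline looks like, here’s a minimal sketch: retrieve the relevant documents, then hand them to the large model as context. The `retrieve` helper below is a toy stand-in for a real vector-store lookup, and the prompt format is purely illustrative:

```python
from openai import OpenAI

client = OpenAI()

# Toy corpus and retriever; a real system would use embeddings + a vector store.
DOCS = [
    "Acme's refund window is 30 days.",
    "Support hours are 9am-5pm ET.",
    "Enterprise plans include SSO and audit logs.",
]

def retrieve(query: str, top_k: int = 3) -> list[str]:
    # Naive keyword-overlap scoring, purely for illustration.
    scored = sorted(DOCS, key=lambda d: -sum(w in d.lower() for w in query.lower().split()))
    return scored[:top_k]

def rag_answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content
```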
While RAG alone was an option, I continued exploring—and then I came across LoRA.
LoRA, or Low-Rank Adaptation, offers an innovative approach to LLM fine-tuning. In simple terms, LoRA freezes the base model’s weights and trains a small set of low-rank matrices that attach to selected layers. Because this separate “LoRA layer” is stored apart from the base model, it can be swapped out easily, enabling customization per user or company without retraining the entire LLM. It also keeps training data compartmentalized, with no leakage between one customer’s LoRA layer and another’s.
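To make this concrete, here’s a minimal sketch of attaching and saving a per-user adapter with Hugging Face’s PEFT library. The model name, rank, and target modules below are illustrative assumptions, not our production settings:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Load the frozen base model (model name is illustrative).
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")

# LoRA injects small low-rank matrices into the chosen projection layers;
# r and lora_alpha are common starting points, not tuned values.
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the base weights

# ... run a standard fine-tuning loop on this user's data ...

# Only the adapter weights are saved: megabytes per user, not a full model copy.
model.save_pretrained("adapters/user_123")
```

Because only the adapter is trained and stored, one base model can back many users, each with their own swappable layer.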
The preliminary results for this LoRA-enhanced architecture have been promising. It provides an efficient way to deliver personalized LLMs without the extensive infrastructure or costs typically associated with traditional fine-tuning.
Here’s an overview of the solution architecture for user- and company-level customization, which combines LoRA, RAG, and a primary LLM to deliver highly personalized outputs while safeguarding data privacy:
These components working together produce a personalized and secure output tailored to each user’s context.
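As a hedged sketch of that flow, here’s roughly what a per-request pipeline could look like with PEFT. The adapter paths, prompt format, and `retrieve` callable are hypothetical stand-ins, not our actual service code:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# One shared base model serves every user (model name is illustrative).
BASE_ID = "meta-llama/Llama-3.2-3B"
tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID)

# Attach one user's LoRA layer; more can be hot-loaded alongside it.
model = PeftModel.from_pretrained(base, "adapters/user_123", adapter_name="user_123")
model.load_adapter("adapters/user_456", adapter_name="user_456")

def answer(user_id: str, query: str, retrieve) -> str:
    """Route a request through the caller's own LoRA layer plus RAG context."""
    model.set_adapter(user_id)  # swap in this user's fine-tuned weights
    context = "\n".join(retrieve(query))  # retrieve() is a placeholder vector-store lookup
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=200)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```

Swapping adapters is a cheap in-memory operation, which is what makes per-user customization viable on a single deployed base model.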
While LoRA offers us a stable approach, we’re also keeping an eye on LLaMA-Adapter, an alternative introduced in 2023 that shows promise and could offer even better performance. However, it’s still too early to consider it for production use, and for now, LoRA remains our chosen approach.
With this architecture, we’re excited about the possibilities ahead as we continue to explore new methods to enhance personalization and user experience.