By Stefan Fernandes

“In a world where machines are becoming artists, composers, and storytellers, generative AI holds the key to unlocking unparalleled creativity. But as we dive deeper into this exciting frontier, a critical decision awaits: Do you embrace the DIY spirit and self-host your generative AI, or do you seek refuge in the realm of cloud partners, letting them shoulder the burden while you focus on the magic?” – ChatGPT

The generative AI (genAI) community is in the midst of an arms race between the tech giants and open source platforms. Don’t just take our word for it: a leaked Google memo from an unnamed researcher details how open source has solved many of the issues that the tech companies have struggled with. There is no “secret sauce” that strongly differentiates the proprietary models and, whilst the fire may have been lit by these first-generation models, there is no reason why community-led solutions can’t continue to pour gasoline on the flames.

The generative artificial intelligence industry is young and expanding at a remarkable pace, picking up where Moore’s Law left off – when Hollywood is worried, as news of the latest scriptwriters’ strike demonstrates, you know you’ve already captured the wider public’s interest. In our experience, it has become almost a foregone conclusion that the solutions we deliver this month will be outdated by the next, as chat forums buzz with announcements of newer technologies.

As a quick explainer, most generative AI models, like ChatGPT, DALL-E and Bard, typically exist as black boxes that are proprietary to the companies that built them. Therefore, to interact with the model, your prompt is sent via an API to the company’s servers, where a response is generated and subsequently returned to you*, much like how a web server works. Self-hosting, in contrast, takes this process ‘offline’: the model sits within your own infrastructure, allowing developers to freely tinker with model parameters and generate inferences in-house.

*Some companies are now starting to offer the option to host proprietary models within your own environment (although still as a black box), which can negate some of the issues discussed later. However, to keep things simple, in this article we have defined proprietary models in this way.
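To make the contrast concrete, here is a minimal sketch of the hosted-API pattern, using OpenAI’s public chat completions endpoint as the illustration; the model name and prompt are placeholder assumptions.

```python
# A minimal sketch of the hosted-API pattern: the prompt leaves your
# infrastructure and inference happens inside the provider's black box.
import os
import requests

API_URL = "https://api.openai.com/v1/chat/completions"

def remote_completion(prompt: str) -> str:
    # The prompt is sent to the provider's servers via an API call...
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": "gpt-3.5-turbo",  # illustrative model choice
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    response.raise_for_status()
    # ...and the generated text is returned to you, much like a web server response.
    return response.json()["choices"][0]["message"]["content"]

print(remote_completion("Explain self-hosting in one sentence."))
```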

Whilst proprietary models have had the limelight, open-source alternatives are becoming more numerous. There is a growing availability of open-source models, with LLaMA and GPT-J to name two popular choices, and of libraries, like Llama.cpp and langchain, that are wholly community driven, making self-hosting an alluring possibility. No doubt there will be a proliferation of bespoke LLMs and adapters tailored to novel use cases.
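The self-hosted equivalent, sketched below with the community llama-cpp-python bindings around Llama.cpp, keeps the whole round trip on your own hardware; the model file path and sampling parameters are placeholder assumptions.

```python
# A minimal self-hosting sketch: the weights live on your own disk and
# inference runs entirely in-house, so prompts never leave your infrastructure.
from llama_cpp import Llama

# Placeholder path to a locally downloaded, quantised GGUF model file.
llm = Llama(model_path="./models/llama-7b.Q4_K_M.gguf", n_ctx=2048)

output = llm(
    "Explain self-hosting in one sentence.",
    max_tokens=128,
    temperature=0.7,  # freely tunable: you control the model parameters
)
print(output["choices"][0]["text"])
```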

So the question is: like the talented Karate Kid in need of some spiritual guidance, is it better to manage and host your own LLMs, or to entrust the job to a cloud provider, like a learned Mr. Miyagi? Let’s dig into the details.

Building for scale

Heard of that new model with a few billion parameters, or was it a trillion? Yes, our AI overlords are becoming increasingly large, with neural connection counts that invite comparison to the human brain. However, unlike Mother Nature, who has perfected the art of frugality, LLMs are still intensely power hungry. To avoid Chernobyl-style server meltdowns, organisations must be mature enough to meet GPU costs and have the skills required to scale up tech solutions. Cloud providers, on the flip side, typically charge a fee for the service, billed by the number of tokens an LLM processes. Before a CFO starts tightening the ‘token’ purse strings, it is worth assessing whether in-house computing and maintenance costs outweigh the cost of an external provider’s service. This is not an easy question to answer and, as models continue to get larger, it may become infeasible (and environmentally irresponsible) for every organisation to run its own megaserver. However, it stands to reason that costs will continue to decrease in the long run and, as the industry scales, there will be demand for more efficient chips. Nvidia’s success in recent years is only an indication of things to come.
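As a rough illustration of that assessment, here is a back-of-the-envelope break-even calculation; every figure below is a hypothetical assumption, not a quoted price.

```python
# A toy break-even sketch: hosted per-token billing vs self-hosted fixed costs.
# All numbers are hypothetical assumptions for illustration only.
TOKENS_PER_MONTH = 50_000_000        # expected monthly usage (assumed)
API_PRICE_PER_1K_TOKENS = 0.002      # hosted per-token rate in USD (assumed)
GPU_COST_PER_MONTH = 2_500           # GPU rental/amortisation (assumed)
OPS_COST_PER_MONTH = 4_000           # engineering and maintenance (assumed)

api_bill = (TOKENS_PER_MONTH / 1_000) * API_PRICE_PER_1K_TOKENS
self_host_bill = GPU_COST_PER_MONTH + OPS_COST_PER_MONTH

print(f"Hosted API:  ${api_bill:,.0f}/month")
print(f"Self-hosted: ${self_host_bill:,.0f}/month")

# Below the break-even volume the API wins; above it, self-hosting can pay off.
break_even_tokens = (self_host_bill / API_PRICE_PER_1K_TOKENS) * 1_000
print(f"Break-even at roughly {break_even_tokens:,.0f} tokens/month")
```

Under these assumed numbers, the hosted API is far cheaper at modest volumes, and self-hosting only pays off in the billions of tokens per month – which is exactly why the question deserves a proper costing exercise rather than a gut call.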

It will always be unrealistic for non-tech companies to train models from scratch, as the scale of compute is just too large – it is reasonable to see these proprietary infrastructure advancements as natural monopolies. Whilst you may be tied to a certain commercial model for your solution needs (with a risk of model deprecation), both the upgrade of existing models and the release of entirely new ones are an inevitability, given the luxury of scale and the commercial imperative that these tech firms have. There are also a number of smart architectural designs that may give these models an edge for now. It is rumoured that GPT-4 deploys a mixture-of-experts (MoE) model that uses ensemble techniques, relying on many specialised sub-models, each serving its own purpose, to drive inference.
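For intuition, here is a toy sketch of the mixture-of-experts idea: a learned gate routes each input across specialised experts and blends their outputs. It is a conceptual illustration only, not GPT-4’s unpublished architecture.

```python
# A toy mixture-of-experts: a softmax gate weights the outputs of several
# specialised experts. Shapes and weights here are arbitrary illustrations.
import numpy as np

rng = np.random.default_rng(0)
# Three 'experts', each just a random linear map in this toy example.
experts = [lambda x, W=rng.standard_normal((4, 4)): x @ W for _ in range(3)]
gate_W = rng.standard_normal((4, 3))

def moe_forward(x: np.ndarray) -> np.ndarray:
    logits = x @ gate_W
    weights = np.exp(logits) / np.exp(logits).sum()   # softmax gate
    # Each expert 'serves its own purpose'; the gate decides who speaks loudest.
    return sum(w * expert(x) for w, expert in zip(weights, experts))

print(moe_forward(rng.standard_normal(4)))
```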

Right model for the right use case

There are other, more substantive questions in model design that we should also bear in mind. Here are a few considerations we’ve encountered:

  • Who are the users of the model and how do they expect to interact with it?
  • What is the expected token usage?
  • What are the latency requirements?
  • How complex is the use case and does it require more powerful models?
  • Who will be responsible for managing or maintaining this solution?
  • Is sensitive or proprietary data involved?
  • How important are guardrails for safety and against model hallucination?

Model selection is key; however, it’s easy to be overwhelmed by the sheer volume of choice out there. Before decision paralysis sets in, it can be easier to rely initially on commercial tech, whose modus operandi is to produce ethically responsible and well-tested releases that are regularly iterated upon – safe in the knowledge that no tech company wants to be responsible for the next Skynet. Reinforcement learning from human feedback (RLHF) is a costly process, involving many thousands of hours of human effort to sanitise and censor outputs so that models are usable for public consumption. Ensuring guardrails on the technology is the responsible thing to do, and realistically only the tech firms have the capability to do it well. The guidance and curated knowledge that they bring can be invaluable.

Supporting newly emerging training and fine-tuning architectures

However, for the anarchists amongst us, this may be too great an imposition to bear. After all, shouldn’t technology, like the creative arts, be afforded the same artistic freedom? The renaissance of generative AI owes its success, in no small part, to the invigoration of a community that has not traditionally participated in the technosphere. Tinkering with customisable models is just part of the intrigue. Adapters and techniques like LoRA, shorthand for ‘low-rank adaptation’, have made it possible to fine-tune large models efficiently and quickly by freezing the existing model parameters and training on much smaller, more task-specific sets of data. Think of it like chiselling new features into Michelangelo’s David, instead of taking a sledgehammer to the marble and starting from scratch. The modularity of each adapter leads to improvements in the model that accumulate over time, with the added benefit of being both time and cost effective.
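A minimal sketch of that workflow, using Hugging Face’s peft library; the base model choice and hyperparameters below are illustrative assumptions, not a recommended recipe.

```python
# A minimal LoRA set-up: the base weights are frozen and only small
# low-rank adapter matrices are trained on task-specific data.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in base model

config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the adapter weights
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
)

model = get_peft_model(base, config)
# The chisel, not the sledgehammer: only a tiny fraction of parameters train.
model.print_trainable_parameters()
```

Because only the adapter matrices are trained, the resulting artefact is measured in megabytes rather than gigabytes, and separate adapters can be swapped in and out of the same base model per use case.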

This is great for domain-specific applications, where bespoke data and even a modest computing set-up can create solutions that are more customisable and scalable than what larger, off-the-shelf behemoths can offer. New data can be continually folded into a model, avoiding the risk of it presenting out-of-date information, something that can plague commercial releases trained on data with cut-off dates. Coding assistants, for example, can benefit hugely from this, as there is always an imperative to have the most up-to-date information.

Biases within LLMs will also inevitably exist, given the corpus of training data and any RLHF censoring procedures put in place. Open-source solutions can give more control over this process and can avoid potential over-censorship, opting instead to use a company’s own ethical and governance framework to guide the outputs of a model. This, of course, is fraught with its own dangers and can introduce a whole new set of biases to contend with. The perils of a potty-mouthed LLM can have significant reputational and ethical implications that should not be taken lightly, and we all have a responsibility to produce solutions that are safe and well-intentioned. This is a Pandora’s box; let’s tread with the utmost care.

Your model, your data

There is also a question of privacy. By agreeing to the use of commercial LLMs, it is quite normal for your data to be used for the purposes of model development and, whilst it is unlikely that a nosey AI will air your dirty laundry, it remains a possibility, especially in extremely niche subject areas. Where data sovereignty is a top priority, self-hosted models may be the better option.

Having said this, it is in the business interest of cloud companies to keep your data safe and secure, and reading the small print may help alleviate your concerns. It has also been encouraging to see big tech taking a proactive role in calling for governments to start regulating the space. No one yet knows how this may look, and there are still deep ethical questions regarding data stewardship, freedom of use and safety within AI. Preempting and avoiding the doomsday scenarios foretold by (too) many leading figures in the field can give some confidence that the industry is taking things seriously.

The right to (truly) open AI?

There is a compelling argument that LLMs, much like the internet, should be accessible to all who wish to use them. The open-source community has been part and parcel of much of the recent innovation and, like any great scientific endeavour, new knowledge is built on the work of others. BLOOM, the BigScience Large Open-science Open-access Multilingual Language Model, trained on 1.6 TB of pre-processed multilingual text, is a great example of hundreds of researchers collaborating to bring freely accessible and transparent models to the market. Not to mention, it even rivals OpenAI’s famous GPT-3 in size. Meta has also committed to the cause by releasing LLaMA to the public, and even if this may have been a commercially driven decision, it has undoubtedly opened the floor.

Whilst all this may point to a reduced dependency on the cloud providers, they still have a crucial part to play as technology incubators and thought leaders, shepherding the way to greener, silicon-based pastures. The advent of AI will soon upend many parts of the world economy, and we will need experts to provide checks and balances that ensure safety, responsible use and fairness. This is a challenge readily accepted by both the private and open-source sectors, which see opportunity and value in working together. Providing safeguarded, well-integrated platforms that protect the primacy of data privacy and ownership can avoid many of the moral hazards that could trip us up. Legislation and regulation will also form part of the picture, and it will be up to big tech and the AI community at large to pave the way.

Conclusion

The race to build better, more formidable models will be run by both private industry and open source. It would be wrong to frame this as a direct conflict between the two and, as a data company ourselves, we see real strength in all players embracing each other to unlock the potential of advanced AI. Whether your organisation requires fast, reliable and safeguarded solutions in the form of proprietary models, or takes the reins via self-hosting to maintain data sovereignty, control and tunability, the generative AI community will continue to go from strength to strength.