
By Katinka Gereb, Arvin Lim, Jim Hortle, James Dunwoody, Alex Essebier and Agnes Abramczyk


Artificial intelligence has evolved significantly in the past year since the introduction of ChatGPT. In the subsequent months, a surge of diverse language models has inundated the landscape, ranging from open-source initiatives to proprietary advancements. It’s astonishing to reflect on how GenAI projects and applications constructed atop this cutting-edge technology have proliferated within such a short span.

Over the course of this year, we have successfully delivered multiple Generative AI (GenAI) projects at Mantel Group. As a relatively young field, GenAI has presented us with exciting challenges and opportunities. In light of the valuable insights and experience we have gained from these projects, we are eager to share our knowledge.

1. Identifying the right model and technology

Choosing the right model in GenAI involves navigating through various options, considering factors like open source versus proprietary models, sizing, and fine-tuning. The initial decision of whether GenAI is genuinely required is crucial, especially when adopting a technology-first approach. Alternatives to GenAI, such as traditional NLP or ML, should be explored before committing to a specific approach or model.

Closed-source models, like GPT-3.5 Turbo and Claude, currently outperform open-source models, offering better response quality and removing the need to set up, manage, and pay for hosting infrastructure in your own environment, which for large open-source models can be both complex and costly. Quick experimentation can also begin immediately, without waiting for infrastructure to be provisioned. Choosing an open-source model, by contrast, involves a more extended process of infrastructure setup, data collection, and experimentation, owing to the novelty of the technology.

Low-Rank Adaptation of Large Language Models (LoRA) is a training technique designed to expedite the training of substantial models with reduced memory consumption. LoRA fine-tuning techniques, particularly with limited data, appear more suited for adjusting existing model behaviour, rather than teaching the model new capabilities. They prove effective in tasks like formatting responses or adjusting the tone of the model’s output.
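As a rough illustration of the idea (a numpy sketch, not the training setup of any particular project), LoRA freezes the pretrained weight matrix and learns only a low-rank update, so far fewer parameters are trained:

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r, alpha = 64, 64, 4, 8  # rank r is much smaller than d; alpha scales the update

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable, rank r
B = np.zeros((d_out, r))               # trainable, initialised to zero

# Effective weight at inference: the original plus a low-rank delta.
W_adapted = W + (alpha / r) * (B @ A)

# Trainable parameters drop from d_out * d_in to r * (d_in + d_out).
full_params = d_out * d_in        # 4096
lora_params = r * (d_in + d_out)  # 512
```

Because `B` starts at zero, the adapted model initially behaves exactly like the base model, and training only nudges it, which is why LoRA suits adjusting style and tone more than teaching new knowledge.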

For tasks involving knowledge retrieval from static information, LLM retrieval from vector databases proves to be more suitable.
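A minimal sketch of the retrieval step, using hand-picked stand-in vectors and cosine similarity in place of a real embedding model and vector database:

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=2):
    """Return indices of the k most similar documents by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(scores)[::-1][:k]

docs = ["refund policy", "shipping times", "warranty terms"]
# Stand-in embeddings; a real system would use an embedding model.
doc_vecs = np.array([[1.0, 0.1], [0.2, 1.0], [0.9, 0.3]])
query_vec = np.array([1.0, 0.2])

hits = [docs[i] for i in top_k(query_vec, doc_vecs)]  # → ['refund policy', 'warranty terms']
```

The retrieved passages are then inserted into the LLM prompt, so answers are grounded in the static knowledge base rather than the model's pre-training.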

One of our pivotal takeaways underscores the importance of starting with the largest and most performant model available. It is recommended that experiments begin with more substantial models, such as Llama 70B, before exploring smaller counterparts like 7B. This strategic approach is aimed at minimising the risk of investing time in fine-tuning smaller models, only to realise that the concept works seamlessly on a larger scale. Getting the solution working on a big model proves that the concept is possible and reduces the problem to an engineering one: costs can then be driven down through smarter inference strategies or by replicating the results on a smaller model.

2. Data requirements

In line with any other machine learning project, the importance of having clean and high-quality data cannot be overstated during model training. Utilising an LLM to augment the dataset with additional examples (synthetic data) is considered a valid approach, particularly in cases where obtaining clean data is challenging. Nevertheless, depending on the same prompt for all data may introduce patterns or similarities in the final output. It is advisable to incorporate variability in prompts or instruct the LLMs to diversify their outputs.
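One simple way to add that variability is to rotate through several prompt templates when generating synthetic examples. The templates and topic below are hypothetical placeholders:

```python
import random

random.seed(0)

# Rotating through several phrasings reduces the chance that every
# synthetic example inherits the same structure from a single prompt.
templates = [
    "Write a customer question about {topic}.",
    "Draft a short, informal query regarding {topic}.",
    "Compose a detailed question a new user might ask about {topic}.",
]

def synthesis_prompt(topic):
    return random.choice(templates).format(topic=topic)

prompts = [synthesis_prompt("billing") for _ in range(5)]
```

Each prompt would then be sent to the LLM to generate one synthetic training example; instructing the model to vary its phrasing helps further.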

In LoRA fine-tuning, the emphasis on clean and consistently styled input text is pivotal due to the nature of the task. The model relies on refined input to grasp and reproduce specific stylistic nuances accurately. The consistency in style and tone ensures that the model can effectively learn and adapt to the desired characteristics, resulting in more coherent and contextually appropriate outputs. On the other hand, prompt engineering, while not dependent on training data, provides valuable insights when assessed with a variety of inputs. This diversity allows for a comprehensive evaluation of the model’s responsiveness to different prompts, helping refine and optimise its performance across a spectrum of scenarios and queries.

While data suitable for LLMs should be clean and in natural language, additional information like dictionaries and metadata may be needed for the LLM to perform effectively at a high level. In use cases where extensive data requirements are encountered, preprocessing and formatting layers are necessary to ensure proper interpretation of language by the model. Metadata, capturing various aspects like data quality, topics, and speech styles, plays a crucial role in selecting curated training datasets. It is also important to consider business jargon, acronyms, and context-specific information for such tasks.

3. Infrastructure set-up

When establishing infrastructure for model training across cloud platforms, opting for regular virtual machine instances (e.g. EC2) instead of platform-specific services (e.g. SageMaker) can offer a wider range of machine options at a potentially more cost-effective rate. Utilising the robust remote development capabilities of tools like VS Code through SSH provides seamless access to various virtual machines, granting flexibility in terms of computing power and disk size for model training. This approach ensures a familiar development environment for data scientists and engineers. On the other hand, leveraging platform-specific services might be preferable for their ease-of-use, providing integrated machine learning tooling within the environment. It’s worth noting that certain GPU instance types may have limited availability for specific services in certain regions, whereas general-purpose virtual machines often offer broader regional availability for these GPU types.

While developing or running models, it is recommended to use nvitop, a GPU and CPU monitoring command-line tool that allows for preemptive monitoring of memory usage and avoiding crashes without the need for extensive code instrumentation. This is especially handy for checking whether you are running into VRAM limitations as you run larger LLMs.

4. Price optimisation

For hosting models on virtual machine instances, it’s advisable to use the smallest instance that accommodates the desired model sizes. It’s crucial to ensure these instances are turned off when not in use to manage costs effectively. When utilising APIs, such as the OpenAI API, optimising for the smallest output token and context window size helps minimise expenses. Where possible, remember to set the maximum output token length to be as small as possible.
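As a sketch of what that looks like in practice, the request parameters below follow the shape of the OpenAI chat completions API; the model name, prompt, and limits are placeholder assumptions, not recommendations, and no API call is made:

```python
# Illustrative request parameters only; values are placeholders.
request = {
    "model": "gpt-3.5-turbo",
    "messages": [
        {"role": "system", "content": "Answer in one short sentence."},
        {"role": "user", "content": "Summarise our refund policy."},
    ],
    "max_tokens": 60,    # cap output length: you pay for every generated token
    "temperature": 0.1,  # low temperature for more repeatable outputs
}
```

Keeping the system prompt terse and the output cap tight shrinks both the context window and the generated tokens, the two drivers of per-request cost.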

Although starting with the largest models is useful for experimentation, once an acceptable baseline of performance has been reached, it is important to try and determine whether the same level of performance can be achieved with a smaller (and often cheaper) model. The additional costs required to run large models can be substantial. For example, without quantisation, running a 70B model requires more than 48GB of VRAM, whereas a 30B model can fit comfortably on a 24GB GPU. A proactive approach is needed to ensure that the utilisation of powerful models aligns with the allocated budget and avoids any unwarranted financial strain.
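A back-of-envelope estimate of weight memory makes these sizing decisions concrete. This counts model weights only and ignores the KV cache, activations, and framework overhead, so treat the numbers as lower bounds:

```python
def weight_memory_gb(n_params_billion, bits_per_param):
    """Rough memory for the model weights alone; ignores KV cache,
    activations and framework overhead."""
    bytes_total = n_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9  # decimal GB

print(weight_memory_gb(70, 16))  # fp16 70B: 140.0 GB
print(weight_memory_gb(70, 4))   # 4-bit quantised 70B: 35.0 GB
print(weight_memory_gb(30, 8))   # 8-bit 30B: 30.0 GB
```

This kind of arithmetic quickly shows whether a candidate model can plausibly fit on the GPUs you have, or whether quantisation or a smaller model is needed.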

5. Hallucinations

Hallucinations, though infrequent, can be challenging, requiring trial and error with prompts and clean fine-tuning data to reduce illogical token sequences. Some tasks are out of the model’s reach, and in these cases hallucinations are to be expected no matter how the prompt is formulated. One should not forget that the model doesn’t understand what is being said. It’s crucial not to trust an LLM for information straight from its pre-training unless it has been fine-tuned with a specific dataset or is only performing retrieval.

Employing an assessor LLM or integrating intelligent business rules into the detection process allows for the effective identification of hallucinations and facilitates the regeneration of output. Retrieval-augmented generation (RAG) also stands out as a widely adopted strategy in natural language processing; it combines retrieval (fetching relevant information from a pre-existing dataset) with generation (producing a response conditioned on what was retrieved). These safeguards play a crucial role in refining the quality and reliability of the generated content, ensuring it aligns more closely with the intended results.
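As a toy example of a business-rule check (a production system might use an assessor LLM instead), the rule below, which we made up for illustration, flags an answer if it quotes a number that never appears in the retrieved source:

```python
import re

def numbers_grounded(answer, source):
    """Simple business rule: every number quoted in the answer must
    appear somewhere in the source text, otherwise flag the answer."""
    answer_nums = set(re.findall(r"\d+(?:\.\d+)?", answer))
    source_nums = set(re.findall(r"\d+(?:\.\d+)?", source))
    return answer_nums <= source_nums

source = "Refunds are processed within 14 days of the return being received."
ok = numbers_grounded("You will be refunded within 14 days.", source)   # True
bad = numbers_grounded("You will be refunded within 30 days.", source)  # False
```

When a check like this fails, the pipeline can regenerate the output or fall back to a safe response rather than surfacing the hallucination to the user.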

6. Prompt engineering

Prompt engineering is a nuanced art that significantly influences the performance of language models, particularly in chatbot interactions. A well-crafted prompt ensures desired outcomes, but its placement and formulation demand careful consideration. Understanding its intricacies and investing time in trial and error pays dividends, especially when prompt engineering is combined with fine-tuning.

  • Position Matters: The location of instructions within the prompt impacts model response. Experimentation reveals that instructions at the start or early in prompt history carry weight. While summarising and shortening prompts can mitigate risks, validating instruction adherence is crucial, regardless of prompt placement.
  • System Prompt Significance: Certain models, like Llama 2 Chat, place importance on the system prompt. Understanding and adhering to the expected format, such as the Alpaca instruct format, is vital for optimal model performance. Another notable aspect is that the Alpaca instruction/input/output format offers flexibility, making it easy to adapt any training data into this format for fine-tuning.
  • Positive Framing: Avoid negative phrasing in prompts. Opt for positive instructions, enhancing clarity and effectiveness. For instance, use ‘keep your tone friendly’ instead of ‘don’t use an unfriendly tone.’
  • Temperature and Seed Control: For better comparison, either set a random seed for the model output or lower the temperature to a small value like 0.1. Outputs can vary wildly with the same prompt with high temperatures.
  • Few Shot Prompting: Utilise few-shot prompting (including a few examples in the prompt) as an effective alternative to fine-tuning. Including a few examples in the prompt can yield impressive results.
  • Avoid prompt stuffing: Adhere to established best practice guides to avoid pitfalls like “prompt stuffing”, i.e. trying to fit all the various instructions into a single prompt. Break down tasks into sub-tasks; for instance, rather than a single prompt requesting a creative story with mystery, adventure, and humour, separate prompts for each element yield more focused and coherent results. Foster a systematic approach, and leverage frameworks like Langchain for scalability; chaining multiple smaller prompts often yields better results than one big prompt.
  • Prompt tracking: Small changes in prompts can yield significantly different outputs. Careful attention to prompt details is essential for achieving desired results. Employ systematic tracking of prompts using tools like Langchain. This ensures organised testing, simplifying the management of numerous prompts.
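Several of the points above can be combined in a small sketch of few-shot prompt construction. The classification task and example reviews are hypothetical; note the instruction is placed first, since early placement carries weight:

```python
# Hypothetical few-shot examples; a real prompt would use domain-specific ones.
examples = [
    ("The package never arrived.", "negative"),
    ("Thanks, support sorted it in minutes!", "positive"),
]

def few_shot_prompt(query):
    # Instruction first, since placement early in the prompt carries weight.
    lines = ["Classify the sentiment of each review as positive or negative."]
    for text, label in examples:
        lines.append(f"Review: {text}\nSentiment: {label}")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

prompt = few_shot_prompt("Really happy with the quick response.")
```

Sending this with a low temperature (or a fixed seed, where the API supports one) makes runs comparable when you iterate on the wording.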

7. Interpreting model output

Assessing output quality in Generative AI is more work-intensive than traditional Machine Learning methods. While metrics like perplexity (how well a model predicts a sequence of words) can be employed with a reliable dataset, incorporating human judgement, especially from subject matter experts (SMEs), is crucial for a comprehensive evaluation process. An integrated approach where the SME works directly with the data scientists to develop data and prompts is often more effective. An additional benefit of this is that mismatched expectations around model capabilities and project velocity are more easily avoided.
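For reference, perplexity is simply the exponentiated mean negative log-likelihood of the probabilities the model assigned to the observed tokens. The token probabilities below are made up for illustration:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-likelihood of the
    probabilities the model assigned to each observed token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

confident = [0.9, 0.8, 0.95, 0.85]  # model rarely surprised: perplexity near 1
uncertain = [0.2, 0.1, 0.3, 0.25]   # model often surprised: much higher perplexity

print(perplexity(confident))
print(perplexity(uncertain))
```

Lower is better, but perplexity only measures how unsurprised the model is by a reference text, which is why human and SME judgement remains essential.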

In tasks involving text summarisation, error identification, or explanation, relying on conventional accuracy metrics is impractical. To enhance model evaluation, a quick and effective validation process for text outputs is crucial, especially given the challenge of quantifying accuracy in many tasks. Incorporating a human evaluation step proves instrumental in ensuring the quality of model outputs. Other key considerations in the overall model development workflow include adopting an experiment tracking platform like MLflow, using a GUI or script for prompt and parameter testing (e.g. the OpenAI playground), and planning for the complexities involved in productionising models.

8. Security and regulatory compliance

When it comes to prompt experimentation, it’s important to exercise caution, especially on platforms like ChatGPT web UI or Aviary. Experimenting with prompts on these platforms could be deemed as a potential data breach. To mitigate risks, it is advisable to conduct such experiments on personal instances or within the secure OpenAI organisation environment designed specifically for the project. This approach not only ensures data security but also aligns with best practices in ethical AI experimentation. Additionally, these third-party services may inadvertently pollute your experiments by having their own system prompts, guardrails or modifications that may alter the results of your experiments.

Alternatively, it is advisable to remove or anonymise personally identifiable information (PII), such as names, addresses, or other identifying details, from a dataset before sending it to an API. This way, users can mitigate the risk of exposing personal information during processes that involve data exchange. As a precautionary measure, practitioners should carefully navigate the ethical and legal landscape, considering potential privacy implications and ensuring compliance with relevant regulations. In essence, a thoughtful and cautious approach is crucial to safeguarding both data integrity and privacy throughout the experimentation and fine-tuning processes.
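A toy sketch of regex-based redaction is below. The patterns are deliberately simplified assumptions; real PII detection should use a dedicated anonymisation service or a named-entity-recognition model, not hand-rolled regexes (note the name in the example is not caught):

```python
import re

# Simplified patterns for illustration only; the phone pattern loosely
# matches Australian mobile formats and will miss many real variants.
PATTERNS = {
    "EMAIL": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "PHONE": r"\b(?:\+?61|0)4\d{2}[ ]?\d{3}[ ]?\d{3}\b",
}

def redact(text):
    for label, pattern in PATTERNS.items():
        text = re.sub(pattern, f"[{label}]", text)
    return text

msg = "Contact Jane on 0412 345 678 or jane@example.com about the order."
print(redact(msg))  # "Contact Jane on [PHONE] or [EMAIL] about the order."
```

Redaction like this happens before the text leaves your environment, so the third-party API never sees the raw identifiers.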

9. Ethical considerations

In any data science venture, a critical and recurrent consideration involves evaluating the potential impact of model outputs on individuals’ lives and well-being. As models generate predictions, recommendations, or responses, these outcomes can have tangible consequences. It is paramount to assess how the information provided by the model may influence decisions or actions that directly affect people. Whether the context involves employment, financial decisions, or other aspects of daily life, understanding the potential implications of the model’s outputs is fundamental for responsible and ethical data science practices.

Furthermore, this consideration extends to an awareness of potential biases embedded within LLMs. These biases may arise from the data on which the models were trained and can inadvertently influence the generated content. Acknowledging and addressing these biases is essential to prevent unintended discrimination or inequitable outcomes. Evaluating the impact of LLM-generated content within diverse contexts helps ensure that the technology is used responsibly and ethically, fostering a commitment to fairness and equity in the deployment of data science solutions.

To enhance interpretability, one can explore the application of chain-of-thought reasoning as a mechanism. This involves highlighting and recording the distinct steps the LLM took to produce the output. While not technically transparent, as each step generates its own LLM content, it can add a layer of explainability to the model, providing insights into its decision-making process and allowing maintainers to pinpoint where undesirable content entered the output.
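A minimal sketch of recording such a chain is shown below; `call_llm` is a stub standing in for a real model call, and the step names and templates are hypothetical:

```python
def call_llm(prompt):
    # Stub standing in for a real model call.
    return f"<answer to: {prompt}>"

def run_with_trace(steps, question):
    """Run a chain of prompts, recording each intermediate output so
    maintainers can pinpoint where undesirable content entered the chain."""
    trace, context = [], question
    for name, template in steps:
        prompt = template.format(context=context)
        context = call_llm(prompt)
        trace.append({"step": name, "prompt": prompt, "output": context})
    return context, trace

steps = [
    ("extract_facts", "List the facts relevant to: {context}"),
    ("draft_answer", "Answer using only these facts: {context}"),
]
answer, trace = run_with_trace(steps, "When are refunds paid?")
```

Persisting the trace alongside the final answer gives maintainers a step-by-step record to inspect when an output goes wrong.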

10. Skills & training

GenAI is a dynamic field evolving at a rapid pace and lacking a definitive set of best practices. Keeping up with the latest research is crucial, requiring adaptability and a readiness to embrace emerging trends.

Starting with simple prompts in a secure environment, like a private LLM-powered chatbot, lays a solid foundation. Encouraging teams to create specific prompts tailored to their needs fosters a repository of ‘approved’ or ‘suggested’ prompts. Senior leaders should actively engage with LLMs in their daily routines, enabling them to make informed decisions about the technology.

Fine-tuning poses challenges, with existing tools still in the early stages of maturity. Debugging and comprehending interactions between components demand time and patience, yet advancements are expected in this space. The Hugging Face ecosystem stands out as a crucial set of libraries to master, playing a pivotal role in navigating the complexities of GenAI.


In conclusion, as we brave the relatively new terrain of GenAI, the key to success lies in a continuous cycle of experimentation, learning, and knowledge sharing. Deciphering the optimal methods for training, fine-tuning, and prompt usage resembles solving a puzzle: it is challenging to nail down the perfect formula. Since hallucinations can occur and the accuracy of model outputs is hard to guarantee, it is important to include a human verification step in the results.

Our focus should extend to practical considerations, contemplating how to seamlessly integrate these models into real-world scenarios and weave them into our everyday workflows. This involves not only optimising their performance but also ensuring that the application of GenAI aligns with our existing processes, enhancing efficiency and usability. A thoughtful integration strategy becomes paramount to harness their full potential and contribute meaningfully to our daily operations.