LLM Quantization in a nutshell

TL;DR This blogpost summarizes the nuts and bolts of LLM quantization with llama.cpp. Introduction to Quantization The Technical Foundation of LLM Quantization Quantization, in the context of machine learning, refers to the process of reducing the precision of a model’s parameters, typically by converting floating-point numbers to lower-bit representations. This has profound implications for model deployment, particularly in making sizable LLMs more accessible. Understanding Quantization Quantization works by mapping the continuous range of floating-point values to a discrete set of levels....

January 28, 2024 · 4 min

What is Parameter Efficient Finetuning?

TL;DR Parameter Efficient Fine-Tuning is a technique that aims to reduce the computational and storage resources needed to fine-tune Large Language Models. Why should I care? Fine-tuning is a common technique used to enhance the performance of large language models. Essentially, fine-tuning involves training a pre-trained model on a new, similar task. It has become a crucial step in model optimization. However, when these models consist of billions of parameters, fine-tuning becomes computationally expensive and storage-heavy, which has led to the development of Parameter Efficient Fine-Tuning methods....

November 1, 2023 · 11 min

Hands on with Retrieval Augmented Generation

TL;DR This blogpost shows an example of a chatbot that uses Retrieval Augmented Generation to retrieve domain-specific knowledge before querying a Large Language Model. Hands on with Retrieval Augmented Generation For a primer on Retrieval Augmented Generation, please read my other post, What is Retrieval Augmented Generation?. Retrieval Augmented Generation can be a powerful architecture for easily building knowledge retrieval applications which (based on a recent study) even outperform LLMs with long context windows....

October 7, 2023 · 5 min

What is Retrieval Augmented Generation?

TL;DR This blogpost tries to explain Retrieval Augmented Generation. Retrieval Augmented Generation is an architecture for NLP tasks that can be used to easily productionize LLMs for enterprise architecture. Why should I care? The intuitive approach would be to train a Large Language Model with domain-specific data, in other words, to fine-tune the model weights with custom data. (Picture from Neo4J.) But fine-tuning large language models (LLMs) is a complex and resource-intensive process due to several key factors:...

September 29, 2023 · 8 min

Generalized Models vs Specialized Models

TL;DR This blogpost focuses on ML design and architecture and tries to give some intuition and hints for deciding between one generalized model and multiple specialized models for the same business requirement and dataset. Consider it a nudge to dive deeper into the topic. Why should I care? Transparency in machine learning is crucial for business stakeholders because it fosters trust and informed decision-making. Business stakeholders need to understand not only the potential benefits but also the limitations and risks associated with machine learning models....

September 5, 2023 · 7 min