A view on Byte Latent Transformers

TL;DR This blog post explains the Byte Latent Transformer (BLT), a tokenizer-free architecture for NLP tasks. BLT processes raw byte data dynamically, making it a scalable, efficient, and robust alternative to traditional token-based models. Why Should I Care? Traditional LLMs rely on tokenization—a preprocessing step that compresses text into a fixed vocabulary. While effective, tokenization introduces several challenges: High Costs: Fine-tuning LLMs with domain-specific data demands extensive computational resources, often requiring significant financial investments....

January 2, 2025 · 5 min

LLM Quantization in a nutshell

TL;DR This blogpost summarizes the nuts and bolts of LLM quantization with llama.cpp. Introduction to Quantization The Technical Foundation of LLM Quantization Quantization, in the context of machine learning, refers to the process of reducing the precision of a model’s parameters, typically converting floating-point numbers to lower-bit representations. This has profound implications for model deployment, particularly in rendering sizable LLMs more accessible. Understanding Quantization Quantization works by mapping the continuous range of floating-point values to a discrete set of levels....

January 28, 2024 · 4 min

What is Parameter Efficient Finetuning?

TL;DR Parameter Efficient Fine-Tuning is a technique that aims to reduce computational and storage resources during the fine-tuning of Large Language Models. Why should I care? Fine-tuning is a common technique used to enhance the performance of large language models. Essentially, fine-tuning involves training a pre-trained model on a new, similar task. It has become a crucial step in model optimization. However, when these models consist of billions of parameters, fine-tuning becomes compute- and storage-intensive, leading to the development of Parameter Efficient Fine-Tuning methods....

November 1, 2023 · 11 min

Hands on with Retrieval Augmented Generation

TL;DR This blogpost shows an example of a chatbot that uses Retrieval Augmented Generation to retrieve domain-specific knowledge before querying a Large Language Model. Hands on with Retrieval Augmented Generation For a primer on Retrieval Augmented Generation please read my other post What is Retrieval Augmented Generation?. Retrieval Augmented Generation can be a powerful architecture to easily build knowledge retrieval applications which (based on a recent study) even outperform LLMs with long context windows....

October 7, 2023 · 5 min

What is Retrieval Augmented Generation?

TL;DR This blogpost tries to explain Retrieval Augmented Generation. Retrieval Augmented Generation is an architecture for NLP tasks that makes it easier to productionize LLMs in enterprise architectures. Why should I care? The intuitive approach would be to train a Large Language Model with domain-specific data, in other words, to fine-tune the model weights with custom data. (Picture from Neo4J.) But fine-tuning large language models (LLMs) is a complex and resource-intensive process due to several key factors:...

September 29, 2023 · 8 min