Agentic AI (AI Agents)
An AI system that goes beyond chatting to actively perform tasks. It uses “reasoning” to break a complex goal into steps, calls tools to execute those steps, and reflects on the results. Key Components: Planning (Thinking), Tool Use (Acting), Memory (Remembering context).
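The plan-act-reflect loop can be sketched in plain Python. The planner and tools below are hard-coded stand-ins for what would really be LLM calls and real integrations:

```python
# Minimal agent-loop sketch: plan -> act (tool use) -> remember.
# plan() and TOOLS are hypothetical stand-ins, not a real LLM or toolset.

def plan(goal):
    """Break a goal into ordered (tool, argument) steps; an LLM does this in practice."""
    return [("search", goal), ("summarize", goal)]

TOOLS = {
    "search": lambda q: f"raw results for '{q}'",
    "summarize": lambda q: f"summary of '{q}'",
}

def run_agent(goal):
    memory = []                            # remembered context across steps
    for tool_name, arg in plan(goal):
        result = TOOLS[tool_name](arg)     # act: execute the step with a tool
        memory.append((tool_name, result)) # remember the outcome for reflection
    return memory[-1][1] if memory else "no result"

print(run_agent("quarterly sales"))
```

A real agent would also feed `memory` back into the planner so it can revise its remaining steps — the reflection half of the loop.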
ChromaDB (Vector Database)
A specialized database designed to store Vector Embeddings. It allows you to perform “Semantic Search” (searching by meaning rather than exact keywords). Used heavily in RAG pipelines.
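What a vector database does can be shown with a toy in-memory store — rank documents by vector similarity instead of keyword match. The two-dimensional embeddings here are hand-made for illustration; ChromaDB's actual API differs:

```python
import math

# Toy vector store: (embedding, document) pairs retrieved by cosine
# similarity. Embeddings are tiny hand-picked vectors, not real ones.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

store = [
    ([0.9, 0.1], "How to reset your router"),
    ([0.1, 0.9], "Pasta recipes for beginners"),
]

def semantic_search(query_embedding, k=1):
    ranked = sorted(store, key=lambda item: cosine(item[0], query_embedding), reverse=True)
    return [doc for _, doc in ranked[:k]]

# A networking query (embedding near [1, 0]) finds the router doc even
# with zero keyword overlap -- that is the "search by meaning" idea.
print(semantic_search([1.0, 0.0]))
```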
Context Window Optimization
Techniques used to manage the limited “short-term memory” (Context Window) of an LLM. Techniques: Summarization chains, sliding windows, or RAG (fetching only relevant snippets) to ensure you don’t overflow the model’s token limit.
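A sliding window can be sketched in a few lines: keep the system prompt plus the most recent messages that fit the budget. The whitespace tokenizer is a stand-in; real code would use the model's own tokenizer:

```python
# Sliding-window sketch: retain the system prompt plus as many recent
# messages as fit in a (hypothetical) token budget.

def count_tokens(text):
    return len(text.split())  # stand-in for the model's real tokenizer

def trim_history(system, messages, budget):
    kept = []
    used = count_tokens(system)
    for msg in reversed(messages):          # walk newest-first
        cost = count_tokens(msg)
        if used + cost > budget:
            break                            # window is full; drop older turns
        kept.append(msg)
        used += cost
    return [system] + list(reversed(kept))  # restore chronological order

history = ["hello there", "tell me about RAG", "what is a token limit"]
print(trim_history("You are helpful.", history, budget=10))
```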
Data Curation
The engineering process of cleaning, formatting, and filtering raw data (e.g., messy PDF manuals) before feeding it to an AI. Rule of Thumb: “Garbage In, Garbage Out.” High-quality curation is often more important than the model architecture itself.
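A minimal curation pass over raw PDF-extracted text might normalize whitespace, drop page-number debris, and discard fragments too short to be useful. The rules and thresholds below are illustrative:

```python
import re

# Illustrative curation pass: normalize whitespace, strip page-number
# artifacts, and drop fragments too short to carry meaning.

RAW_CHUNKS = [
    "Page 12",                                     # PDF extraction artifact
    "Reboot  the   device,\nthen wait 30 seconds.",
    "ok",                                          # too short to be useful
]

def curate(chunks, min_words=3):
    cleaned = []
    for chunk in chunks:
        text = re.sub(r"\s+", " ", chunk).strip()  # collapse whitespace/newlines
        if re.fullmatch(r"Page \d+", text):        # drop page-number lines
            continue
        if len(text.split()) < min_words:          # drop tiny fragments
            continue
        cleaned.append(text)
    return cleaned

print(curate(RAW_CHUNKS))
```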
Fine-Tuning
The process of taking a pre-trained model (like Llama 3) and training it further on a specific dataset to change its behavior or knowledge.
Hugging Face Transformers
A Python library that simplifies downloading and using pre-trained models. It provides a standard interface to load almost any model (Llama, Mistral, BERT) with just a few lines of code.
LangChain
A “Glue Code” framework that helps chain together different AI steps (e.g., “Get User Input” -> “Search Vector DB” -> “Send to LLM”).
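The chaining pattern itself can be shown with plain functions — the steps below are stand-ins for what LangChain wires together (prompts, retrievers, model clients), not the LangChain API:

```python
# The "glue code" pattern in plain Python: each step's output feeds the
# next. All three steps are hypothetical stand-ins.

def get_user_input(_):
    return "how do I reset the router?"

def search_vector_db(query):
    return {"query": query, "context": "Hold the reset button for 10s."}

def send_to_llm(payload):
    return f"Answer based on: {payload['context']}"

def chain(*steps):
    def run(value):
        for step in steps:
            value = step(value)   # pipe the result into the next link
        return value
    return run

pipeline = chain(get_user_input, search_vector_db, send_to_llm)
print(pipeline(None))
```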
LangGraph
An extension of LangChain specifically designed for building Agents. It allows for loops and cyclic graphs (e.g., “Try to run code -> Did it fail? -> If yes, debug and retry”).
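The cyclic “try → check → retry” flow that distinguishes a graph from a linear chain looks like this in plain Python (the failing-then-succeeding “coder” is a stand-in for an LLM node):

```python
# Cyclic retry pattern: run -> did it fail? -> debug and loop back.
# try_run_code and debug_and_retry are stand-ins for LLM graph nodes.

def try_run_code(attempt):
    return attempt >= 1              # stand-in: the first attempt fails

def debug_and_retry(attempt):
    return attempt + 1               # stand-in for an LLM fixing the code

def agent_loop(max_attempts=3):
    attempt = 0
    while attempt < max_attempts:    # the cycle a plain chain cannot express
        if try_run_code(attempt):
            return f"success after {attempt + 1} attempt(s)"
        attempt = debug_and_retry(attempt)
    return "gave up"

print(agent_loop())
```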
LM Studio
A GUI-based tool optimized for consumer hardware (Mac/Windows). Best used for exploration—quickly testing models and prompts manually before writing code.
LoRA (Low-Rank Adaptation)
The most popular type of PEFT. It inserts small “adapter layers” into the neural network. Why it matters: Allows you to fine-tune a huge model on a single consumer GPU (or Google Colab) in hours, rather than requiring a supercomputer cluster.
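The savings come from simple arithmetic: instead of training a full d×d weight update, LoRA trains two low-rank factors B (d×r) and A (r×d) with r much smaller than d. With an illustrative layer size:

```python
# Parameter-count arithmetic behind LoRA. The layer size and rank are
# illustrative, not from any specific model.

d = 4096        # hidden size of one weight matrix
r = 8           # LoRA rank (r << d)

full_update_params = d * d            # training the full delta-W
lora_params = d * r + r * d           # training only B (d x r) and A (r x d)

print(full_update_params)                        # 16_777_216
print(lora_params)                               # 65_536
print(round(full_update_params / lora_params))   # 256x fewer trainable params
```

Repeated across every adapted layer, that ~256× reduction is why a consumer GPU suffices.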
MLOps (Machine Learning Operations)
The application of DevOps principles to Machine Learning systems. It covers the end-to-end lifecycle: data preparation, model training, automated testing, deployment, monitoring, and retraining.
Multi-Agent Orchestration
A design pattern where multiple specialized AI agents collaborate to solve a problem. One agent might act as the “Researcher,” another as the “Coder,” and a third as the “Reviewer.”
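The orchestration pattern can be sketched with each agent as a stand-in function and a loop that routes work between them until the reviewer approves:

```python
# Researcher -> Coder -> Reviewer sketch. Each "agent" is a hypothetical
# stand-in function; a real system would back each with an LLM.

def researcher(task):
    return f"notes on {task}"

def coder(notes):
    return f"draft using {notes}"

def reviewer(draft):
    # Stand-in review: approve any draft grounded in research notes.
    return ("approve", draft) if "notes" in draft else ("revise", draft)

def orchestrate(task, max_rounds=3):
    notes = researcher(task)
    draft = coder(notes)
    for _ in range(max_rounds):
        verdict, draft = reviewer(draft)
        if verdict == "approve":
            return draft
        draft = coder(notes)             # send back for another pass
    return draft

print(orchestrate("log parser"))
```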
Ollama
A command-line tool and API optimized for Mac/Linux. Best used for development—running models locally while writing Python code for agents.
PEFT (Parameter-Efficient Fine-Tuning)
A category of fine-tuning methods where you freeze the massive main model and only train a small number of extra parameters (adapters). This drastically reduces the hardware required to train models.
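The freeze-most, train-few split is easiest to see as a parameter count. The numbers below are illustrative, roughly shaped like a 7B model with small adapters:

```python
# Illustrative PEFT accounting: the base model is frozen; only the
# adapters are trainable. All counts are made up for illustration.

base_params = {"embed": 50_000_000, "layers": 6_950_000_000}   # frozen
adapter_params = {"lora_A": 2_000_000, "lora_B": 2_000_000}    # trainable

trainable = sum(adapter_params.values())
total = sum(base_params.values()) + trainable

print(trainable)                                        # 4_000_000
print(f"{100 * trainable / total:.2f}% of parameters are trained")
```

Gradients and optimizer state are only needed for the trainable slice, which is where the hardware savings actually come from.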
PyTorch
The open-source machine learning library developed by Meta. It is the industry standard for research and Generative AI, handling the complex mathematics required for Neural Networks.
Quantization (GGUF / AWQ)
Reducing the precision of the model’s numbers (weights) to save memory (RAM/VRAM).
- GGUF: Optimized for CPUs and Apple Silicon (used by Ollama/LM Studio).
- AWQ: Optimized for Nvidia GPUs (used in production serving).
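The core idea is simple to demonstrate: map float weights onto small integers with a shared scale factor, trading a little precision for a large memory saving. This is a bare 8-bit sketch, far simpler than what GGUF or AWQ actually do:

```python
# Minimal 8-bit quantization sketch: map floats onto integers in
# [-127, 127] with one scale factor, then dequantize to approximate them.

def quantize(weights):
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.31, 0.02]
q, scale = quantize(weights)
approx = dequantize(q, scale)

print(q)   # small integers: 1 byte each instead of a 4-byte float
print(max(abs(w - a) for w, a in zip(weights, approx)))  # small rounding error
```

Real formats quantize per block with per-block scales (and AWQ picks scales to protect activation-critical weights), but the memory trade-off is the same.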
RAG (Retrieval-Augmented Generation)
A technique to “ground” an LLM with external data. Instead of relying solely on its training data, the system searches a private database (like PDF manuals) for relevant info and inserts it into the prompt before the LLM generates an answer.
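A stripped-down RAG step looks like this — retrieve the most relevant snippet, then splice it into the prompt before the model is called. Keyword overlap stands in for the vector search a real pipeline would use:

```python
# RAG sketch: retrieve a relevant snippet from a toy "database", then
# insert it into the prompt ahead of the question. Keyword overlap is a
# stand-in for embedding-based retrieval.

DOCS = [
    "To reset the router, hold the button for ten seconds.",
    "The warranty covers hardware faults for two years.",
]

def retrieve(question):
    words = set(question.lower().split())
    return max(DOCS, key=lambda d: len(words & set(d.lower().split())))

def build_prompt(question):
    context = retrieve(question)       # ground the answer in external data
    return f"Context: {context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("how do I reset the router"))
```

The LLM then answers from the supplied context rather than from its training data alone, which is what “grounding” means in practice.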
Vector Embeddings
Converting text into lists of numbers (vectors) that represent the meaning of the text. Example: The vector for “King” minus “Man” plus “Woman” results in a vector close to “Queen.”
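The King/Queen analogy can be reproduced with toy 3-dimensional vectors on hand-picked axes (roughly “royalty” and “maleness”). Real embeddings have hundreds of learned dimensions, but the arithmetic is the same:

```python
import math

# Toy hand-made embeddings illustrating king - man + woman ~= queen.
# Axes are roughly (royalty, maleness, unused); values are invented.

EMB = {
    "king":  [0.9, 0.9, 0.1],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.1],
    "queen": [0.9, 0.1, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Vector arithmetic: king - man + woman
target = [k - m + w for k, m, w in zip(EMB["king"], EMB["man"], EMB["woman"])]

nearest = max(EMB, key=lambda word: cosine(EMB[word], target))
print(nearest)  # -> queen
```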
Vertical AI
Artificial Intelligence models trained or fine-tuned specifically for a single industry or domain (e.g., Telecom, Legal, Medical), rather than general-purpose tasks.
vLLM
A high-performance Python library for serving models (running the inference API). It is the industry standard for production deployment because of its high throughput, achieved via PagedAttention.
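The memory idea behind PagedAttention can be sketched without vLLM itself: the KV cache is carved into fixed-size blocks, and each sequence holds a list of block IDs rather than one large contiguous slab, so memory is allocated only as tokens arrive. This is a loose conceptual sketch, not vLLM's implementation:

```python
# Conceptual sketch of paged KV-cache allocation: fixed-size blocks
# handed out on demand, tracked per sequence in a block table.

BLOCK_SIZE = 16          # tokens per block (illustrative)

class BlockTable:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # pool of unused block IDs
        self.tables = {}                      # sequence id -> list of block IDs

    def append_tokens(self, seq_id, n_tokens):
        """Ensure seq_id has enough blocks to hold n_tokens total."""
        table = self.tables.setdefault(seq_id, [])
        have = len(table) * BLOCK_SIZE
        while have < n_tokens:                # grow one block at a time
            table.append(self.free.pop(0))
            have += BLOCK_SIZE

mgr = BlockTable(num_blocks=8)
mgr.append_tokens("seq-a", 20)    # 20 tokens -> 2 blocks
mgr.append_tokens("seq-b", 5)     # 5 tokens  -> 1 block
print(mgr.tables)
```

Because blocks need not be contiguous, many sequences of unpredictable length share the GPU cache with little waste — the source of vLLM's throughput advantage.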