
QLoRA - How to Fine-Tune an LLM on a Single GPU

(00:00) Fine-tuning is when we take an existing model and tweak it for a particular use case. Although this is a simple idea, applying it to large language models isn't always straightforward. The key challenge is that large language models are very computationally expensive, which means fine-tuning them in a standard way is not something you can do on a typical computer or laptop. In this video I'm going to talk about QLoRA, which is a technique that makes fine-tuning large language models much more accessible.

(00:33) If you're new here, welcome. I'm Shaw, I make content about data science and entrepreneurship, and if you enjoy this video, please consider subscribing; that's a great no-cost way you can support me in all the content that I make. Since I talked in depth about fine-tuning in a previous video of this series, here I'll just give a high-level recap of the basic idea. Like I said before, fine-tuning is tweaking an existing model for a particular use case. An analogy: fine-tuning is like taking a raw (01:01) diamond and refining and distilling it into something more practical and usable, like a diamond you might put on a diamond ring. In this analogy, the raw diamond is your base model, something like GPT-3, while the final diamond you come away with is your fine-tuned model, something like ChatGPT.

Again, the core problem with fine-tuning large language models is that they are computationally expensive. To get a sense of this, let's say you have a pretty powerful laptop that comes with a CPU and a GPU, where (01:34) the CPU has 16 GB of RAM and the GPU has 16 GB of RAM, and let's say we want to fine-tune a 10-billion-parameter model. Each of these parameters corresponds to a number, which we need to represent on our machine. The standard way of doing this is the FP16 number format, which requires about two bytes of memory per parameter. Doing some simple math, 10 billion parameters times 2 bytes per parameter comes to 20 GB of memory just to store the model parameters. One problem here is that this 20 GB model won't fit on the CPU or GPU alone, but (02:14) maybe we can get clever in how we distribute the memory, so the load of the model is split between the CPU and GPU. That allows us to do things like inference and make predictions with the model.

However, when we talk about fine-tuning, we're talking about retraining the model parameters, which requires more than just storing the parameters of the model. Another thing we need are the gradients: these are numbers we use to update the model parameters during training. We'll have a gradient, which is (02:46) just a number, for every parameter in the model, so this adds another 20 GB of memory. We've gone from 20 to 40 GB, and now even if we get super clever about how we distribute it across the CPU and GPU, it's still not going to fit, so we'd actually need to add another GPU to even make that work. But of course this isn't the whole story: you also need room for the optimizer states. If you're using an optimizer like Adam, which is very widely used, this is going to take the bulk of the memory footprint for model training. (03:18) Where this comes from is that an optimizer like Adam stores a momentum value and a variance value for each parameter in your model, so we'll have two numbers per parameter. Additionally, these values need to be encoded with higher precision, so instead of the FP16 format they're encoded in FP32. When it's all said and done, there's about a 12x multiplier on the memory footprint from these optimizer states, which means we're going to need a lot more GPUs to actually fine-tune this model.
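To make that arithmetic concrete, here is a rough back-of-the-envelope sketch of the numbers quoted in this section, assuming 2 bytes each for FP16 weights and gradients and roughly 12 bytes per parameter of FP32 optimizer state (the accounting used for mixed-precision Adam in the ZeRO paper):

```python
# Back-of-the-envelope memory math for fully fine-tuning a 10B-parameter model
n_params = 10e9

weights_gb   = n_params * 2 / 1e9    # FP16 weights: 2 bytes per parameter    -> 20 GB
gradients_gb = n_params * 2 / 1e9    # FP16 gradients: 2 bytes per parameter  -> 20 GB
optimizer_gb = n_params * 12 / 1e9   # FP32 weight copy + momentum + variance -> 120 GB

print(weights_gb + gradients_gb + optimizer_gb)  # ~160 GB
```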
(03:51) These calculations are based on reference number two, the ZeRO paper, which is a method for efficiently training these deep neural networks. So we come to a grand total of 160 GB of memory required to train a 10-billion-parameter model. Of course, these enormous memory requirements aren't going to fit on your laptop, and they require some heavy-duty hardware. For 160 GB, if you get an 80 GB GPU like the A100, you'll need at least two of those, and those are about $20,000 a pop, so you're probably talking (04:28) about something like $50,000 just for the hardware to fine-tune a 10-billion-parameter model in the standard way. This is where QLoRA comes in. QLoRA is a technique that makes this whole fine-tuning process much more efficient, so much so that you can just run it on your laptop without the need for all these extra GPUs.

Before diving into QLoRA, a key concept we need to understand is quantization, and even though quantization might sound like a scary and sophisticated word, it's actually a very simple idea. Whenever you hear (05:02) quantization, just think "splitting a range of numbers into buckets." As an example, let's consider any number between 0 and 100. Obviously there are infinitely many numbers that fit in this range: 27, 55.3, 83.78, and so on. What quantization consists of is taking this infinite range of numbers and splitting it into discrete bins. One way of doing this is quantizing the range using whole numbers, so 27 would go into the 27 bucket, (05:38) 55.3 would go into the 55 bucket, and 83.78 would go into the 83 bucket. We could also go coarser and use just ten buckets, in which case 27 would go to 20, 55 would go to 50, and 83 would go to 80. That's the basic idea, and the reason it's important is that quantization is required whenever you want to represent numbers in a computer: if you wanted to encode a single number that lives in an infinite range of possibilities, it would require infinite bytes of memory, which just can't be done. At some point, when you're talking about a physically constrained system like a computer, you have to make some (06:25) approximations. If we go from the infinite range to the range quantized by whole numbers, it would require about 0.875 bytes per number, and if we go one step further and split it into those ten buckets, it would require about half a byte per number.

One thing to point out is that there's a natural tradeoff: we could have a lot of buckets, which gives us a lot of precision but increases the memory footprint of our model, or we could have very few (06:56) buckets, which minimizes the memory footprint but gives a pretty crude approximation of the model we're working with. Balancing this tradeoff is a key contribution of QLoRA. There are actually four ingredients that come together to make up QLoRA: the first is 4-bit NormalFloat, the second is double quantization, the third is paged optimizers, and the final one is LoRA. I'm going to talk through each of these ingredients one by one.
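Here is a tiny sketch of that bucketing idea, assuming each value is assigned to the bucket whose lower edge it falls above, which reproduces the byte counts mentioned above:

```python
import math

def bucket(x, width):
    """Assign x to a bucket of the given width, labeled by the bucket's lower edge."""
    return width * math.floor(x / width)

values = [27, 55.3, 83.78]
print([bucket(v, 1) for v in values])    # whole-number buckets: [27, 55, 83]
print([bucket(v, 10) for v in values])   # ten buckets of width 10: [20, 50, 80]

# bytes per value = bits needed to index the buckets covering 0-100
print(math.ceil(math.log2(101)) / 8)     # 0.875 bytes for whole-number buckets
print(math.ceil(math.log2(10)) / 8)      # 0.5 bytes for ten buckets
```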
Starting with ingredient one, 4-bit NormalFloat: (07:32) all this is, is a better way to bucket numbers, a better way to do quantization. Let's break it down. When we say something is 4-bit, we mean we're using four binary digits to represent that piece of information, and since each digit can be either zero or one, this gives us 16 unique combinations. So with a 4-bit representation we have 16 buckets at our disposal for quantization. Compressing a range of numbers into just 16 buckets is great for memory savings: four bits translates to half a (08:11) byte per parameter, so if we have 10 billion parameters, that translates to 5 GB of memory. But of course this brings up the same problem I mentioned earlier, the tradeoff: yes, we get huge memory savings, but now we have a very crude approximation of the numbers we're trying to represent.

The way ingredient one, 4-bit NormalFloat, works is that it buckets the numbers in a particular and clever way. Suppose we have all the parameters in our model and we plot their distribution. When it comes to (08:41) these deep neural networks, it turns out that most of the parameter values sit around zero, and very few values are much smaller or much larger than zero. In other words, we have something that resembles a normal distribution of model parameters. So if we follow the quantization strategy I talked about a couple of slides ago, where we split the numbers into equally spaced buckets, we get a pretty crude approximation of the model parameters, because most of our (09:12) numbers end up sitting in the two buckets around zero, with very few numbers in the buckets at the ends. An alternative way to quantize is, instead of using equally spaced buckets, to use equally sized buckets: instead of mapping the parameters into eight equally spaced buckets, we map them into eight buckets that each hold the same share of values. Now we have a much more even distribution of model parameters across the buckets, and this is exactly the idea 4-bit NormalFloat uses to balance the (09:49) tradeoff between low memory and accurately representing model parameters.

The next ingredient is double quantization, which consists of quantizing the quantization constants. I know the word "quantize" is appearing way more than anyone would ever like on this slide, but let's break it down step by step. Consider this simple quantization strategy: we have an array of numbers X represented using 32 bits, and we want to translate it into an 8-bit representation (10:22) whose values live between -127 and 127. Essentially we're quantizing by whole numbers and forcing the values to live in the range from -127 to 127. A simple way of doing that is to rescale all the values in the array by the absolute maximum value of the array, multiply by the new maximum value, which is 127 in our quantized range, and then round so that there are no decimal (10:56) points. This is a very simple way to quantize an arbitrary array encoded in 32-bit into an 8-bit integer representation, and to keep things simple we can collapse that prefactor into a single constant encoded in 32-bit. While this simple quantization strategy isn't how we do it in practice, because with equally sized buckets it's not just a linear transformation, it does illustrate the point that any time you do quantization, there's going to be some memory overhead involved in that computation.
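A minimal NumPy sketch of that simple absmax rescaling, using toy values rather than real model weights; the outlier in the array shows why a single rescaling constant can wash out the small values:

```python
import numpy as np

def absmax_quantize(x_fp32):
    """Rescale by the absolute maximum, stretch to [-127, 127], and round to int8."""
    c = 127 / np.max(np.abs(x_fp32))               # the 32-bit quantization constant
    return np.round(c * x_fp32).astype(np.int8), c

def dequantize(x_int8, c):
    """Approximately recover the original values from the int8 array and the constant."""
    return x_int8.astype(np.float32) / c

x = np.array([0.03, -0.11, 0.002, 1.9], dtype=np.float32)   # 1.9 plays the outlier
x_q, c = absmax_quantize(x)
print(x_q)                 # [2, -7, 0, 127]: the small values get squeezed toward zero
print(dequantize(x_q, c))
```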
(11:31) In other words, these constants are going to take up precious memory in your system. As an initial strategy you might think: if we rescale the whole input tensor at once, we only get one new constant, a single 32-bit number, for all the parameters in our model, so what's the big deal? What's one more number compared to 10 billion parameters? While this does have trivial memory implications, it may not be the best way (12:01) to quantize our model parameters, because it's very sensitive to extreme values in the input tensor. If most of the model parameters are close to zero but one parameter way out in the tails is your absolute max, it will introduce a lot of bias into the quantization process. So this standard quantization approach minimizes memory, but it comes with maximum potential for bias.

An alternative strategy is as follows: we (12:35) take the input tensor, reshape it, split it into blocks, and then do the rescaling process within each block. This significantly reduces the odds of one extreme value skewing all the model parameters during quantization. This is called blockwise quantization, and although it comes with a greater memory footprint, it has a lot less bias. To mitigate the memory cost of the blockwise approach, we can employ double quantization, which performs this quantization process (13:14) and then performs it once again on all the constants that pop up from the blockwise approach. If we just repeat the same simple strategy, we now have an array of constants encoded in 32-bit, and we can quantize them into a lower-bit format using the same simple approach. That's double quantization: we are indeed quantizing the quantization constants. While it might be an unfortunate name, it is a pretty straightforward process.
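And here is a toy illustration of blockwise quantization plus double quantization built on the same absmax trick; the block size, the random input, and the 8-bit target for the constants are illustrative choices for the sketch, not the exact scheme the QLoRA paper uses:

```python
import numpy as np

def blockwise_quantize(x_fp32, block_size=64):
    """Split the tensor into blocks and rescale each block with its own absmax constant,
    so one extreme value only skews its own block."""
    blocks = x_fp32.reshape(-1, block_size)
    consts = 127 / np.max(np.abs(blocks), axis=1, keepdims=True)   # one FP32 constant per block
    return np.round(consts * blocks).astype(np.int8), consts

def quantize_constants(consts_fp32):
    """Double quantization: the per-block constants are themselves an array of FP32
    numbers, so compress them too with the same simple absmax trick."""
    c2 = 127 / np.max(np.abs(consts_fp32))
    return np.round(c2 * consts_fp32).astype(np.int8), c2

x = np.random.randn(4096).astype(np.float32)
x_q, consts = blockwise_quantize(x)
consts_q, c2 = quantize_constants(consts)
print(x_q.shape, consts.shape, consts_q.shape)   # (64, 64) (64, 1) (64, 1)
```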
(13:45) Ingredient three is paged optimizers. All we're doing here is looping your CPU into the training process. Let's say we have a small model like Phi-1, which has 1.3 billion parameters and which, based on the same calculations we saw earlier, would require about 21 GB of memory for full fine-tuning. The dilemma is that although we have enough memory across the GPU and CPU for all 21 GB needed to fully fine-tune Phi-1, this isn't something that necessarily works out of the box. These are independent modules on your machine, and (14:23) typically the training process is restricted to your GPU. With a paged optimizer, instead of restricting training to what fits on your GPU, you can move memory as needed from the GPU to the CPU and bring it back onto the GPU later. What that might look like is: you start model training with one page of memory, where a page is a fundamental unit or block of memory on the GPU or CPU. Pages accumulate during training until your memory gets (14:59) full, at which point, with the paged-optimizer approach, you can start moving pages of memory over to the CPU to make room for new memory for training. If you then need a page that was moved to the CPU back on the GPU, you make room for it there and move it back over. That's the basic idea. Honestly, I don't know exactly how this all works; I'm not a hardware guy and I don't fully understand computer architecture, but this is (15:29) my high-level understanding as a data scientist. If you want to learn more, check out the QLoRA paper, where they talk a bit more about it and provide some additional references.

The final ingredient of QLoRA is LoRA, which stands for low-rank adaptation. I talked about LoRA in depth in a previous video on fine-tuning, so here I'll just give a brief high-level description of how it works; if you want more details, check out that previous video or the LoRA paper linked in the description below. (16:01) What LoRA does is fine-tune a model by adding a small number of trainable parameters. We can see how this works by contrasting it with the standard full fine-tuning approach. Say this is our model, a neural network with an input layer, a hidden layer, and an output layer. Full fine-tuning consists of retraining every single parameter in this model. Considering one layer at a time, we have a weight matrix corresponding to all the lines in the (16:35) picture, a matrix W0 consisting of all the parameters for that particular layer, and all of these are trainable. That's not a big deal for the six parameters in this shallow network, but in a large language model these matrices get pretty big, and you have a lot of them because you have a lot of layers.

LoRA, on the other hand, instead of fine-tuning every single parameter in your model, actually freezes every parameter in the (17:03) model and adds a small set of trainable parameters which you then fine-tune. The way this works is you take the same hidden layer and add a small set of trainable parameters through a delta-W matrix. Looking at this, you might think: how does this help? Delta-W is the same size as W0, so how is this a smaller set of trainable parameters? The trick with LoRA is that this delta-W is actually the product of two smaller (17:38) matrices, B and A, which have the appropriate dimensions to make all the math work out. Visually, you have your W0, but then you have B and A, which have far fewer parameters than W0, yet when you multiply them together their product has the proper shape to make all the matrix operations work. You freeze W0, so you don't train those parameters, and the parameters housed in B and A are the trainable ones. The result of training the model this way is that you can get (18:11) 100x to even 1,000x savings in trainable model parameters, so instead of having to train 10 billion parameters, you're only training something like 100 million or 50 million parameters.
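A quick worked example of the parameter savings; the layer dimensions and rank are illustrative values, not taken from any particular model:

```python
import numpy as np

d, k, r = 4096, 4096, 8             # layer dimensions and LoRA rank (illustrative values)

full_params = d * k                 # training a full delta-W directly
lora_params = d * r + r * k         # training B (d x r) and A (r x k) instead

print(full_params, lora_params)     # 16777216 vs 65536 trainable parameters
print(full_params / lora_params)    # ~256x fewer for this layer

# B @ A has the same shape as delta-W, so the frozen W0 plus B @ A slots into the
# layer's forward pass unchanged; B starts at zero so training begins from W0.
B, A = np.zeros((d, r)), np.random.randn(r, k)
print((B @ A).shape)                # (4096, 4096)
```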
So let's bring these four ingredients together. First, look at the standard fine-tuning approach as a baseline: the base model is represented in FP16, so we have the memory footprint from the base model plus the larger memory footprint from the optimizer states, and (18:42) no adapters, because adapters only come in when doing LoRA or another parameter-efficient fine-tuning method. We do the forward pass on the model, it goes to the optimizer, and the optimizer does the backward pass and updates the model parameters. This is the standard fine-tuning approach we talked about earlier, so a 10-billion-parameter model requires about 160 GB of memory.

Another thing we could do is use LoRA, which gives us that 100x to 1,000x savings in the number (19:14) of trainable parameters. We still have our model represented in 16-bit, but now instead of fine-tuning every single parameter in the model, we only have a small number of trainable parameters, and each of those has an associated optimizer state, which significantly reduces the memory footprint: a 10-billion-parameter model would only require about 40 GB of memory. While this is a tremendous savings, around 4x in memory, 40 GB is still a lot to ask from consumer hardware. So let's see how QLoRA (19:47) helps even further. The key thing is that instead of the 16-bit representation, we use ingredient one and encode the base model as 4-bit NormalFloat. We keep the same number of trainable parameters from LoRA, and we use ingredient three, the paged optimizers, to avoid any out-of-memory errors that might come up during training. With that, and including the double quantization, we can use QLoRA to fine-tune a 10-billion- (20:21) parameter model with just about 12 GB of memory, which easily fits on consumer hardware and can even run using the free resources available on Google Colab.

So let's see a concrete example of that. Here we're going to fine-tune Mistral-7B-Instruct to respond to YouTube comments. This example is available in the Google Colab associated with this video, the model and dataset are freely available on Hugging Face, and there is a GitHub repo with all the resources as (20:58) well as the code used to generate the training dataset. The first thing we need to do is import some libraries. Everything here comes from Hugging Face: their Transformers library; their PEFT library (parameter-efficient fine-tuning), which is what allows us to do QLoRA; and their Datasets library, because I uploaded the training dataset to the Hugging Face Hub. Finally, we import the transformers module itself. There are also some sub-dependencies needed to make certain modules work, I (21:27) think mainly prepare_model_for_kbit_training; you don't need to import those, but you do need to make sure they're installed in your environment. This was actually a pain, because bitsandbytes only works on Linux and Windows with Nvidia hardware, and GPTQ, the format used to encode the model here, doesn't run on Mac. As a Mac user this was kind of frustrating: lots of trial and error to get it to work on my machine locally, but I wasn't able to. So if anyone was able to get (21:56) it running on an M1, M2, or even M3, send me your example code or any resources you found helpful; I would love to get a version working on my personal machine. But since Colab provides a Linux environment with Nvidia hardware, the code here works fine.
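A sketch of roughly what those imports look like, assuming the Hugging Face transformers, peft, and datasets packages; bitsandbytes and the GPTQ dependencies also need to be installed in the environment even though they are not imported directly:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
from datasets import load_dataset
import transformers   # used later for the data collator and training arguments
```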
Next we can load the quantized model. We're going to grab a version of Mistral-7B-Instruct from TheBloke, and if you're not familiar with TheBloke, he has quantized and shared thousands of these large language models (22:28) completely for free on the Hugging Face Hub. We import the model using the from_pretrained method: we specify the model name on the Hub; device_map set to "auto" lets the Transformers library figure out the optimal way to spread the load between the GPU and CPU when loading the model; trust_remote_code set to False means a custom model file won't be allowed to run on your local machine, which is just a way to protect your machine when downloading code from the (22:59) Hub; and revision "main" says we want the main version of the model available at this repo. Again, GPTQ, the format used here, does not run on Mac; there are some other options for Mac, but I wasn't able to get them working on my machine. Once the quantized model is loaded, we can load the tokenizer, which is also easy with the from_pretrained method: we specify the model name and set the use_fast argument to True. With just those two simple blocks of code we can use the (23:33) base model. One thing we do here is put the model into evaluation mode, which deactivates the dropout modules.

Next we can craft our prompt. Say we have a comment from YouTube that says "Great content, thank you!", and we put it into the proper prompt format. Mistral-7B-Instruct is an instruction-tuned model, so it expects the prompt in a very particular format, namely wrapped in the instruction-start and instruction-end special tokens ([INST] and [/INST]). We set that up very easily: (24:09) the template just dynamically takes the comment variable and sticks it into the prompt. Once we have that, we pass the prompt to the tokenizer, which translates it from a string into an array of numbers, and we pass that array into the model to generate more text. Then we take the outputs, pass them back to the tokenizer, and have it decode the vector back into English.

The output (24:42) for this "Great content, thank you!" comment is: "I'm glad you found the content helpful! If you have any specific questions or topics you'd like me to cover in the future, feel free to ask. I'm here to help. In the meantime, I'd be happy to answer any questions you might have about the content I've already provided. Just let me know which article or blog post you're referring to, and I'll do my best to provide you with accurate and up-to-date information. Thanks for reading, and I look forward to (25:06) helping you with any questions you may have." While this is a fine response, there are a few issues with it. One, it's very long; I would never respond to a YouTube comment like this. Two, it kind of repeats itself: "glad you found it helpful," "feel free to ask," then "happy to answer questions," "happy to provide you with accurate and up-to-date information," and "look forward to helping you with questions," saying the same thing in different words a few (25:32) times. And finally, it says "thanks for reading," but if this is for YouTube comments, people aren't reading this stuff, they're watching videos. One thing we can do to improve model performance is so-called prompt engineering. I actually have an in-depth guide on prompt engineering, where I talk about seven tricks to improve your prompts, in a previous video of this series, so feel free to check that out if you're interested.
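Pulling the loading and base-model generation steps described above into one sketch; the exact repo id is an assumption (a GPTQ quantization of Mistral-7B-Instruct from TheBloke's Hugging Face page), and the generation length cap is just a choice for the sketch:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"   # assumed repo id

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",        # let transformers split the load across GPU/CPU
    trust_remote_code=False,  # don't run custom model code from the Hub
    revision="main",          # the main version of the model at this repo
)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

model.eval()   # evaluation mode deactivates the dropout modules

comment = "Great content, thank you!"
prompt = f"[INST] {comment} [/INST]"   # Mistral-instruct special tokens

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    input_ids=inputs["input_ids"].to(model.device),
    max_new_tokens=140,       # length cap chosen for this sketch
)
print(tokenizer.decode(outputs[0]))
```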
(26:00) The prompt I ended up using here is something I arrived at through trial and error, using a website called together.ai, which I can link in the description below. Essentially, together.ai has a chat interface, kind of like ChatGPT, but for a lot of open-source models, including Mistral-7B-Instruct v0.2, so I was able to test a lot of prompt ideas, get feedback, and eyeball which gave the best performance. I ended up with this set of instructions: "ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible (26:34) language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature 'ShawGPT'. ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging." Then I have the instruction "Please respond to the following comment," and a lambda function that, given a comment, pieces together the instruction string and the comment (27:07) within the instruction special tokens the model expects. With that, I can pass the comment into the prompt template and generate a new prompt: you can see the instruction special tokens, the well-formatted instructions, "please respond to the following comment," and then "Great content, thank you!"

Now, using this new prompt with the set of instructions instead of passing the comment directly to the model, the response is: "Thank you (27:36) for your kind words! I'm glad you found the content helpful. ShawGPT." This is really good; it's already pretty close to how I typically respond to YouTube comments, and a lot of them tend to be something like this. It also appropriately signed off as ShawGPT, so people know it came from an AI and not from me personally. Maybe we could just call it here and start using this as the comment responder, but let's see how we can use QLoRA to improve this model even further with (28:05) fine-tuning.

To do that, we need to prepare the model for training. We put it from eval mode into training mode and enable gradient checkpointing, which isn't something I talked about and isn't really part of the QLoRA technique because it's pretty standard: it's a memory-saving technique that clears specific activations and recomputes them during the backward pass of the model. Then we need to enable quantized training: the base model parameters are (28:32) in 4-bit and we're going to freeze them, but we still want to do training in higher precision with LoRA, so we need to enable this quantized-training option. Next we set up LoRA using the LoraConfig. I talk more about LoRA in the fine-tuning video, so just briefly: we set the rank to 8, set alpha to 32, target the query modules in the model, set the dropout to 0.05, use (29:01) no bias values, and set the task to causal language modeling. With the config, we pass the model and the config into the get_peft_model method, which creates a LoRA-trainable version of the model.
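Roughly what that preparation and LoRA setup looks like, continuing from the model loaded earlier and mirroring the settings described above:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model.train()                                    # back into training mode
model.gradient_checkpointing_enable()            # trade extra compute for less memory
model = prepare_model_for_kbit_training(model)   # enable training on a k-bit base model

config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj"],   # target the query modules
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
```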
We can then print the number of trainable parameters, and doing that we see a significant savings: less than 1% of the original number of trainable parameters. One point of confusion for me personally is that it shows Mistral-7B-Instruct as having (29:33) 264 million parameters here. Based on the quick research I did, it seems like quantization can drop some terms, but honestly I don't fully understand why we went from 7 billion parameters to just 264 million, so if anyone knows, please drop it in the comments; I'm very curious. The main point is that we're only using about 0.8% of the original number of trainable parameters, so huge memory savings with LoRA.

(29:58) Next we load the dataset, which is freely available on the Hugging Face Hub and called shawgpt-youtube-comments; the code to generate the dataset is available in the GitHub repo if you're curious about the formatting. Here's an example from the dataset: you'll see the special tokens, the start and end strings, the start-instruction and end-instruction tokens, and then the same (30:26) set of instructions as before, followed by the comment, which is a real comment from the YouTube channel. After the instruction string comes the actual response I left to that comment, with the ShawGPT sign-off appended so the model learns the appropriate format and style in which it should respond. The dataset has 59 of these examples, so not a huge dataset at all.

Next we need to preprocess the text, which is very similar to how I did it in the previous fine-tuning video. Basically, we define (30:59) a tokenize function which, if an example is too long, say longer than 512 tokens, truncates it so it's not more than the max length, and then returns the result as NumPy values. We apply this tokenize function to every example in the dataset using the map method: we take our dataset, pass in the tokenize function, and set batched equal to True so it processes examples in batches instead of one by one. The other thing we (31:32) need to handle: truncation takes care of examples that are too long, but when training the model, every example in a batch needs to be the same size so that you can actually do the matrix multiplication. For that we create a data collator. If you have multiple examples of different lengths, say four examples in a batch, the data collator will dynamically pad each example so they all have the same length. For that we need to define a pad token, which I set (32:02) as the end-of-string token, and then I create the data collator, setting masked language modeling equal to False because we're doing so-called causal language modeling, not masked language modeling.
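A simplified sketch of the loading and preprocessing steps; the dataset repo id and the name of its text column are assumptions based on the walkthrough, and the collator handles the padding described above:

```python
from datasets import load_dataset
import transformers

data = load_dataset("shawhin/shawgpt-youtube-comments")   # assumed repo id

def tokenize_function(examples):
    # truncate anything longer than the max length; the collator handles padding
    return tokenizer(examples["example"], truncation=True, max_length=512)

tokenized_data = data.map(tokenize_function, batched=True)

# pad shorter examples in a batch up to a common length, using the end-of-string token
tokenizer.pad_token = tokenizer.eos_token
data_collator = transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
```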
Now we're ready to set up the training process. (32:28) Here we set the hyperparameters: the learning rate, the batch size, the number of epochs, and the output directory for the model. We set weight decay to 0.01. For the logging, evaluation, and save strategies we use every epoch, which means every epoch prints the training loss, we evaluate every epoch and also print the validation-set loss, and we save the model every epoch in case something goes wrong. We load the best model at the end, because maybe the best model was actually at the eighth epoch and it got worse on the ninth epoch or something like that. Gradient (32:54) accumulation is set to four and warmup steps to two; I talk a lot about gradient accumulation and weight decay in the previous video on training a large language model from scratch, so if you're curious about what's going on there, check out that video. Next we set fp16 equal to True, so we use 16-bit values for training, and we enable the paged optimizer by setting optim equal to paged_adamw_8bit; this is ingredient three from before.

(33:26) There are lots of hyperparameters, and of course you could spend your whole life tuning and tweaking them, but once we have that we can run the training job. We initialize the trainer and give it the model, the training dataset, the validation dataset, the training arguments we defined on the previous slide, and the data collator. We also silence warnings; this is what I saw in an example from Hugging Face when they introduced bitsandbytes, so I did the same here. Then we run the training process. It took about (33:53) 10 minutes to run on Google Colab, so it's actually pretty quick, and the training loss and validation loss get printed. We can see a smooth, monotonic decrease of both, implying stable training, which is good.
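For reference, the training setup described above might look roughly like this; the learning rate, batch size, epoch count, output directory, and split names are placeholders, while the strategies, weight decay, gradient accumulation, warmup, fp16, and paged optimizer follow the walkthrough:

```python
import transformers

training_args = transformers.TrainingArguments(
    output_dir="shawgpt-ft",         # placeholder
    learning_rate=2e-4,              # placeholder
    per_device_train_batch_size=4,   # placeholder
    num_train_epochs=10,             # placeholder
    weight_decay=0.01,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    fp16=True,
    optim="paged_adamw_8bit",        # ingredient three: the paged optimizer
)

trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_data["train"],   # split names assumed
    eval_dataset=tokenized_data["test"],
    args=training_args,
    data_collator=data_collator,
)

model.config.use_cache = False   # silence warnings, per the Hugging Face example
trainer.train()
```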
Once it's all said and done, we have our model and we can use it. If we pass in that same test comment, "Great content, thank you!", we get the response "Glad you enjoyed it! ShawGPT", and it even adds a disclaimer: "Note: I am an AI language model, I don't have the ability to feel emotions (34:22) or watch videos. I'm here to answer questions and provide explanations." This is good; I feel like this is exactly how I would respond to this comment. If I wanted to remove the disclaimer, I could easily do that with some string manipulation, just keeping the text before the sign-off or something like that, but the point is that the fine-tuning process, at least from this one example, seemed to work pretty nicely.

Let's try a different comment, something more technical like "What is fat-tailedness?" The response is (34:50) actually similar to what we saw in the previous video, when we fine-tuned an OpenAI model and asked it the same question: it gives a good, concise explanation of fat-tailedness. The only issue is that it doesn't explain fat-tailedness the same way I explain it in my video series on the topic. This brings up one of the limitations of fine-tuning: it's great for capturing style, but it's not always an optimal way to incorporate specialized knowledge into model responses, which (35:22) brings us to what's next. Instead of trying to give the model even more examples to include this specialized knowledge, a simpler approach is to improve the model's responses to these types of technical questions by providing it with specialized domain knowledge. The way we can do that is with a so-called RAG system, which stands for retrieval-augmented generation. Right now we just take the comment, pass it into the model with the appropriate prompt, and it spits out a response. The difference with a RAG (35:53) system is that we take the comment, use it to extract relevant information from a knowledge base, and then incorporate that information into the prompt we pass to the model so that it can generate a response. That's going to be the focus of the next video in this series, where we'll see how we can improve ShawGPT using specialized knowledge from my Medium blog articles.

Speaking of Medium blog articles, if you enjoyed this video but want to learn more, check out the article (36:22) published in Towards Data Science on QLoRA, where I cover details I might have missed here. Even though it's a member-only story, you can access it completely for free using the friend link in the description below. A few other things I'll point out: the code example is available for free on Colab, there's more code available on the GitHub repo, and again the model and dataset are available on Hugging Face. As always, thank you so much for your time, and thanks for watching.

This post is licensed under CC BY 4.0 by the author.