When I attended NAACL, I wanted to do a little test. I had two pitches for my LLM.int8() paper. One pitch is about how I use advanced quantization methods to achieve transformer inference at scale with no performance degradation, which makes large models more accessible. The other pitch talks about emergent outliers in transformers and how they radically change what transformers learn and how they function.

From that, I learned that quantization research is like printers: nobody really cares about printers, but everybody is happy if printers do their job. How that job is done for you through the bitsandbytes library with Hugging Face integration, so that you can easily run OPT-175B and BLOOM-176B on a single machine, is described in another blog post by my colleague and collaborator Younes Belkada.

This blog post will spill some mandatory details about quantization, but I want to mostly make it about these emergent features that I found in transformers at scale. I know the claims in the paper are highly robust; this blog post is a more speculative version of the paper that teases out the super curious details about the fascinating properties surrounding the emergent outlier features I found. I cannot spill all the details, because my next project will delve deep into understanding these outlier features, but the space is so rich that I am happy to give you many curious details.

Mandatory quantization details

In a previous version of this blog post, I jokingly had a section with the big title "All You Ever Wanted to Know about Quantization". The section read: "If you quantize from 16-bit to 8-bit, you lose precision which might degrade model prediction quality." Most people do not want to learn more about quantization - and honestly, the small sentence above is already enough information. The details are very gritty and complicated, but it is all in the code. The math and concepts are very simple and straightforward - if you have worked on quantization before. If you have not encountered quantization, it is likely a hot devilish nightmare that will eat your liver. For those that say, "Pfff! Why do I need a liver anyways?", this section is for you. For others, just move ahead and read about the mysteries of emergent features.

Let us say you have a data type, I5, and a data type, I3. How do you quantize from data type I5 to I3? You follow a two-step procedure: you normalize the input into the range of the target data type, using the absolute maximum value as the normalization constant, and then you round to the nearest value of the target data type.

Let's say we have a vector in I5, and we want to quantize it to I3. Here is the step-by-step recipe for quantization:

1. We find the absolute maximum value of the vector; here it is 3.
2. We divide the vector by this absolute maximum and multiply by the range of the target data type I3, which is 4.
3. Now we round to the nearest value.

To dequantize, we reverse these steps: we divide by the range of I3 and multiply by the absolute maximum again. Comparing the result with the original vector, we see that our quantization and dequantization led to one error. This is a quantization error that leads to the loss of information in terms of how precisely the information is encoded. If we propagate such errors through many layers of a neural network, they accumulate, and they may change the result of a prediction and degrade the prediction quality.
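To make this recipe concrete, here is a minimal sketch in Python. The concrete vector from the original example did not survive this post, so the input vector below is my own illustrative choice, picked so that the round trip produces exactly one error (its absolute maximum, 7, differs from the 3 in the walkthrough above); likewise, the assumption that I5 covers 0 to 8 and I3 covers 0 to 4 is mine. Only the absmax recipe itself and the target range of 4 come from the text.

```python
# Minimal absmax quantization sketch between two toy integer data types:
# I5 (assumed to hold the values 0..8) and I3 (assumed to hold 0..4,
# so the range of the target data type is 4, as in the recipe above).

I3_RANGE = 4  # largest value representable in the toy I3 data type

def quantize(vec, target_range=I3_RANGE):
    # Step 1: find the absolute maximum (the normalization constant).
    absmax = max(abs(v) for v in vec)
    # Steps 2+3: scale into the target range and round to the nearest value.
    return [round(v / absmax * target_range) for v in vec], absmax

def dequantize(qvec, absmax, target_range=I3_RANGE):
    # Reverse the scaling and round back to the nearest I5 value.
    return [round(q / target_range * absmax) for q in qvec]

x = [1, 2, 7]                    # hypothetical input vector in I5
q, absmax = quantize(x)          # q == [1, 1, 4], absmax == 7
x_hat = dequantize(q, absmax)    # x_hat == [2, 2, 7]

errors = sum(a != b for a, b in zip(x, x_hat))
print(q, x_hat, errors)          # the round trip produces exactly one error
```

The values 1 and 2 in I5 collapse onto the same I3 value, so one of them cannot be recovered: exactly the loss of information described above.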
How to make quantization methods more precise

Quantization can be enhanced in two ways: use a better data type, or use more normalization constants (absolute maximum values). Regarding data types, Int8 is a terrible data type for deep learning. That is why I developed new data types in my research. However, GPUs currently do not support data types other than Int8 at the hardware level, and as such, we are out of luck and need to use Int8. So the only way to improve quantization is through more normalization constants.

A normalization constant squishes the input distribution, for example, I5, into the target distribution, for example, I3. We can increase precision by squishing each vector only as much as is needed. For example, if you have two vectors, you can squish the first by 4 and the second by 2. This gives you twice the precision to quantize the second vector, because its inputs are now spread over a broader range of the I3 data type. In fact, the second vector can be quantized without errors if you use an additional absolute maximum value. If you use only a single constant over both vectors (tensor-wise constants), then you will have two errors, as the sketch below shows.
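Here is a minimal sketch of that tensor-wise versus vector-wise contrast, under the same assumptions as the previous snippet. The two vectors are again my own hypothetical values, since the original pair was lost: the second vector's absolute maximum is half that of the first, which is what gives it twice the precision when it gets its own constant.

```python
# Contrast one shared (tensor-wise) normalization constant with
# per-vector (vector-wise) constants. Hypothetical vectors; the
# target range of 4 (I3) matches the recipe above.

TARGET_RANGE = 4

def round_trip_errors(vec, absmax, target_range=TARGET_RANGE):
    # Quantize with the given normalization constant, dequantize,
    # and count how many entries come back changed.
    q = [round(v / absmax * target_range) for v in vec]
    deq = [round(s / target_range * absmax) for s in q]
    return sum(a != b for a, b in zip(vec, deq))

v1 = [3, 6, 8]  # absolute maximum 8
v2 = [1, 2, 4]  # absolute maximum 4: twice the precision with its own constant

# Tensor-wise: a single constant shared by both vectors, absmax = 8.
shared = max(max(v1), max(v2))
tensor_wise = round_trip_errors(v1, shared) + round_trip_errors(v2, shared)

# Vector-wise: each vector is squished only as much as needed.
vector_wise = round_trip_errors(v1, max(v1)) + round_trip_errors(v2, max(v2))

print(tensor_wise)   # 2 errors in total
print(vector_wise)   # 1 error; the second vector round-trips without errors
```

With the shared constant, the second vector only ever reaches half of the I3 range, so neighboring values collapse onto the same quantized value; with its own constant, it spreads over the full range and round-trips exactly.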