
Variator - Accelerating Pre-trained Models with Plug-and-Play Compression Modules

If you're a CXO, founder, or investor, follow me on LinkedIn & Twitter, or join my newsletter on my website here. I share the latest simplified AI research and tactical advice on building AI products.





Practical Uses

  1. Models like ChatGPT and DALL·E could be deployed onto mobile devices (even IoT and smart-home appliances), enabling a far better experience for end users.

  2. Startups can build a true real-world personal assistant by embedding these models into all smart-home appliances and mobile devices. By being present everywhere, a seamless conversational experience becomes possible.

  3. This will make it possible to embed these models in operating systems (desktop, mobile, IoT), leading to AI-first OS experiences. Subsequently, a new wave of AI-first applications (and startups) will become possible.


Pre-requisite definitions


Model Compression


Reducing the size of a neural network without compromising accuracy. This reduction is needed because bigger models are difficult to deploy on resource-constrained devices.



Problem


Deep learning models that perform well are large. The bigger the model, the more storage space it needs, making it difficult to deploy on resource-constrained devices. A bigger model also means higher inference latency and more energy consumption during inference. Due to these drawbacks, such models cannot be used in many real-world applications.


Strategies to reduce costs while maintaining performance are needed.



Solution Proposed




A novel plug-and-play acceleration framework named Variator is proposed. Variator accelerates pre-trained language models (PLMs) by devising compression plugins, which can be inserted into PLMs to improve inference speed.


Different plugins entail different acceleration ratios, and the system can dynamically choose the appropriate one to trade off response speed against model performance depending on the current workload.
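The workload-based selection logic can be sketched as follows. This is a hypothetical illustration, not the paper's implementation; the plugin names and queries-per-second thresholds are made-up assumptions purely to show the trade-off.

```python
# Hypothetical sketch of workload-based plugin selection.
# Plugin names and thresholds below are illustrative, not from the paper.
def select_plugin(queries_per_second: float) -> str:
    """Pick a compression plugin to trade response speed against accuracy."""
    # Higher workload -> more aggressive compression (faster, less accurate).
    if queries_per_second < 10:
        return "no_compression"   # plenty of capacity, keep full accuracy
    elif queries_per_second < 100:
        return "compress_2x"      # moderate speedup
    else:
        return "compress_4x"      # aggressive speedup under heavy load

print(select_plugin(5))    # no_compression
print(select_plugin(500))  # compress_4x
```

Because the plugins are plug-and-play, swapping one for another at serving time does not require reloading or retraining the frozen PLM.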


Variator requires only plugins with a minimal number of parameters and freezes the original parameters of the PLM, substantially lowering memory and storage requirements.
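A back-of-the-envelope comparison shows why this matters for storage. All sizes below are assumed, illustrative numbers (roughly BERT-base scale), not figures from the paper.

```python
# Illustrative storage arithmetic; all parameter counts are assumptions.
plm_params = 110e6     # a BERT-base-sized PLM (assumed)
plugin_params = 3e6    # one lightweight compression plugin (assumed)
num_ratios = 3         # plugins for three different acceleration ratios

# Traditional compression: a distinct compressed model per ratio
# (assume each compressed copy is half the PLM's size).
traditional = num_ratios * 0.5 * plm_params

# Variator-style: one frozen PLM shared across all ratios, plus tiny plugins.
variator = plm_params + num_ratios * plugin_params

print(f"traditional: {traditional / 1e6:.0f}M params")  # 165M
print(f"variator:    {variator / 1e6:.0f}M params")     # 119M
```

The gap widens as more tasks and acceleration ratios are supported, since each extra configuration costs only one small plugin rather than a full compressed model.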


These compression plugins consist of hidden compression layers and hidden decompression layers.


Hidden compression layers compress multiple (redundant) hidden vectors into one, thereby reducing the sequence length and accelerating the model. Decompression layers recover the shortened sequence to its original length, thus preserving token-level information.
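The shape arithmetic behind compress-then-decompress can be sketched with NumPy. In the paper these mappings are learned layers; here fixed mean-pooling and repetition are used purely to show how the sequence length shrinks and is restored.

```python
import numpy as np

# Illustrative sketch (not the paper's learned layers): merge neighbouring
# hidden vectors to shorten the sequence, then expand back to full length.

def compress(hidden: np.ndarray, ratio: int = 2) -> np.ndarray:
    """Merge every `ratio` consecutive hidden vectors into one (mean-pool)."""
    seq_len, dim = hidden.shape
    assert seq_len % ratio == 0, "in practice the sequence would be padded"
    return hidden.reshape(seq_len // ratio, ratio, dim).mean(axis=1)

def decompress(compressed: np.ndarray, ratio: int = 2) -> np.ndarray:
    """Recover the original sequence length by repeating each vector."""
    return np.repeat(compressed, ratio, axis=0)

hidden = np.random.randn(8, 16)   # 8 tokens, hidden size 16
short = compress(hidden)          # shape (4, 16): half the length
restored = decompress(short)      # shape (8, 16): original length restored
print(short.shape, restored.shape)
```

The intermediate Transformer layers between the two plugins then operate on the shorter sequence, which is where the compute savings come from.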


Compression plugins can be applied in any layer of PLMs, enabling various levels of acceleration.


These plugins are trained using a two-step strategy. First, compression plugins are trained on the pre-trained PLM with a pre-training corpus. Then the plugins trained in the first step are used as initialization for task-specific models.


In both steps, knowledge distillation objectives are applied so that the compression plugins learn not to alter the hidden vectors produced by the PLM.
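The distillation idea reduces to matching the plugin-equipped model's hidden states against the frozen PLM's original ones. This sketch uses a mean-squared-error objective as an illustrative stand-in for the paper's distillation loss.

```python
import numpy as np

# Hedged sketch: penalize the plugin for changing the PLM's hidden vectors.
# MSE is used here as an illustrative distillation objective.

def distillation_loss(teacher_hidden: np.ndarray,
                      student_hidden: np.ndarray) -> float:
    """MSE between the frozen PLM's hidden vectors (teacher) and the
    decompressed hidden vectors produced with plugins inserted (student)."""
    return float(np.mean((teacher_hidden - student_hidden) ** 2))

teacher = np.ones((8, 16))        # original hidden states (8 tokens)
student = np.ones((8, 16)) * 0.9  # plugin output, slightly off
print(round(distillation_loss(teacher, student), 4))  # 0.01
```

Minimizing this loss keeps the plugins transparent: the rest of the PLM sees hidden states close to what it would have produced without compression, so the frozen weights never need retraining.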



How this solution is different


Current methods (pruning, quantization, knowledge distillation, low-rank factorization) compress PLMs into fixed smaller sizes and do not fulfill the following requirements, which this architecture satisfies:

  1. Dynamic Workload - In real-world scenarios, the system workload varies dynamically over time, while the computational resources are fixed. We can use more resources for higher performance when the workload is low, and vice versa to ensure response efficiency when the workload is high.

  2. Storage Efficiency - These methods typically depend on a large number of additional parameters to construct compressed models, which require substantial memory for training and storage across various tasks and acceleration ratios.


Comparison Evaluation


Comparison results between Variator and baseline models. Here Avg. refers to the average score over seven datasets; Para. and FLOPs refer to the ratio of additional parameters and floating-point operations required by the compression methods.



