Large language models (LLMs) are widely used for tasks like translation, text classification, and customer service, but their power consumption and slow speeds have made them impractical for use on phones and laptops.
Now, researchers at Princeton and Stanford Engineering have made LLMs leaner by trimming redundancies and reducing the precision of their layers of information. When stored and run locally on a device such as a phone or laptop, the compressed models are nearly as accurate and nuanced as the uncompressed versions. The new compression algorithm could improve privacy, save energy, and lower costs.
The Compression Technique
The researchers initially proposed a general-purpose algorithm for compressing large datasets or matrices. They soon realized, however, that it is not only datasets that are large: the models being deployed are also becoming increasingly massive. So they applied the algorithm to compress the models themselves, combining two properties:
- Low-Precision Representation: Reduces the number of bits used to store data, speeding up storage and processing while improving energy efficiency.
- Low-Rank Representation: Reduces redundancies in the LLM weight matrices.
By using both techniques together, the team achieved much greater compression than either method could provide on its own, as the sketch below illustrates.
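As a rough illustration of how the two ideas combine, the sketch below approximates a weight matrix as a low-rank part (from a truncated SVD) plus a uniformly quantized residual, and compares the reconstruction error against quantization alone. This is a simplified, hypothetical construction for intuition only, not the algorithm from the paper; the matrix sizes, rank, and bit widths are arbitrary assumptions.

```python
import numpy as np

def quantize(mat, num_bits=4):
    """Uniform quantization: map values onto 2**num_bits levels, then dequantize."""
    levels = 2 ** num_bits - 1
    lo, hi = mat.min(), mat.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    return np.round((mat - lo) / scale) * scale + lo

def low_rank_plus_low_precision(weight, rank=32, num_bits=3):
    """Toy decomposition: top-`rank` SVD directions plus a quantized residual."""
    U, s, Vt = np.linalg.svd(weight, full_matrices=False)
    low_rank = U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank, :]   # low-rank part
    residual_q = quantize(weight - low_rank, num_bits=num_bits)  # low-precision part
    return low_rank + residual_q

# Synthetic weight matrix with low-rank structure plus noise (illustrative only).
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 32)) @ rng.standard_normal((32, 256))
W += 0.1 * rng.standard_normal((256, 256))

combined = low_rank_plus_low_precision(W, rank=32, num_bits=3)
quant_only = quantize(W, num_bits=3)
norm = np.linalg.norm(W)
print("low-rank + low-precision error:", np.linalg.norm(W - combined) / norm)
print("low-precision only error:      ", np.linalg.norm(W - quant_only) / norm)
```

In this synthetic example the combined decomposition reconstructs the matrix far more faithfully than 3-bit quantization alone, because the low-rank factors absorb most of the structure and the remaining residual has a much smaller dynamic range to quantize.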
Testing with Llama Models
The team tested their method on Llama 2 and Llama 3, open-source large language models developed by Meta AI, and found that it improved on existing low-precision compression techniques by up to 5%.
Testing used standard LLM benchmark tasks, including determining the logical order of two statements and answering questions that involve physical reasoning. The compressed models maintained high accuracy while requiring far less memory and compute.
Enabling Edge Device Deployment
One of the most promising benefits of this technique is that it lets LLMs run directly on edge devices such as smartphones and laptops. Fine-tuning a compressed LLM locally also enhances privacy, since sensitive data never has to be shared with a third-party provider.
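One way local fine-tuning can stay cheap is to freeze the compressed backbone and train only a small low-rank correction, in the spirit of low-rank adaptation. The PyTorch sketch below is a simplified, hypothetical illustration of that idea; the layer class, shapes, and training step are assumptions for demonstration, not the paper's procedure.

```python
import torch
import torch.nn as nn

class CompressedLinear(nn.Module):
    """Hypothetical layer: frozen (de)quantized backbone plus trainable low-rank factors."""
    def __init__(self, backbone, rank=8):
        super().__init__()
        out_dim, in_dim = backbone.shape
        self.register_buffer("backbone", backbone)            # frozen low-precision weights
        self.L = nn.Parameter(torch.zeros(out_dim, rank))     # trainable low-rank factor
        self.R = nn.Parameter(torch.randn(rank, in_dim) * 0.01)

    def forward(self, x):
        # Effective weight = frozen backbone + small low-rank correction L @ R.
        return x @ (self.backbone + self.L @ self.R).T

# Toy fine-tuning step: only the small factors L and R receive gradients,
# so the update is cheap enough to run on a local device.
layer = CompressedLinear(torch.randn(64, 128))
opt = torch.optim.Adam([layer.L, layer.R], lr=1e-3)
x, target = torch.randn(32, 128), torch.randn(32, 64)
opt.zero_grad()
loss = nn.functional.mse_loss(layer(x), target)
loss.backward()
opt.step()
print("loss:", loss.item())
```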
However, to make this possible, the LLMs must first be compressed enough to fit on consumer-grade GPUs. The team’s work marks a significant step toward achieving this goal.
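To see why bit width matters for fitting a model in consumer hardware, the back-of-the-envelope calculation below estimates the memory needed just to store the weights of a 7-billion-parameter model at different precisions; the model size and bit widths are illustrative assumptions, not figures from the paper.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory (in GB) needed just to store the weights at a given precision."""
    return num_params * bits_per_weight / 8 / 1e9

# Illustrative example: a 7-billion-parameter model at several precisions.
for bits in (16, 8, 4, 2.5):
    print(f"{bits:>4} bits/weight -> {weight_memory_gb(7e9, bits):.1f} GB")
```

At 16 bits per weight, the weights alone occupy roughly 14 GB, more than the memory of a typical consumer GPU; at a few bits per weight plus small low-rank factors, the same model shrinks to a few gigabytes.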
The paper, “Compressing Large Language Models using Low Rank and Low Precision Decomposition,” will be presented at the Conference on Neural Information Processing Systems (NeurIPS) in December 2024. This breakthrough could open the door to more energy-efficient, cost-effective, and privacy-conscious AI applications on consumer devices.
For more details, visit: Leaner large language models could enable efficient local use on phones and laptops – Princeton Engineering