MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
Abstract
This paper addresses the growing need for efficient large language models (LLMs) on mobile devices, driven by increasing cloud costs and latency concerns. We focus on designing top-quality LLMs with fewer than a billion parameters, a practical choice for mobile deployment. Contrary to the prevailing belief that data and parameter quantity are what determine model quality, our investigation underscores the significance of model architecture for sub-billion-scale LLMs. Leveraging deep and thin architectures, coupled with embedding sharing and grouped-query attention, we establish a strong baseline network, denoted MobileLLM, which attains a remarkable 2.7%/4.3% accuracy boost over the preceding 125M/350M state-of-the-art models. Additionally, we propose an immediate block-wise weight-sharing approach with no increase in model size and only marginal latency overhead. The resulting models, denoted MobileLLM-LS, demonstrate a further accuracy improvement of 0.7%/0.8% over MobileLLM 125M/350M. Moreover, the MobileLLM model family shows significant improvements over previous sub-billion models on chat benchmarks, and comes close to LLaMA-v2 7B in API-calling tasks, highlighting the capability of small models for common on-device use cases.
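To make the "immediate block-wise weight sharing" idea concrete, here is a minimal PyTorch sketch (not the authors' implementation): each transformer block is executed twice in a row, so the effective depth doubles while the parameter count stays unchanged. The toy block and its dimensions are illustrative assumptions.

```python
# Minimal sketch of immediate block-wise weight sharing (illustrative, not the paper's code):
# each block is run `repeats` times back-to-back, so weights can stay in cache while the
# effective number of layers grows without adding parameters.
import torch
import torch.nn as nn


class ToyBlock(nn.Module):
    """Stand-in for a transformer block; causal masking and rotary embeddings omitted for brevity."""

    def __init__(self, dim: int = 576, n_heads: int = 9):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x


class SharedStack(nn.Module):
    """Runs each unique block `repeats` times in immediate succession (weights shared)."""

    def __init__(self, n_unique_blocks: int = 15, repeats: int = 2, dim: int = 576):
        super().__init__()
        self.blocks = nn.ModuleList(ToyBlock(dim) for _ in range(n_unique_blocks))
        self.repeats = repeats

    def forward(self, x):
        for block in self.blocks:
            for _ in range(self.repeats):  # immediate repetition of the same weights
                x = block(x)
        return x


x = torch.randn(1, 16, 576)
print(SharedStack()(x).shape)  # torch.Size([1, 16, 576])
```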
Community
It would be interesting to see a comparison to small encoder-decoder models like instructionRoBERTa or flan-T5.
As someone who is GPU-poor, I find this paper interesting and I am excited to try the models out.
My questions are:
Have you considered knowledge-distilling the Phi-2 2.7B model into the smaller 350M model?
How do the design changes affect the in-context learning ability of these models?
Do existing tool chains such as PEFT and LoRA, and optimization techniques like AWQ, EXL2, and GPTQ, work on these models?
Why not distill from a larger model?
The model weights are now publicly available: https://huggingface.co/collections/facebook/mobilellm-6722be18cb86c20ebe113e95
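For anyone who wants to try the released checkpoints, here is a minimal loading sketch with transformers. The repo id facebook/MobileLLM-125M and the trust_remote_code flag are assumptions inferred from the linked collection, so check the model card before running.

```python
# Hedged example: loading one of the released checkpoints with transformers.
# The repo id and the trust_remote_code flag are assumptions; verify them against
# the model card in the linked collection before running.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "facebook/MobileLLM-125M"  # assumed repo id from the collection
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

inputs = tokenizer("Sub-billion parameter models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```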
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Head-wise Shareable Attention for Large Language Models (2024)
- Why Lift so Heavy? Slimming Large Language Models by Cutting Off the Layers (2024)
- Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs (2024)
- BGE Landmark Embedding: A Chunking-Free Embedding Method For Retrieval Augmented Long-Context Large Language Models (2024)
- Rethinking Optimization and Architecture for Tiny Language Models (2024)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot recommend
If it can be downloaded, I would like to test it on my device.
Looking forward to trying it! Layer sharing saves only memory, not computation, so here is a thought on combining it with LoRA: fine-tune the shared layers with a low-rank update. Then every layer gets different effective weights while adding only a small number of parameters.
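A minimal sketch of that idea, under the assumption of a single shared linear sublayer: the base weight is reused at every layer position, and each position only adds its own small LoRA factors, so the extra cost per position is just 2·rank·dim parameters. The module and dimensions below are illustrative, not from the paper.

```python
# Sketch of "shared weights + per-position LoRA": one frozen base projection is reused
# at every layer position, while a small low-rank update makes each position distinct.
import torch
import torch.nn as nn


class SharedLinearWithLoRA(nn.Module):
    def __init__(self, dim: int = 576, n_positions: int = 4, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(dim, dim, bias=False)   # shared across all layer positions
        self.base.weight.requires_grad_(False)        # freeze the shared weight, LoRA-style
        # B is randomly initialised here so the demo produces distinct outputs;
        # standard LoRA training starts B at zero.
        self.lora_A = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(rank, dim)) for _ in range(n_positions)])
        self.lora_B = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(dim, rank)) for _ in range(n_positions)])

    def forward(self, x: torch.Tensor, position: int) -> torch.Tensor:
        delta = x @ self.lora_A[position].T @ self.lora_B[position].T
        return self.base(x) + delta


layer = SharedLinearWithLoRA()
x = torch.randn(2, 16, 576)
y0 = layer(x, position=0)  # shared base weight + low-rank update for position 0
y1 = layer(x, position=1)  # same base weight, different low-rank update
print(y0.shape, torch.allclose(y0, y1))  # torch.Size([2, 16, 576]) False
```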
Interesting. If the findings hold true for all small LLMs, then it is very possible to cut down encoder-decoder model size by applying layer sharing to the decoder part of the model. Model size has always been an issue for encoder-decoder models.
Could someone reproduce a model config that matches the parameter counts given in the paper, using the reported numbers of layers, heads, key-value heads, and embedding dimension?
I used a Llama config and additionally set tie_word_embeddings=True, but I don't get the same number of parameters. Probably I am missing something? (A rough counting sketch follows below.)
Secondly, the authors didn't mention the pretraining dataset they used. IMHO, controlling for that would be a better setup for measuring the effect of the model parameters.
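As a rough aid for the parameter-count question above, here is a back-of-the-envelope counting sketch for a LLaMA-style decoder with grouped-query attention and tied input/output embeddings. It assumes a SwiGLU feed-forward and RMSNorm without biases; the example numbers are illustrative assumptions, not the paper's exact configuration.

```python
# Rough parameter counting for a LLaMA-style decoder with GQA and tied embeddings.
# The configuration below is assumed for illustration and may differ from the paper
# (norms, biases, head_dim, vocab size, etc.).

def count_params(vocab: int, dim: int, n_layers: int, n_heads: int,
                 n_kv_heads: int, ffn_dim: int) -> int:
    head_dim = dim // n_heads
    # attention: Q projects to dim, K/V project to n_kv_heads * head_dim, O back to dim
    attn = dim * dim + 2 * dim * (n_kv_heads * head_dim) + dim * dim
    # SwiGLU feed-forward: gate, up, and down projections
    ffn = 3 * dim * ffn_dim
    # two RMSNorm weight vectors per block
    norms_per_block = 2 * dim
    block = attn + ffn + norms_per_block
    embeddings = vocab * dim  # tied: counted once (tie_word_embeddings=True)
    return n_layers * block + embeddings + dim  # + final norm


# Illustrative numbers only (assumed, not taken from the paper's tables);
# prints ~124.6M parameters for this configuration.
print(count_params(vocab=32000, dim=576, n_layers=30, n_heads=9,
                   n_kv_heads=3, ffn_dim=1536))
```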
MobileLLM: Revolutionizing Efficient Language Models for Smartphones
Links:
Subscribe: https://www.youtube.com/@Arxflix
Twitter: https://x.com/arxflix
LMNT (Partner): https://lmnt.com/
Good news! The MobileLLM model weights are now publicly available: https://huggingface.co/collections/facebook/mobilellm-6722be18cb86c20ebe113e95
Models citing this paper: 14
Datasets citing this paper: 0