Key points of BLOOM, the largest open-source language model to date, are summarized here.
1) A 176B-Parameter Open-Access Multilingual Language Model.
2) The aim is the public release of a large language model.
3) Pretrained models are popular because they deliver strong results even from small amounts of labelled data.
4) Almost no prior LLM of this scale was publicly released, making BLOOM one of the first models of its size in the public domain.
5) BLOOM is created by BigScience.
6) BLOOM stands for BigScience Large Open-science Open-access Multilingual Language Model.
7) Classical language models assign a joint probability to a sequence of words. The key idea is that plausible sequences of words should receive high probability.
8) Neural language models improve on n-gram models, whose parameter counts grow exponentially with n and which assign no probability at all to tokens unseen during training.
9) A neural language model estimates the probability of the next word given the previous words. Early neural language models used feed-forward networks; recurrent models later became popular, and the Transformer is now the standard architecture for large language models.
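The chain-rule scoring described above can be sketched with a toy bigram model (the vocabulary and counts below are purely illustrative, not from any real corpus):

```python
import math

# Toy bigram counts (illustrative only).
bigram_counts = {
    ("<s>", "the"): 8, ("<s>", "a"): 2,
    ("the", "cat"): 5, ("the", "dog"): 5,
    ("cat", "sat"): 4, ("cat", "ran"): 1,
}
unigram_counts = {"<s>": 10, "the": 10, "cat": 5}

def bigram_prob(prev, word):
    """P(word | prev) estimated from raw counts."""
    return bigram_counts.get((prev, word), 0) / unigram_counts[prev]

def sequence_log_prob(words):
    """log P(w1..wn) = sum_i log P(w_i | w_{i-1}) by the chain rule."""
    total = 0.0
    prev = "<s>"
    for w in words:
        p = bigram_prob(prev, w)
        if p == 0:
            # The classic n-gram failure: unseen pairs get zero probability.
            return float("-inf")
        total += math.log(p)
        prev = w
    return total

print(sequence_log_prob(["the", "cat", "sat"]))   # finite log-probability
print(sequence_log_prob(["the", "cat", "flew"]))  # -inf: pair never "seen"
```

A neural language model replaces the count table with a learned function, so it can assign sensible probabilities even to word combinations never seen in training.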
10) The computational cost of building and running large language models has kept this toolkit out of the hands of most of the scientific community; the BigScience initiative aims to narrow this gap. This also affects the governance of the technology.
11) It is trained on the ROOTS corpus, a collection of 498 Hugging Face datasets.
12) The corpus comprises 1.61 terabytes of text.
13) The data spans 46 natural languages and 13 programming languages.
14) Many groups contributed evaluations for specific domains, such as biomedical text and historical texts.
15) Pre-processing included deduplication of repeated text and removal of personally identifiable information, to respect the privacy of users whose data was included in training.
16) Steps were taken to reduce bias in the data, and prompt-based techniques were used to measure the bias remaining in the trained model.
17) Prompted finetuning was performed. Prompts were collected through hackathons run by BigScience collaborators. The public pool of prompts excludes harmful content and programming languages, and covers tasks such as sentiment analysis and question answering.
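The idea of prompted finetuning can be sketched as a template applied to a labelled record; the template wording and field names below are illustrative, not taken from the actual prompt collection:

```python
# A hypothetical sentiment-analysis prompt template (illustrative only).
TEMPLATE = 'Review: "{text}"\nIs this review positive or negative?'

def apply_template(record):
    """Turn a labelled example into a (prompt, target) text pair
    that a language model can be finetuned on."""
    prompt = TEMPLATE.format(text=record["text"])
    target = "positive" if record["label"] == 1 else "negative"
    return prompt, target

prompt, target = apply_template({"text": "Great movie, loved it.", "label": 1})
print(prompt)
print(target)  # positive
```

Casting many NLP tasks as text-to-text pairs like this is what lets a single model handle them all in a zero-shot or few-shot manner.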
18) The training data spans a wide variety of NLP tasks.
19) The space of possible design combinations is huge; the final design was fixed through targeted experiments. Scaling choices and other hyperparameters were either selected experimentally or taken from established practice, with explicit reasoning given in each case. Strong zero-shot and few-shot performance was also a design goal.
20) For pre-processing, a sound way to convert sentences into tokens had to be devised. Fertility, the average number of tokens produced per word (or per dataset of a given size), was computed to check tokenization quality; very high fertility can indicate poor tokenization.
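The fertility metric can be sketched as follows; the fixed-size chunk splitter is a stand-in for illustration only (a real tokenizer like BLOOM's uses learned BPE merges):

```python
def toy_tokenize(word, chunk=3):
    """Stand-in subword tokenizer: split a word into fixed-size chunks.
    Purely illustrative -- a real tokenizer uses learned merges."""
    return [word[i:i + chunk] for i in range(0, len(word), chunk)]

def fertility(text):
    """Average number of subword tokens produced per word.
    High fertility means the tokenizer fragments the text badly."""
    words = text.split()
    n_tokens = sum(len(toy_tokenize(w)) for w in words)
    return n_tokens / len(words)

print(fertility("the cat sat"))           # 1.0: each short word is one token
print(fertility("internationalization"))  # 7.0: one long word fragments heavily
```

Comparing fertility across languages is a cheap sanity check that the tokenizer treats all 46 languages reasonably, rather than fragmenting some of them into many more tokens than others.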
21) A Byte Pair Encoding (BPE) algorithm is used for tokenization, with no normalization and a specific pre-tokenization regex (given in the paper) that splits text in both natural and programming languages.
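The core BPE merge loop can be sketched on a toy word-frequency dictionary; this is a minimal learning step, not BLOOM's byte-level implementation or its pre-tokenization regex:

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(vocab, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def learn_bpe(words, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent pair."""
    vocab = {tuple(w): f for w, f in words.items()}  # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(vocab, best)
        merges.append(best)
    return merges, vocab

merges, vocab = learn_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=2)
print(merges)  # most frequent pairs merged first: ('l', 'o') then ('lo', 'w')
```

After two merges, the common prefix "low" has already become a single symbol, which is exactly how BPE builds a vocabulary of frequent subwords.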
22) State-of-the-art models with 100B+ parameters at that point relied exclusively on decoder-only Transformers, although some works had tested encoder-decoder and encoder-only models. BLOOM uses a decoder-only model that differs from prior work in two ways: ALiBi positional embeddings, which bias attention scores based on how far each query is from each key instead of adding position vectors, and an extra layer normalization after the embedding layer, which made training more stable.
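ALiBi's distance-based bias can be sketched in plain Python; the head slopes follow the published geometric sequence 2^(-8h/n) for n heads, but this is a sketch of the idea, not BLOOM's actual implementation:

```python
def alibi_slopes(n_heads):
    """Per-head slopes: a geometric sequence 2^(-8h/n) for h = 1..n.
    (Exact for power-of-two head counts, as in the ALiBi paper.)"""
    return [2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_bias(seq_len, slope):
    """Bias added to attention logits: 0 on the diagonal, increasingly
    negative the farther the key is behind the query; future positions
    are marked None here to stand in for the causal mask."""
    return [[slope * (k - q) if k <= q else None
             for k in range(seq_len)]
            for q in range(seq_len)]

slopes = alibi_slopes(8)
print(slopes[0])   # 0.5 for the first head
bias = alibi_bias(4, slopes[0])
print(bias[3])     # [-1.5, -1.0, -0.5, 0.0]: nearer keys are penalized less
```

Because the penalty depends only on relative distance, ALiBi lets the model generalize to sequence lengths longer than those seen during training.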
23) Trained on the Jean Zay supercomputer at IDRIS/CNRS under a compute grant; training BLOOM took about 3.5 months.
24) Six size variants of BLOOM were trained, up to the full 176B-parameter model.
25) Performance was measured on the following tasks:
a) SuperGLUE (one-shot)
b) Machine translation
c) Code generation
d) Bias evaluation, and many more
26) Most results are impressive; detailed numbers are given in the paper referenced below.
Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., … & Manica, M. (2022). BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.