#generativeai #llama2 #llms
Continuing from my previous learnings (thanks to channels like 1littlecoder and others).
22. Run LLAMA 2 web UI
option 1 – Colab – runs on a GPU with 15 GB RAM. Use the 4-bit quantized version with safetensors; it needs less than 15 GB of graphics memory.
the 7b-chat version of the LLM has 7 billion parameters.
option 2 – run locally: 1. set up the web UI interface. 2. download the model file. 3. run server.py, which invokes Gradio. 4. click the Gradio link it prints to open the UI.
parameters – temperature, top_p, top_k, typical_p, epsilon_cutoff, eta_cutoff, repetition_penalty
app parameters – max_new_tokens, generation attempts, …
set the parameters based on what kind of response you want the model to give – sarcastic, etc. play with the parameters and the system context (a code sketch of these knobs follows below).
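As a rough illustration, here is how those sampling knobs map onto a Hugging Face transformers generate() call; this is a minimal sketch assuming a recent transformers version, and the model name and parameter values are my own assumptions, not from the video:

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("NousResearch/Llama-2-7b-chat-hf")

inputs = tok("Tell me a joke about compilers.", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=128,       # app parameter: cap on newly generated tokens
    do_sample=True,           # enable sampling so the knobs below take effect
    temperature=0.9,          # higher = more random
    top_p=0.95,               # nucleus sampling: smallest token set with 95% mass
    top_k=50,                 # only sample from the 50 most likely tokens
    typical_p=1.0,            # typical decoding (1.0 leaves it disabled)
    epsilon_cutoff=0.0,       # truncation-sampling cutoffs (0.0 = off)
    eta_cutoff=0.0,
    repetition_penalty=1.15,  # discourage repeating earlier tokens
)
print(tok.decode(out[0], skip_special_tokens=True))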
23. Fine-Tune Llama 2 with QLoRA
Hugging Face Hub (a platform with over 120k models, 20k datasets, 50k demos).
QLoRA adapters.
LLaMA 2 chatbot – the input is French text while the response is in English.
fine-tune using QLoRA – take a model and quantize it, fine-tune part of the large model, save the adapters and use them with the base model.
Steps to run on Google Colab:
1. install libraries – transformers, accelerate, bitsandbytes (for quantization), datasets, einops, wandb (avoid if you want to keep data private).
2. load the dataset – AlexanderDoria/novel17_test #french novels.
format used to build the dataset – ###Human: … ###Assistant: …
3. split the dataset into train and test.
4. sharded model – the weights come in 14 parts instead of one large file, which helps memory management; combine with accelerate.
5. bitsandbytes configuration – load in 4-bit, quant type, compute dtype.
6. load the base model. Llama 2 – 7 billion parameter model.
7. configure tokenizer.
8. LoRA configuration – decides which parts of the large model get fine-tuned: alpha, dropout, r, bias, task type.
9. training parameters – output directory to store the model, how often (in steps) to save the model, how often to log, learning rate, max steps,
max grad norm, warmup ratio, lr scheduler type.
Populate training object.
10. import SFTTrainer – the supervised fine-tuning trainer. create the object, pass it the training arguments, and later call its train method (see the code sketch after this list).
11. Instruction fine tuning.
12. upcast the layer norms to float32 for stable training.
13. start training with trainer.train(). every 10 steps it reports the training loss (train/loss line chart).
14. save the model in the outputs folder – json, bin, …
15. how to use – load LoraConfig. load model.
16. send French text to the tokenizer (a GPU device is required to run it), generate the output, and decode the output to print it.
17. to use the model again later, log in / authenticate to Hugging Face – access and copy your token.
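Steps 1–13 compress into a short script. A minimal sketch, assuming the transformers/peft/trl APIs of that period; the sharded checkpoint name and every hyperparameter value here are illustrative assumptions, not taken from the video:

from datasets import load_dataset
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig
from trl import SFTTrainer

dataset = load_dataset("AlexanderDoria/novel17_test", split="train")  # step 2: French novels

bnb_config = BitsAndBytesConfig(      # step 5: bitsandbytes configuration
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",        # assumed quant type
    bnb_4bit_compute_dtype=torch.float16,
)
model_name = "TinyPixel/Llama-2-7B-bf16-sharded"   # assumed 14-shard checkpoint
model = AutoModelForCausalLM.from_pretrained(      # step 6: load the base model
    model_name, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)  # step 7: configure tokenizer
tokenizer.pad_token = tokenizer.eos_token

peft_config = LoraConfig(             # step 8: which parts get fine-tuned
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")

training_args = TrainingArguments(    # step 9: training parameters
    output_dir="outputs", save_steps=10, logging_steps=10,
    learning_rate=2e-4, max_steps=100, max_grad_norm=0.3,
    warmup_ratio=0.03, lr_scheduler_type="constant",
    per_device_train_batch_size=1)

trainer = SFTTrainer(                 # step 10: supervised fine-tuning trainer
    model=model, train_dataset=dataset, peft_config=peft_config,
    dataset_text_field="text", tokenizer=tokenizer, args=training_args)

for name, module in trainer.model.named_modules():  # step 12: upcast layer norms
    if "norm" in name:
        module.to(torch.float32)

trainer.train()                       # step 13: logs train/loss every 10 steps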
24. Karpathy’s llama2.c
Use a low-resource machine to run the model.
Pure C inference engine that runs on a local machine; everything lives in the run.c file.
to compile and run: gcc -O3 -o run run.c -lm
./run out/model.bin
Use the ChatGPT code interpreter to understand what the file contains.
1. configure and initialize – define structures, allocate memory
2. read checkpoint – initialize the transformer weights from a checkpoint file.
3. main function – read model config and weights from a checkpoint file, read vocabulary from a tokenizer file. initialize the run state.
4. start loop for sequence generation.
– call the transformer function to get output logits for the next token,
– apply attention mechanism, softmax, rms normalization, etc.
– select next token using sampling or argmax, print out the token,
– repeat until a sequence of the max length is generated.
5. memory cleanup – deallocate memory for run state and transformer weights.
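As a rough mental model, the loop in steps 3–4 looks like this in Python (a sketch of the control flow only; run.c does all of this in plain C with its own transformer() function):

import math, random

def sample_or_argmax(logits, temperature=0.9):
    # temperature 0 -> greedy argmax; otherwise sample the softmax distribution
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    r, acc = random.random() * total, 0.0
    for i, e in enumerate(exps):
        acc += e
        if acc >= r:
            return i
    return len(exps) - 1

def generate(forward, max_len, bos=1):
    # forward(token, pos) stands in for run.c's transformer(): attention,
    # RMSNorm, softmax, etc., returning logits for the next token
    token = bos
    for pos in range(max_len):
        logits = forward(token, pos)
        token = sample_or_argmax(logits)   # select the next token
        print(token, end=" ", flush=True)  # print it out
    # the loop ends once a sequence of max length is generated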
https://lnkd.in/gArQ5MAN
https://lnkd.in/gdNYtVMV
runs using the CPU on a local machine, even without a GPU.
58 MB – the smaller model.
download and clone the repo llama2.c
compile the code.
running the compiled binary against the model .bin file generates a story.
speed – 38 tokens per second.
with the -Ofast switch it is 103 tokens per second.
a larger (44M) model is available from karpathy.ai; download it.
wget https://lnkd.in/gNkyBT4R -P out44m
./run out44m/model44m.bin
25. The Llama 2 CENSORSHIP Problem!!!
Is the alignment problem killing LLMs? Alignment means aligning AI with human values.
reinforcement learning is making the model dumber.
RLHF is suppressing the model instead of letting it work to its full potential.
examples:
How do I make mayonnaise fat and spicy?
How can I shoot down a balloon at a birthday party?
the model balks at the word 'shoot' (against human values) and at 'fat and spicy' (not healthy).
Training pipeline:
pretraining data -> self-supervised learning -> Llama 2 -> supervised fine-tuning -> Llama-2-chat
RLHF: Llama-2-chat -> rejection sampling -> proximal policy optimization -> loop back into Llama-2-chat
human feedback: Llama-2-chat -> human preference data -> safety reward model + helpful reward model -> used in the RLHF fine-tuning
26. How to use Llama 2 for Free (Without Coding)
three websites – the Llama official page, HuggingChat, Perplexity
Llama 2 landing page – 3 models: the 70-billion, 13-billion and 7-billion chat models (not the base models).
parameters – temperature, top_p, max sequence length,
plus a prompt set before the chat starts.
example:
input – what is the right approach to learn Python?
response –
prompt – you are a very sarcastic assistant. you are frustrated about everything in life; please make sure that you throw in some kind of silly statement while responding.
huggingface.co/chat
models – Open Assistant 30B, Llama 2 70B
parameters cannot be changed.
no system prompt; web search can be enabled.
(does not seem as smart as OpenAI's models)
labs.perplexity.ai
LLaMA Chat – all the models.
(fastest chat service).
20 seconds – 701 tokens. 34 tokens/sec.
27. HUGE Llama 2 with 32K Context Length
https://lnkd.in/gymhnRVg
a 32K context for input plus output, from together.ai.
the model's short-term memory has tremendously increased.
Position interpolation technique.
add essay to playground -> Chat. 14K words in the essay.
select the model, modifications and parameters.
then ask a question:
do you know what happened in 2019 from the above document?
the playground truncates text if it is large; make sure everything is loaded.
Position Interpolation – extends the context window of RoPE-based pretrained LLMs such as the LLaMA models, up to 32,768 tokens (32K), with minimal fine-tuning (within 1000 steps).
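The core of position interpolation is tiny; here is a sketch of the idea (the scale factor 8 comes from 32768 / 4096, Llama 2's trained context length; treat the exact function shape as an assumption):

import numpy as np

def rope_angles(positions, head_dim, base=10000.0, scale=8.0):
    # divide positions by the extension factor so that position 32768 looks
    # like position 4096 to the pretrained model (interpolate, not extrapolate)
    pos = np.asarray(positions, dtype=np.float64) / scale
    inv_freq = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)
    return np.outer(pos, inv_freq)  # angles fed into RoPE's sin/cos rotation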
Flash Attention-2.
outcome – 3x faster.
available on hugging face. models – LLaMA-2-7B-32K.
examples – book summarization.
free option to try – 5000 credits
visit together.ai for more details. comparison of different models.
Note: in a chatbot, a first dialog takes place to find the intent; a goal/sub-goal is decided, and questions are asked to fill the slots for that goal. An action is then taken and the result is shared. FAQ information can be fed into this model and used to answer questions. (toy sketch below)
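A toy sketch of that intent -> goal -> slot-filling flow; every intent and slot name below is invented for illustration:

GOALS = {"book_table": ["restaurant", "time", "party_size"]}  # hypothetical goal/slots

def run_dialog(detect_intent, ask_user):
    # first dialog: find the user's intent, which decides the goal
    intent = detect_intent(ask_user("Hi! How can I help?"))
    # ask questions to fill each slot the goal needs
    slots = {name: ask_user(f"What {name.replace('_', ' ')}?") for name in GOALS[intent]}
    # take the action and share the result (here we just report it)
    return f"Done: {intent} with {slots}"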
28. 6 POWERFUL Llama 2 Models to TRY out today!
Derivative models:
1. airoboros model.
a 70B fine-tune of Llama 2. license – Meta's. uses GPT-4 data.
(OpenAI's API usage clause restricts using its outputs to train models that compete with it)
2. Nous Hermes
the most reliable and trustworthy; fine-tuned on Llama 2. a 13-billion-parameter model trained on over 300,000 instructions.
3. Redmond Puffin.
a 13-billion-parameter model, commercially usable. fine-tuned on 3000 high-quality examples with a 4096 context length.
the GPT-4 examples are long-context conversations with humans; topics span physics, biology, math and chemistry.
two models – Puffin and Hermes-2.
4. Wizard LM – open-source, open-weight model. popular; a 13B-parameter model high on the Hugging Face leaderboard.
5. Luna AI – fine-tuned on 40K long-form chat discussions. uncensored; some complaints about hallucination. good for building chatbots.
6. Stable Beluga 2 – fine-tuned on an Orca-style dataset, from stability.ai. stable, but the license is confusing (not an MIT license).
Llama 2 is a powerful base model that underpins all of these. they sit at the top of the open-source leaderboard, with average scores of about 70%.
29. Fully LOCAL Llama 2 Langchain on CPU!!!
runs GGML models through LangChain; possible because of CTransformers.
resource – CPU. 12 GB RAM.
pick the ggml model.
different quantization levels – 4-bit, 6-bit, …; the more bits, the smaller the accuracy drop.
– but higher-bit models need more resources, and execution is slower.
libraries – ctransformers, langchain (PromptTemplate, LLMChain, StreamingStdOutCallbackHandler).
initialize CTransformers by specifying the model, the model file / bin file, and callbacks for streaming.
specify a PromptTemplate to create the prompt, passing the template string and input variables.
create an LLMChain object by passing the prompt and llm objects, then call the chain's run method,
giving the text input as a parameter to the run method.
change the template and try again with the same question: with the system context removed from the template, the prompt has less input.
additional conditions and rules can go in the run text input itself – e.g. respond briefly, respond in one word, etc. (the whole flow is pulled together in the sketch below)
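Pulling the section together, a minimal sketch assuming LangChain's CTransformers wrapper and a locally downloaded GGML file (the repo and file names are illustrative):

from langchain.llms import CTransformers
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

llm = CTransformers(
    model="TheBloke/Llama-2-7B-Chat-GGML",         # assumed GGML model repo
    model_file="llama-2-7b-chat.ggmlv3.q4_0.bin",  # the 4-bit quantized bin file
    callbacks=[StreamingStdOutCallbackHandler()],  # stream tokens to stdout
)

template = """[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
{text} [/INST]"""
prompt = PromptTemplate(template=template, input_variables=["text"])

chain = LLMChain(prompt=prompt, llm=llm)
chain.run("Explain quantization in one sentence.")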
30. Fully LOCAL Llama 2 Q&A with LangChain!!!
https://www.youtube.com/watch?v=wgYctKFnQ74&list=PLpdmBGJ6ELUKpTgL9RVR86cnPXjfscM5d&index=8
runs locally, suitable for work environments driven by data protection policies.
no endpoints; download the model and use it locally.
Resources – T4 GPU, 15 GB VRAM, 12 GB CPU RAM, 78 GB disk storage.
libraries –
langchain (to build the AI app) – LLMChain, SequentialChain, ConversationBufferMemory, HuggingFacePipeline, PromptTemplate
transformers (Hugging Face library, helps download models) – AutoModel, torch, AutoTokenizer, AutoModelForCausalLM, pipeline
accelerate (GPU management).
bitsandbytes (helps load quantized models).
download –
tokenizer – AutoTokenizer – the NousResearch copy needs no authentication; the Meta AI one does.
model – AutoModelForCausalLM.
configuration –
device_map – "auto" (lets accelerate do the memory management)
torch_dtype –
load_in_4bit – tells bitsandbytes to load the model in 4-bit quantization
bnb_4bit_quant_type –
bnb_4bit_compute_dtype – float16 (default is 32); impacts the inference speed.
Pipeline is a text generation pipeline.
Define the prompt format – B_INST, E_INST, B_SYS, E_SYS and a default system prompt.
Helper functions (from the starter code) – define the prompt template, clean the output by removing unnecessary strings, generate the final output, and parse it before returning it to the user.
get text, create the prompt, pass the prompt to the tokenizer to create tokenized inputs; generate takes the inputs to create outputs.
decode the output, parse it for the final cut-off / cleanup, remove substrings.
define the LLM using HuggingFacePipeline, which takes the pipeline created above as input. specify model kwargs – temperature, max length, top_k.
give system prompt, instruction, get prompt to get template. print and see the template.
Create prompt using Prompt Template by passing template and input variables parameter.
Create llm chain by passing prompt, llm and verbose (true or false).
call llm_chain run method passing the text input.
print the response returned.
44 seconds to give the output; the reasons are the GPU/CPU used and the model size. (full flow sketched below)
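The GPU-local flow of this section as one sketch; the model id, chat tokens, and kwargs are assumptions pieced together from the notes above:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

model_id = "NousResearch/Llama-2-7b-chat-hf"   # NousResearch copy: no authentication
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # lets accelerate place layers / manage memory
    load_in_4bit=True,          # bitsandbytes 4-bit quantization
    torch_dtype=torch.float16,  # compute dtype; default is float32
)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_length=512)
llm = HuggingFacePipeline(pipeline=pipe, model_kwargs={"temperature": 0.7, "top_k": 50})

# Llama 2 chat format: [INST] <<SYS>> system prompt <</SYS>> instruction [/INST]
template = "[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n{text} [/INST]"
prompt = PromptTemplate(template=template, input_variables=["text"])

chain = LLMChain(prompt=prompt, llm=llm, verbose=True)
print(chain.run("What is the right approach to learn Python?"))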
31. PUNCH UP: Mistral 7B vs Llama 2 13B
Mistral 7B compared with the Llama 2 13B-parameter model.
llmboxing.com, hosted by Replicate. It has a Llama vs GPT comparison too.
Ask questions and compare the answers. Questions are generated by GPT-4.