Getting Started

The model weights can be downloaded from Andrej Karpathy's tinyllamas repository on HuggingFace. The default tokenizer, which is compatible with stories15M.bin, stories42M.bin, and stories110M.bin, can be downloaded from our repository or from the llama2.c repository.
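For convenience, both files can also be fetched from Julia using the standard Downloads library. This is a sketch only: the URLs below point at the upstream repositories as of writing and may change, and the target paths merely match the examples further down this page.

```julia
using Downloads

# Illustrative local layout; adjust paths to your setup.
mkpath("bin/transformer")
mkpath("bin/tokenizer")

# Model weights from Andrej Karpathy's tinyllamas repo on HuggingFace.
Downloads.download(
    "https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin",
    "bin/transformer/stories15M.bin")

# Default tokenizer from the llama2.c repository.
Downloads.download(
    "https://github.com/karpathy/llama2.c/raw/master/tokenizer.bin",
    "bin/tokenizer/tokenizer.bin")
```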

Add the package to your local environment via Pkg by running

add https://github.com/kleincode/Llama2.jl

Import the package using

julia> using Llama2

Generate text

config, weights = read_karpathy("bin/transformer/stories15M.bin") # replace with path to model weights
state = RunState{Float32}(config)
transformer = Transformer{Float32}(config, weights, state)
tokenizer = Tokenizer("bin/tokenizer/tokenizer.bin", config.vocab_size) # replace with path to tokenizer
sampler = Sampler{Float32}(1.0f0, 0.9f0, 420)

prompt = "Once upon a"
generate(transformer, tokenizer, sampler, prompt)
"Once upon a time, there was a little girl named Lily. She loved to help her mommy in the kitchen. One day, her mommy was making some yummy cookies and asked Lily to help her. Lily was so excited! \nShe put on her apron and stood on a stool so she could reach the cookies. Her mommy was so proud of her independent little girl. They mixed the ingredients together and put the cookies in the oven. \nAfter a little while, the cookies were ready and they smelled delicious. Lily's mommy let her have a slice and they both enjoyed the warm, tasty cookie. From that day on, Lily loved to help her mommy in the kitchen and help cook yummy treats."

The generate method has several optional keyword arguments; check out its docs for more information.

Use the tokenizer

The tokenizer encodes text into vectors of tokens (integers) and decodes tokens back into text. Usually, the token translation table is read from a file, as shown here:

julia> tokenizer = Tokenizer("bin/tokenizer/tokenizer.bin", 32000)
Tokenizer{Float32}(["<unk>", "\n<s>\n", "\n</s>\n", "<0x00>", "<0x01>", "<0x02>", "<0x03>", "<0x04>", "<0x05>", "<0x06>"  …  "ὀ", "げ", "べ", "边", "还", "黃", "왕", "收", "弘", "给"], Dict("âr" => 28727, " properly" => 6285, "chem" => 14970, " patients" => 22070, " Plan" => 8403, "<0x2A>" => 46, "рос" => 10375, "null" => 4305, "rę" => 15387, "ört" => 21069…), Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  -31731.0, -31732.0, -31733.0, -31734.0, -31735.0, -31736.0, -31737.0, -31738.0, -31739.0, -31740.0])
julia> encode(tokenizer, "Tokens are beautiful! :))")
8-element Vector{Int64}:
     2
 11891
   576
   527
  9561
 29992
   585
   877
julia> decode(tokenizer, 2, 11891)
"Tok"

For more information, including how to define a tokenizer manually, check out Tokenizer.

Use the sampler

The sampler's job is to sample (select) the next token from the raw output logits produced by the transformer. In the usual case, it scales the logits by the inverse temperature, applies a softmax, and samples from the resulting multinomial probability distribution. If the temperature is zero, this reduces to taking the index of the maximum logit and becomes deterministic. If top-p (the second constructor argument) is set to a value between 0 and 1, sampling is restricted to the smallest set of highest-probability tokens whose cumulative probability exceeds p (nucleus sampling).
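The temperature mechanism can be sketched in a few lines of plain Julia. This is a simplified illustration of the idea, not the package's implementation (it omits top-p filtering):

```julia
# Sketch of temperature sampling: scale logits by 1/temperature, softmax,
# then draw from the multinomial distribution. temperature == 0 is argmax.
function sample_token(logits::Vector{Float64}, temperature::Float64)
    temperature == 0.0 && return argmax(logits)
    z = logits ./ temperature
    z .-= maximum(z)                     # shift for numerical stability
    p = exp.(z) ./ sum(exp.(z))          # softmax probabilities
    r = rand()
    c = 0.0
    for (i, prob) in enumerate(p)        # inverse-CDF sampling
        c += prob
        r <= c && return i
    end
    return length(p)
end
```

Lower temperatures sharpen the distribution toward the most likely token; higher temperatures flatten it and make the output more random.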

julia> sampler_mult = Sampler{Float64}(0.5, 0.0, 1)
Sampler{Float64}(0.5, 0.0, Random.MersenneTwister(1))

julia> [sampler_mult([-0.5, 0.5, 0.2]) for i in 1:10]
10-element Vector{Int64}:
 2
 2
 2
 1
 2
 2
 3
 3
 2
 3

julia> sampler_det = Sampler{Float64}(0.0, 0.0, 1)
Sampler{Float64}(0.0, 0.0, Random.MersenneTwister(1))

julia> [sampler_det([-0.5, 0.5, 0.2]) for i in 1:10]
10-element Vector{Int64}:
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2

julia> sampler_topp = Sampler{Float64}(1.0, 0.5, 1)
Sampler{Float64}(1.0, 0.5, Random.MersenneTwister(1))

julia> [sampler_topp([-0.5, 0.5, 0.2]) for i in 1:10]
10-element Vector{Int64}:
 2
 2
 2
 2
 2
 2
 3
 3
 2
 3

The sampler can work with arbitrary Real types. For more information, check out Sampler.