root@6ec195fa4e80:/workspace# python train.py config/train_shakespeare_char.py
Overriding config with config/train_shakespeare_char.py:

out_dir = 'out-shakespeare-char'
 eval_interval = 250
 eval_iters = 200
log_interval = 10

always_save_checkpoint = False
 
 wandb_log = False
 wandb_project = 'shakespeare-char'
 wandb_run_name = 'mini-gpt'
 
 dataset = 'shakespeare_char'
 gradient_accumulation_steps = 1
 batch_size = 64
block_size = 256

n_layer = 6
 n_head = 6
 n_embd = 384
 dropout = 0.2
 
 learning_rate = 1e-3
 max_iters = 5000
 lr_decay_iters = 5000
 min_lr = 1e-4
 beta2 = 0.99
 
warmup_iters = 100

tokens per iteration will be: 16,384
 found vocab_size = 65 (inside data/shakespeare_char/meta.pkl)
 Initializing a new model from scratch
 number of parameters: 10.65M
 /workspace/train.py:196: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
 scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))
 num decayed parameter tensors: 26, with 10,740,096 parameters
 num non-decayed parameter tensors: 13, with 4,992 parameters
 using fused AdamW: True
 compiling the model... (takes a ~minute)
 step 0: train loss 4.2874, val loss 4.2823
 iter 0: loss 4.2654, time 31455.46ms, mfu -100.00%
 iter 10: loss 3.1462, time 30.46ms, mfu 12.23%
 iter 20: loss 2.7322, time 30.10ms, mfu 12.25%
 iter 30: loss 2.6184, time 30.09ms, mfu 12.26%
 iter 40: loss 2.5757, time 30.33ms, mfu 12.26%
 iter 50: loss 2.5249, time 30.35ms, mfu 12.26%
 iter 60: loss 2.5143, time 30.17ms, mfu 12.27%
 iter 70: loss 2.4947, time 30.19ms, mfu 12.28%
 iter 80: loss 2.4936, time 30.05ms, mfu 12.29%
 iter 90: loss 2.4679, time 30.50ms, mfu 12.28%
 iter 100: loss 2.4594, time 30.74ms, mfu 12.27%
 iter 110: loss 2.4667, time 30.73ms, mfu 12.25%
 iter 120: loss 2.4262, time 30.29ms, mfu 12.26%
 iter 130: loss 2.4127, time 30.42ms, mfu 12.26%
 iter 140: loss 2.4148, time 30.33ms, mfu 12.26%
 iter 150: loss 2.4139, time 30.26ms, mfu 12.27%
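The "tokens per iteration will be: 16,384" line near the top of the run is just the product of three of the config values above. A minimal sanity check, assuming train.py multiplies gradient accumulation steps, batch size, and block size (times the number of DDP processes, which is 1 for this single-GPU run):

```python
# Sanity check of the "tokens per iteration" figure, using the config above.
# Assumes a single-process run, i.e. an effective DDP world size of 1.
gradient_accumulation_steps = 1
batch_size = 64    # sequences per micro-batch
block_size = 256   # tokens per sequence (context length)

tokens_per_iter = gradient_accumulation_steps * batch_size * block_size
print(f"{tokens_per_iter:,}")  # -> 16,384
```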
 
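The FutureWarning about torch.cuda.amp.GradScaler does not affect the run; it only points at the relocated constructor. A small sketch of the replacement the warning itself suggests, keeping the same enabled logic as the deprecated call (assumes a PyTorch version new enough to ship torch.amp.GradScaler):

```python
import torch

dtype = 'float16'  # as in this run; loss scaling is only needed for float16
# New-style constructor: the device string comes first,
# replacing the deprecated torch.cuda.amp.GradScaler(...).
scaler = torch.amp.GradScaler('cuda', enabled=(dtype == 'float16'))
```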
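The parameter counts can also be cross-checked from the config alone. A rough sketch, assuming the standard nanoGPT layout for this config (no linear or LayerNorm biases, the output head weight-tied to the token embedding, and the headline "number of parameters" excluding the position embedding table):

```python
# Back-of-the-envelope reconstruction of the counts printed in the log.
n_layer, n_embd = 6, 384
vocab_size, block_size = 65, 256

linear_weights = 12 * n_layer * n_embd ** 2     # attention (4*d^2) + MLP (8*d^2) per block
token_emb = vocab_size * n_embd                 # shared with the output head
pos_emb = block_size * n_embd
layernorm_weights = (2 * n_layer + 1) * n_embd  # 13 LayerNorm weight vectors

decayed = linear_weights + token_emb + pos_emb  # 10,740,096 across 26 tensors
non_decayed = layernorm_weights                 # 4,992 across 13 tensors
reported = decayed + non_decayed - pos_emb      # 10,646,784, printed as 10.65M
print(f"{decayed:,} {non_decayed:,} {reported:,}")
```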