root@6ec195fa4e80:/workspace# python train.py config/train_shakespeare_char.py
Overriding config with config/train_shakespeare_char.py:
out_dir = 'out-shakespeare-char'
eval_interval = 250
eval_iters = 200
log_interval = 10
always_save_checkpoint = False
wandb_log = False
wandb_project = 'shakespeare-char'
wandb_run_name = 'mini-gpt'
dataset = 'shakespeare_char'
gradient_accumulation_steps = 1
batch_size = 64
block_size = 256
n_layer = 6
n_head = 6
n_embd = 384
dropout = 0.2
learning_rate = 1e-3
max_iters = 5000
lr_decay_iters = 5000
min_lr = 1e-4
beta2 = 0.99
warmup_iters = 100
tokens per iteration will be: 16,384
found vocab_size = 65 (inside data/shakespeare_char/meta.pkl)
Initializing a new model from scratch
number of parameters: 10.65M
/workspace/train.py:196: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))
num decayed parameter tensors: 26, with 10,740,096 parameters
num non-decayed parameter tensors: 13, with 4,992 parameters
using fused AdamW: True
compiling the model... (takes a ~minute)
step 0: train loss 4.2874, val loss 4.2823
iter 0: loss 4.2654, time 31455.46ms, mfu -100.00%
iter 10: loss 3.1462, time 30.46ms, mfu 12.23%
iter 20: loss 2.7322, time 30.10ms, mfu 12.25%
iter 30: loss 2.6184, time 30.09ms, mfu 12.26%
iter 40: loss 2.5757, time 30.33ms, mfu 12.26%
iter 50: loss 2.5249, time 30.35ms, mfu 12.26%
iter 60: loss 2.5143, time 30.17ms, mfu 12.27%
iter 70: loss 2.4947, time 30.19ms, mfu 12.28%
iter 80: loss 2.4936, time 30.05ms, mfu 12.29%
iter 90: loss 2.4679, time 30.50ms, mfu 12.28%
iter 100: loss 2.4594, time 30.74ms, mfu 12.27%
iter 110: loss 2.4667, time 30.73ms, mfu 12.25%
iter 120: loss 2.4262, time 30.29ms, mfu 12.26%
iter 130: loss 2.4127, time 30.42ms, mfu 12.26%
iter 140: loss 2.4148, time 30.33ms, mfu 12.26%
iter 150: loss 2.4139, time 30.26ms, mfu 12.27%
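
The "tokens per iteration will be: 16,384" figure follows directly from the settings printed above: on a single GPU it is gradient_accumulation_steps * batch_size * block_size = 1 * 64 * 256 = 16,384 tokens processed per optimizer step.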
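
The FutureWarning in the log is harmless and training proceeds normally. If you want to silence it on recent PyTorch versions, the warning itself names the replacement: construct the scaler via torch.amp.GradScaler with an explicit device string. A minimal sketch of that one-line change in train.py (here dtype stands in for the precision string train.py already defines based on hardware support):

    import torch

    dtype = 'float16'  # placeholder; train.py picks 'bfloat16' or 'float16' depending on the GPU
    # deprecated form, as shown in the warning:
    # scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))
    # replacement suggested by the warning:
    scaler = torch.amp.GradScaler('cuda', enabled=(dtype == 'float16'))

The scaler is only active for float16, since bfloat16 autocast does not need gradient scaling.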