🧠 core architecture
embed_dim: 256
num_layers: 6
num_heads: 8
head_dim: 32
ff_mult: 4
max_seq_len: 256
dropout: 0.1
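These values fix the backbone: 8 heads × 32 dims per head gives the 256-dim residual stream, and ff_mult 4 gives a 1024-wide feed-forward layer. As a point of reference, here is a minimal PyTorch sketch of an equivalent encoder; this is not dytr's generated code, and the encoder-only layout, learned positional embeddings, and the 32,000-token vocabulary are assumptions for illustration only.

import torch
import torch.nn as nn

EMBED_DIM, NUM_LAYERS, NUM_HEADS = 256, 6, 8
FF_MULT, MAX_SEQ_LEN, DROPOUT = 4, 256, 0.1
VOCAB_SIZE = 32_000  # hypothetical; not specified by the config above

class TinyTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.pos_emb = nn.Embedding(MAX_SEQ_LEN, EMBED_DIM)  # learned positions (assumption)
        layer = nn.TransformerEncoderLayer(
            d_model=EMBED_DIM,
            nhead=NUM_HEADS,                      # head_dim = 256 / 8 = 32
            dim_feedforward=FF_MULT * EMBED_DIM,  # 4 * 256 = 1024
            dropout=DROPOUT,
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=NUM_LAYERS)

    def forward(self, ids):
        pos = torch.arange(ids.size(1), device=ids.device)
        x = self.tok_emb(ids) + self.pos_emb(pos)
        return self.encoder(x)

model = TinyTransformer()
out = model(torch.randint(0, VOCAB_SIZE, (16, 256)))  # (batch_size, max_seq_len, embed_dim)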
⚙️ training & optimization
batch_size: 16
learning_rate: 3.0e-4
num_train_epochs: 5
weight_decay: 0.01
gradient_clip: 1.0
warmup_steps: 1000
label_smoothing: 0.1
patience: 3
max_learning_rate: 5.0e-4
min_learning_rate: 1.0e-6
adam_epsilon: 1.0e-8
gradient_accumulation_steps: 1
lr_scheduler_type: —
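A hedged sketch of how these settings typically map onto PyTorch objects (AdamW, label-smoothed cross-entropy, linear warmup). The cosine decay toward min_learning_rate is an assumption, since lr_scheduler_type is left blank above; TOTAL_STEPS is illustrative, and `model` refers to the sketch in the previous section.

import math
import torch
import torch.nn as nn

LR, WEIGHT_DECAY, ADAM_EPS = 3e-4, 0.01, 1e-8
GRAD_CLIP, LABEL_SMOOTHING = 1.0, 0.1
WARMUP_STEPS, MIN_LR = 1000, 1e-6
TOTAL_STEPS = 10_000  # illustrative; actual value depends on dataset size, batch_size=16, 5 epochs

optimizer = torch.optim.AdamW(
    model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY, eps=ADAM_EPS
)
criterion = nn.CrossEntropyLoss(label_smoothing=LABEL_SMOOTHING)

def lr_lambda(step):
    # Linear warmup for 1000 steps, then cosine decay toward min_learning_rate (assumed).
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1 + math.cos(math.pi * min(progress, 1.0)))
    return max(MIN_LR / LR, cosine)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Per optimizer step: loss.backward(); clip; optimizer.step(); scheduler.step()
# torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)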
🧠 task-specific learning rates
shared_lr_mult: 0.5
head_lr_mult: 1.0
decoder_lr_mult: 1.0
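These multipliers scale the base learning rate (3e-4) per parameter group. A sketch of how such groups could be built, again reusing `model` from above; the "shared" / "head" / "decoder" name prefixes are hypothetical and depend on how the model names its submodules.

import torch

BASE_LR = 3e-4
MULTS = {"shared": 0.5, "head": 1.0, "decoder": 1.0}

def build_param_groups(model):
    groups = {name: [] for name in MULTS}
    other = []
    for pname, p in model.named_parameters():
        prefix = next((k for k in MULTS if pname.startswith(k)), None)
        (groups[prefix] if prefix else other).append(p)
    param_groups = [
        {"params": ps, "lr": BASE_LR * MULTS[name]}
        for name, ps in groups.items() if ps
    ]
    if other:  # anything not matched keeps the base learning rate
        param_groups.append({"params": other, "lr": BASE_LR})
    return param_groups

optimizer = torch.optim.AdamW(build_param_groups(model), weight_decay=0.01, eps=1e-8)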
🧠 continual learning & adapters
use_task_adapters: —
use_rotary_embedding: —
use_ewc: —
use_replay: —
use_flash_attention: —
gradient_checkpointing: —
adapter_bottleneck: 64
ewc_lambda: 1000
replay_buffer_size: 1000
LoRA mode: —
LoRA rank: 8
use_pretrained: —
training_from_scratch: —
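A hedged sketch of three of the mechanisms these knobs control: a 64-unit bottleneck adapter, a rank-8 LoRA update on a frozen linear layer, and the EWC quadratic penalty weighted by ewc_lambda. dytr's own implementations may differ, and the replay buffer (1000 samples) is omitted here.

import torch
import torch.nn as nn

EMBED_DIM, ADAPTER_BOTTLENECK, LORA_RANK, EWC_LAMBDA = 256, 64, 8, 1000.0

class Adapter(nn.Module):
    """Bottleneck adapter: 256 -> 64 -> 256 with a residual connection."""
    def __init__(self, dim=EMBED_DIM, bottleneck=ADAPTER_BOTTLENECK):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable rank-8 update (B @ A)."""
    def __init__(self, dim=EMBED_DIM, rank=LORA_RANK):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, dim) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(dim, rank))

    def forward(self, x):
        return self.base(x) + x @ (self.lora_b @ self.lora_a).T

def ewc_penalty(model, fisher, old_params, lam=EWC_LAMBDA):
    """EWC regularizer: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2."""
    loss = 0.0
    for name, p in model.named_parameters():
        if name in fisher:
            loss = loss + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * loss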
📋 tasks & datasets
No tasks defined.
💻 generated dytr code (ready to run)
# configure your model above
🧠 dytr — Dynamic Transformer Library | GitHub | PyPI