Dion: Microsoft's Scalable, Communication-Efficient Optimizer for Distributed Machine Learning

Dion is an optimizer developed by Microsoft to address the communication bottlenecks of distributed machine learning training, particularly when using orthonormal updates. Building on the foundation laid by Muon, Dion introduces a more scalable approach to orthonormalizing weight matrices across multiple devices, enabling faster model convergence in large-scale training. This post covers the key features, implementation details, and usage guidelines of the Dion optimizer, based on the official documentation in the Microsoft GitHub repository.

Key Features of Dion

Dion offers a communication-efficient alternative to methods like Muon, particularly when weight matrices are sharded across devices. Its primary advantage stems from amortized power iteration, a technique that orthonormalizes directly on sharded matrices and so reduces the need for communication-intensive reconstruction of the full matrix. Key features include:

Scalable Orthonormal Updates: Dion provides faster model convergence by computing orthonormal weight updates.
Communication Efficiency: It reduces communication overhead by operating directly on sharded matrices, avoiding the need to reconstruct full matrices.
Low-Rank Compression: Dion incorporates a rank fraction hyperparameter, enabling compute and communication reduction through low-rank approximation. An error feedback mechanism mitigates information loss.
Parallelization Support: Dion supports single-device training, PyTorch DDP, and PyTorch FSDP2. Notably, it is designed to work with FSDP2 combined with tensor parallelism (TP), a combination that Muon does not support directly.
Batch Processing and Overlapping: The optimizer processes parameters in batches and interleaves them to overlap compute with communication, improving performance.
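To make the idea of amortized power iteration with error feedback more concrete, here is a minimal single-device sketch restricted to rank 1, where orthonormalization reduces to normalizing a vector. It is not Dion's implementation: the function names, the exact update rule, and the momentum coefficient mu are assumptions made for illustration. "Amortized" here means that each optimizer step runs only one power-iteration step, warm-started from the right factor q kept from the previous step.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(a: Matrix, x: Vector) -> Vector:
    """Return a @ x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def rmatvec(a: Matrix, y: Vector) -> Vector:
    """Return a.T @ y."""
    return [sum(a[i][j] * y[i] for i in range(len(a))) for j in range(len(a[0]))]


def normalize(x: Vector, eps: float = 1e-12) -> Vector:
    """Scale x to unit Euclidean norm (the rank-1 case of orthonormalization)."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / (norm + eps) for v in x]


def rank1_step(momentum: Matrix, grad: Matrix, q: Vector,
               mu: float) -> Tuple[Matrix, Matrix, Vector]:
    """One illustrative step; returns (update, new_momentum, new_q)."""
    # Accumulate the new gradient into the momentum buffer.
    b = [[m + g for m, g in zip(mrow, grow)]
         for mrow, grow in zip(momentum, grad)]
    # A single power-iteration step, warm-started from the previous q.
    p = normalize(matvec(b, q))  # left factor, unit norm
    r = rmatvec(b, p)            # right factor, carries the magnitude
    # Error feedback: subtract only (1 - mu) of the captured component, so
    # whatever the low-rank factors missed stays in the buffer for later steps.
    new_momentum = [[bij - (1.0 - mu) * pi * rj for bij, rj in zip(brow, r)]
                    for brow, pi in zip(b, p)]
    new_q = normalize(r)
    # Rank-1 update direction with unit Frobenius norm.
    update = [[pi * qj for qj in new_q] for pi in p]
    return update, new_momentum, new_q


# Usage: an exactly rank-1 gradient is captured completely when mu = 0,
# so nothing is left behind in the momentum buffer.
grad = [[2.0, 4.0], [1.0, 2.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
update, momentum, q = rank1_step(zeros, grad, q=[1.0, 0.0], mu=0.0)
```

In the higher-rank case the normalization of p would be replaced by an orthonormalization of a tall matrix, and in a sharded setting the two matrix products are the steps that would require communication.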

Implementation and Usage

Dion can be installed with pip directly from its GitHub repository, which simplifies integration into existing projects:

pip install git+https://github.com/microsoft/dion.git

Using Dion requires careful parameter grouping, as it is designed specifically for two-dimensional matrix weights. Other parameters (biases, embeddings, normalization layers) need a separate optimizer such as Lion or AdamW. Key considerations for parameter grouping:

Matrix Weights: Primarily the weights of nn.Linear layers. Use Dion or Muon, which apply an automatic shape-dependent learning rate scaling.
Bias Vectors & Normalization Layers: Use Lion or AdamW.
Embedding Layers: Treat them as a collection of 1D vectors and use Lion or AdamW. Dion will run on them without error but performs poorly.
Unembedding Layers (LM Head): Treat them as a collection of 1D vectors and use Lion or AdamW with a smaller scaled learning rate, which is critical for performance. These layers must be identified manually.
Convolution Layers: Muon has experimental support via flatten=True. Dion does not support them.
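The grouping rules above can be sketched as a small helper that sorts parameter names into buckets based on their shapes. This is only an illustration: the function name, the bucket labels, and the example parameter names are hypothetical, and the embedding and LM head names are passed in explicitly because shape alone cannot tell them apart from ordinary matrix weights.

```python
from typing import Dict, Iterable, List, Tuple


def group_parameters(
    named_shapes: Iterable[Tuple[str, Tuple[int, ...]]],
    embedding_names: Iterable[str],
    lm_head_names: Iterable[str],
) -> Dict[str, List[str]]:
    """Sort parameter names into optimizer buckets (hypothetical labels)."""
    embedding_set = set(embedding_names)
    lm_head_set = set(lm_head_names)
    groups: Dict[str, List[str]] = {
        "matrix": [],       # Dion or Muon
        "scalar": [],       # biases and normalization: Lion or AdamW
        "embedding": [],    # Lion or AdamW
        "lm_head": [],      # Lion or AdamW with a smaller learning rate
        "unsupported": [],  # e.g. convolution weights, not handled by Dion
    }
    for name, shape in named_shapes:
        # Check the manually identified layers first: they are 2D and would
        # otherwise be mistaken for ordinary matrix weights.
        if name in lm_head_set:
            groups["lm_head"].append(name)
        elif name in embedding_set:
            groups["embedding"].append(name)
        elif len(shape) == 2:
            groups["matrix"].append(name)
        elif len(shape) <= 1:
            groups["scalar"].append(name)
        else:
            groups["unsupported"].append(name)
    return groups


# Usage with made-up parameter names and shapes.
shapes = [
    ("tok_embed.weight", (50000, 512)),
    ("block0.proj.weight", (512, 512)),
    ("block0.proj.bias", (512,)),
    ("block0.norm.weight", (512,)),
    ("lm_head.weight", (50000, 512)),
]
groups = group_parameters(
    shapes,
    embedding_names=["tok_embed.weight"],
    lm_head_names=["lm_head.weight"],
)
```

With a PyTorch model, the (name, shape) pairs could come from model.named_parameters(), for example ((n, tuple(p.shape)) for n, p in model.named_parameters()).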

Distributed Training Configuration

Proper configuration of the device mesh is crucial for running Dion efficiently in distributed environments. Dion uses DeviceMesh objects to understand the model’s parallelization scheme.

Device Mesh for Dion: Supports up to two sharded mesh dimensions (outer_shard_mesh and inner_shard_mesh) and any number of data-parallel replicated mesh dimensions. The inner_shard_mesh works best when mapped to intra-node tensor parallelism. Unused meshes can be omitted.
Device Mesh for Muon: Uses a single 1D device mesh, either for sharding parameters (with efficient unsharding via all-to-all) or for distributing work (with results all-gathered). 2D sharding is not supported; use Dion instead.
DDP ProcessGroup: DDP is also supported through PyTorch’s ProcessGroup. If no process_group is given, Dion redundantly executes the same work on every GPU.
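Before building a mesh, it can help to check that the chosen sizes actually fit the cluster. The sketch below is a plain-Python sanity check, not part of Dion: the function name, its arguments, and the rule that the inner shard size should divide the number of GPUs per node (so that the inner dimension stays inside one node) are assumptions made for illustration.

```python
from typing import Dict


def plan_mesh(world_size: int, gpus_per_node: int,
              outer_shard: int, inner_shard: int) -> Dict[str, int]:
    """Return mesh dimension sizes, or raise if the layout does not fit."""
    if outer_shard < 1 or inner_shard < 1:
        raise ValueError("shard sizes must be positive")
    # Keep the inner (tensor-parallel) dimension within a single node.
    if gpus_per_node % inner_shard != 0:
        raise ValueError("inner_shard must divide gpus_per_node")
    shard_size = outer_shard * inner_shard
    if world_size % shard_size != 0:
        raise ValueError("outer_shard * inner_shard must divide world_size")
    # Whatever is left over becomes the data-parallel replicate dimension.
    return {
        "replicate": world_size // shard_size,
        "outer_shard": outer_shard,
        "inner_shard": inner_shard,
    }


# Usage: 16 GPUs on 2 nodes, 4-way inner sharding, 2-way outer sharding.
layout = plan_mesh(world_size=16, gpus_per_node=8, outer_shard=2, inner_shard=4)
```

The resulting sizes could then be passed as the mesh shape to torch.distributed.device_mesh.init_device_mesh, with mesh_dim_names chosen to match; the dimension labels used here are arbitrary.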

Compressed Data-Parallel Gradient Sync

Dion supports compressed data-parallel gradient synchronization: instead of a full-gradient all-reduce, it can synchronize only low-rank matrices, which drastically reduces communication. The replicate_mesh_grad_sync option controls this feature:

replicate_mesh_grad_sync=True: Dion all-reduces the low-rank compressed states itself. This results in decoupled momentum, which needs explicit synchronization before checkpointing.
replicate_mesh_grad_sync=False: Dion expects gradients to have been synchronized already.

When using Dion with HSDP (Hybrid Sharded Data Parallel), pass only the sharded sub-mesh to fully_shard() if Dion manages gradient synchronization.
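A back-of-the-envelope estimate shows why synchronizing low-rank matrices can be much cheaper than a full-gradient all-reduce. The sketch assumes that the synchronized state for an m × n weight consists of two factors of shapes m × r and n × r, and that the rank r is the rank fraction times the smaller matrix dimension. Both are simplifying assumptions; the tensors Dion actually exchanges may differ.

```python
from typing import Tuple


def rank_from_fraction(rows: int, cols: int, rank_fraction: float) -> int:
    """Assumed rule: rank = rank_fraction * min(rows, cols), at least 1."""
    return max(1, int(rank_fraction * min(rows, cols)))


def sync_elements(rows: int, cols: int, rank: int) -> Tuple[int, int]:
    """Elements per replica: (full gradient, two low-rank factors)."""
    full = rows * cols
    low_rank = (rows + cols) * rank
    return full, low_rank


# Usage: a 4096 x 4096 weight with a rank fraction of 1/16.
rank = rank_from_fraction(4096, 4096, 0.0625)   # 256
full, low_rank = sync_elements(4096, 4096, rank)
reduction = full / low_rank                      # 8.0
```

Under these assumptions the low-rank path moves one eighth of the data for this matrix, and the saving grows as the rank fraction shrinks.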

Checkpointing

Because of Dion’s decoupled momentum, optimizer.synchronize_for_checkpoint() must be called before saving a checkpoint so that the optimizer states are synchronized and consistent. Checkpoints that contain DTensor types must be saved with torch.distributed.checkpoint.
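The ordering matters: synchronize first, then read the optimizer state, then write it. The sketch below captures only that ordering. The helper name and its callable arguments are hypothetical, and the recording class is a stand-in used to show the call order, not a real optimizer.

```python
from typing import Any, Callable, Dict, List


def save_checkpoint(
    optimizer: Any,
    build_state: Callable[[], Dict[str, Any]],
    save_fn: Callable[[Dict[str, Any]], None],
) -> None:
    """Synchronize the optimizer, then build and save the state."""
    # 1. Synchronize decoupled optimizer states across replicas.
    optimizer.synchronize_for_checkpoint()
    # 2. Build the state only after synchronization, so it is consistent.
    state = build_state()
    # 3. Hand the state to whatever actually writes the checkpoint.
    save_fn(state)


class RecordingOptimizer:
    """Stand-in that records calls; for demonstration only."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def synchronize_for_checkpoint(self) -> None:
        self.calls.append("synchronize_for_checkpoint")

    def state_dict(self) -> Dict[str, Any]:
        self.calls.append("state_dict")
        return {"step": 1}


# Usage with the stand-in optimizer and an in-memory "writer".
saved: List[Dict[str, Any]] = []
opt = RecordingOptimizer()
save_checkpoint(
    opt,
    build_state=lambda: {"optimizer": opt.state_dict()},
    save_fn=saved.append,
)
```

In a real training script, save_fn would wrap a call into torch.distributed.checkpoint, and build_state would collect the model and optimizer state to be written.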

Best Practices and Experimental Features

Dion Rank Fraction: The most critical hyperparameter; it controls the degree of low-rank compression. Experiment to find the best value for your model size.
Lion vs. AdamW: Lion generally performs better than AdamW for scalar parameters when combined with Dion/Muon.
2D Sharding: When using both FSDP and TP, shard along different matrix dimensions.
Learning Rate Scaling: Dion scales the learning rate automatically. Muon also supports the Moonshot AI scale factor.
Experimental Features: Mixed precision, and a faster Dion variant based on the Cholesky QR (CQR) algorithm.
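The text does not spell out the learning rate scaling rules, so the following is only an illustration of what shape-dependent scaling can look like. Both formulas are assumptions here and should be checked against the repository: a spectral-style factor of sqrt(d_out / d_in), and the factor 0.2 · sqrt(max(d_out, d_in)) commonly attributed to Moonshot AI's Muon recipe.

```python
import math


def spectral_scale(d_out: int, d_in: int) -> float:
    """Assumed shape-dependent factor: sqrt(d_out / d_in)."""
    return math.sqrt(d_out / d_in)


def moonshot_scale(d_out: int, d_in: int) -> float:
    """Assumed Moonshot-style factor: 0.2 * sqrt(max(d_out, d_in))."""
    return 0.2 * math.sqrt(max(d_out, d_in))


# Usage: effective learning rates for a 4096 x 1024 weight, base lr 0.01.
base_lr = 0.01
lr_spectral = base_lr * spectral_scale(4096, 1024)
lr_moonshot = base_lr * moonshot_scale(4096, 1024)
```

Either way, the point is that the multiplier depends on the matrix shape, so a single base learning rate can be shared across layers of different sizes.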

Conclusion

Dion offers a promising approach to scaling up distributed machine learning training: a communication-efficient optimizer that combines orthonormal updates with low-rank compression. Using it effectively requires understanding its specific requirements for parameter grouping, device mesh configuration, and checkpointing.

Source: Microsoft
