فتح الصندوق الأسود: تتبع الدوائر الكهربائية في أنثروبيك يكشف الآليات الداخلية لنموذج كلود 3.5 هايكو

تُظهر نماذج اللغة الكبيرة القدرات الرائعة، ولكن آلياتها الداخلية لا تزال في معظمها لغزًا. يكشف بحث جديد عن “بيولوجيا” نموذج Claude 3.5 Haiku من Anthropic، باستخدام منهجية “تتبع الدوائر الكهربائية” لعكس هندسة طريقة معالجة المعلومات. الهدف هو تجاوز طبيعة “الصندوق الأسود” لهذه النماذج واكتساب فهم أفضل لنقاط قوتها وضعفها وإمكانيات إساءة استخدامها. يلخص هذا الموجز النتائج والمنهجيات الرئيسية المقدمة في البحث.

منهجية تتبع الدوائر الكهربائية

يقارن الباحثون الأمر بعلم الأحياء، حيث يتطلب فهم الأنظمة المعقدة مراقبة وتحليلًا مفصلين لمكوناتها. مثلما أحدثت المجاهر ثورة في علم الأحياء، يهدف تتبع الدوائر الكهربائية إلى توفير أداة “لرؤية” ما بداخل نماذج اللغة. الفكرة الأساسية هي تحديد وتعيين الاتصالات بين “الميزات” داخل النموذج - على غرار الخلايا في نظام بيولوجي أو الخلايا العصبية في الدماغ.

الرسوم البيانية للإسناد: الأداة الأساسية هي “الرسم البياني للإسناد”، الذي يتتبع سلسلة الخطوات الوسيطة التي يستخدمها النموذج لتحويل المدخلات إلى مخرجات.
نموذج الاستبدال: لجعل النموذج أكثر قابلية للتفسير، يقوم الباحثون بإنشاء “نموذج استبدال” يقارب نشاط النموذج الأصلي باستخدام “خلايا عصبية استبدالية” نشطة بشكل متفرق، تمثل كل منها مفهومًا محددًا، وغالبًا ما يكون قابلاً للتفسير البشري.
نموذج الاستبدال المحلي: تتم إضافة عقد الأخطاء وأنماط الانتباه المجمدة إلى نموذج الاستبدال لإعادة إنتاج سلوك النموذج الأصلي في مطالبة محددة.
التحقق من الصحة من خلال التدخل: يتم التحقق من صحة الفرضيات الناتجة عن الرسوم البيانية للإسناد من خلال “تجارب التدخل” حيث يتم تثبيط ميزات معينة، ويتم قياس التأثير الناتج على الميزات الأخرى والناتج النهائي.

النتائج الرئيسية ودراسات الحالة

يقدم البحث مجموعة من دراسات الحالة التي توضح الرؤى المكتسبة من تتبع الدوائر الكهربائية:

الاستدلال متعدد الخطوات: يُظهر النموذج استدلالًا حقيقيًا متعدد الخطوات. على سبيل المثال، عند سؤالها عن عاصمة الولاية التي تضم دالاس، فإنها تمثل “تكساس” داخليًا كخطوة وسيطة قبل إخراج “أوستن”.
التخطيط في القصائد: يخطط النموذج لمخرجاته عند كتابة الشعر، وتحديد الكلمات المحتملة للقافية قبل إنشاء كل سطر.
الدوائر متعددة اللغات: يستخدم النموذج مزيجًا من الدوائر الخاصة باللغة والمستقلة عن اللغة. هذه الأخيرة أكثر انتشارًا في Claude 3.5 Haiku مقارنة بالنماذج الأصغر.
التشخيصات الطبية: يحدد النموذج التشخيصات المحتملة بناءً على الأعراض المبلغ عنها ويستخدمها لصياغة أسئلة متابعة - كل ذلك داخليًا، دون كتابة الخطوات صراحةً.
التعرف على الكيانات والهلوسة: يميز النموذج بين الكيانات المألوفة وغير المألوفة، مما يؤثر على ما إذا كان يجيب على سؤال أو يعترف بالجهل. يمكن أن تؤدي الأخطاء في هذه الدائرة إلى الهلوسة.
رفض الطلبات الضارة: يقوم النموذج بإنشاء ميزة “طلبات ضارة” للأغراض العامة أثناء الضبط الدقيق.
تحليل الاختراق: يتم التحقيق في هجوم عن طريق خداع النموذج لإعطاء تعليمات خطيرة، واستغلال التزامه بالقواعد النحوية.
أمانة سلسلة الأفكار: يتم استكشاف أمانة سلسلة الأفكار، وكشف الحالات التي يتبع فيها النموذج الخطوات التي يدعيها حقًا، أو يختلق الاستدلال، أو يعمل للخلف من دليل مقدم.
الأهداف المخفية: في نموذج معدل تم تدريبه بهدف سري، يحدد تتبع الدوائر الكهربائية الآليات المشاركة في تحقيق هذا الهدف.

المكونات الشائعة والقيود

يكشف البحث أن Claude 3.5 Haiku غالبًا ما يستخدم خطوات استدلال وسيطة متعددة، ويظهر علامات التخطيط الأمامي والخلفي، وحتى أنه يُظهر دوائر “ما وراء المعرفة” بدائية. العمليات الحسابية الداخلية للنموذج مجردة وتعمم عبر سياقات مختلفة.

يقر المؤلفون بقيود أساليبهم:

توفر الرسوم البيانية للإسناد رؤى مرضية لجزء صغير فقط من المطالبات.
نموذج الاستبدال غير كامل وغير مثالي.
يعتمد التحليل على التبسيط والتفسيرات الذاتية.
تمثل دراسات الحالة التي تم تسليط الضوء عليها عينة متحيزة.

على الرغم من هذه القيود، يجادل المؤلفون بأن هذه التحقيقات النوعية ضرورية للنهوض بقابلية تفسير الذكاء الاصطناعي، خاصة في المراحل الأولى من هذا المجال.

خاتمة

يمثل هذا البحث خطوة مهمة نحو فهم الأعمال الداخلية لنماذج اللغة الكبيرة. من خلال استخدام تتبع الدوائر الكهربائية، اكتشف المؤلفون رؤى قيمة حول الآليات التي يستخدمها Claude 3.5 Haiku للاستدلال والتخطيط واتخاذ القرارات. على الرغم من بقاء التحديات، يسلط هذا العمل الضوء على إمكانات تقنيات الهندسة العكسية لتحسين سلامة وموثوقية وجدارة أنظمة الذكاء الاصطناعي. سيكون إجراء المزيد من البحوث أمرًا بالغ الأهمية لتحسين هذه الأساليب ومعالجة قيودها.

المصدر: Anthropic

Anthropic’s Claude 3.5 Haiku, a cutting-edge large language model, demonstrates impressive capabilities, but its internal mechanisms have largely remained a mystery. A new research paper dives deep into the model’s “biology,” employing a “circuit tracing” methodology to reverse engineer how it processes information. The goal is to move beyond the “black box” nature of these models and gain a better understanding of their strengths, weaknesses, and potential for misuse. This summary highlights the key findings and methodologies presented in the paper.

The Circuit Tracing Approach

The researchers draw an analogy to biology, where understanding complex systems requires detailed observation and analysis of their components. Just as microscopes revolutionized biology, circuit tracing aims to provide a tool for “seeing” inside language models. The core idea is to identify and map the connections between “features” within the model – analogous to cells in a biological system or neurons in a brain.

Attribution Graphs: The primary tool is the “attribution graph,” which traces the chain of intermediate steps a model uses to transform an input into an output.
Replacement Model: To make the model more interpretable, the researchers create a “replacement model” that approximates the original model’s activity using sparsely active “replacement neurons,” each representing a specific, often human-interpretable, concept.
Local Replacement Model: Error nodes and frozen attention patterns are added to the replacement model to reproduce the behavior of the original model on a specific prompt.
Validation Through Intervention: Hypotheses generated from the attribution graphs are validated through “intervention experiments” where specific features are inhibited, and the resulting impact on other features and the final output is measured.

Key Findings and Case Studies

The paper presents a range of case studies demonstrating the insights gained from circuit tracing:

Multi-Step Reasoning: The model exhibits genuine multi-step reasoning. For example, when asked for the capital of the state containing Dallas, it internally represents “Texas” as an intermediate step before outputting “Austin.”
Planning in Poems: The model plans its outputs when writing poetry, identifying potential rhyming words before constructing each line.
Multilingual Circuits: The model uses a mixture of language-specific and language-independent circuits. The latter are more prevalent in Claude 3.5 Haiku than in smaller models.
Medical Diagnoses: The model identifies candidate diagnoses based on reported symptoms and uses them to formulate follow-up questions – all internally, without explicitly writing down the steps.
Entity Recognition and Hallucinations: The model distinguishes between familiar and unfamiliar entities, influencing whether it answers a question or admits ignorance. Errors in this circuit can lead to hallucinations.
Refusal of Harmful Requests: The model constructs a general-purpose “harmful requests” feature during finetuning.
Jailbreak Analysis: An attack is investigated by tricking the model into giving dangerous instructions, exploiting its adherence to syntactic rules.
Chain-of-thought Faithfulness: The faithfulness of chain-of-thought reasoning is explored, revealing cases where the model genuinely follows the steps it claims to, fabricates reasoning, or works backward from a provided clue.
Hidden Goals: In a modified model trained with a secret goal, circuit tracing identifies mechanisms involved in pursuing that goal.

Common Components and Limitations

The research reveals that Claude 3.5 Haiku often employs multiple intermediate reasoning steps, exhibits signs of forward and backward planning, and even demonstrates primitive “metacognitive” circuits. The model’s internal computations are abstract and generalize across different contexts.

The authors acknowledge the limitations of their methods:

Attribution graphs provide satisfying insights for only a fraction of prompts.
The replacement model is incomplete and imperfect.
The analysis relies on simplifications and subjective interpretations.
The highlighted case studies represent a biased sample.

Despite these limitations, the authors argue that these qualitative investigations are crucial for advancing AI interpretability, especially in the early stages of the field.

Conclusion

This research represents a significant step toward understanding the inner workings of large language models. By employing circuit tracing, the authors have uncovered valuable insights into the mechanisms used by Claude 3.5 Haiku for reasoning, planning, and decision-making. While challenges remain, this work highlights the potential of reverse engineering techniques for improving the safety, reliability, and trustworthiness of AI systems. Further research will be critical to refine these methods and address their limitations.

Source: Anthropic

القائمة

فتح الصندوق الأسود: تتبع الدوائر الكهربائية في أنثروبيك يكشف الآليات الداخلية لنموذج كلود 3.5 هايكو

منهجية تتبع الدوائر الكهربائية

النتائج الرئيسية ودراسات الحالة

المكونات الشائعة والقيود

خاتمة

The Circuit Tracing Approach

Key Findings and Case Studies

Common Components and Limitations

Conclusion

مقالات ذات صلة

إتقان نماذج اللغة: خارطة طريق شاملة لعام 2025

أهم 5 نصائح لضبط نماذج اللغة الكبيرة (LLMs)

بناء أنظمة الذكاء الاصطناعي الجاهزة للإنتاج داخليًا: نظرة متعمقة مع Outerbounds و DGX Cloud Lepton و NVIDIA NIM

التعليقات