Hugging Face Releases FineMath: A New Dataset for Training Math-Capable AI

Hugging Face has recently unveiled FineMath, a meticulously curated dataset designed to empower the training of AI models with robust mathematical reasoning and problem-solving capabilities. This release addresses a critical need in the machine learning community for accessible, high-quality mathematical educational content. FineMath is derived from CommonCrawl data and aims to lower the barrier to entry for researchers and developers seeking to build more mathematically adept AI systems.

What is FineMath?

FineMath is a collection of mathematical educational content filtered from the vast CommonCrawl dataset. The dataset was created by training a mathematical content classifier on annotations generated by Llama-3.1-70B-Instruct. The goal was to retain only the most educational content, prioritizing clear explanations and step-by-step problem-solving over advanced academic papers.

Key Features and Versions:

Two primary versions:
- FineMath-3+: Contains 34 billion tokens and 21.4 million documents.
- FineMath-4+: A higher-quality subset of FineMath-3+, featuring 9.6 billion tokens and 6.7 million documents. Models trained on this subset outperform those trained on FineMath-3+ on benchmarks such as GSM8k and MATH.
InfiMM-WebMath Datasets: Filtered English text-only portions of the InfiMM-WebMath-40B dataset are also released:
- InfiMM-WebMath-3+: 20.5 billion tokens, 13.9 million documents.
- InfiMM-WebMath-4+: 8.5 billion tokens, 6.3 million documents.
Dataset Format: Formatted with Markdown and LaTeX for ease of use and integration with existing tools.
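Dataset Size: FineMath-3+ falls in the 10M to 100M document range (21.4 million documents). The figure counts documents, not megabytes: 34 billion tokens of text occupy far more than 100 MB, so plan storage and streaming accordingly.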
Licensing: Released under the Open Data Commons Attribution License (ODC-By) v1.0.
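
To put the figures above in proportion, the following minimal sketch computes the average document length and the share of FineMath-3+ that survives into FineMath-4+. It uses only the token and document counts quoted in this article; the helper names `avg_tokens_per_doc` and `share_of` are hypothetical and introduced here for illustration.

```python
# Token and document counts as quoted in the article.
SUBSETS = {
    "FineMath-3+": {"tokens": 34e9, "docs": 21.4e6},
    "FineMath-4+": {"tokens": 9.6e9, "docs": 6.7e6},
    "InfiMM-WebMath-3+": {"tokens": 20.5e9, "docs": 13.9e6},
    "InfiMM-WebMath-4+": {"tokens": 8.5e9, "docs": 6.3e6},
}


def avg_tokens_per_doc(name: str) -> float:
    """Average number of tokens per document for one subset."""
    subset = SUBSETS[name]
    return subset["tokens"] / subset["docs"]


def share_of(child: str, parent: str, key: str) -> float:
    """Fraction of the parent's tokens (or docs) that the child contains."""
    return SUBSETS[child][key] / SUBSETS[parent][key]


if __name__ == "__main__":
    for name in SUBSETS:
        print(f"{name}: ~{avg_tokens_per_doc(name):,.0f} tokens per document")
    token_share = share_of("FineMath-4+", "FineMath-3+", "tokens")
    doc_share = share_of("FineMath-4+", "FineMath-3+", "docs")
    print(f"FineMath-4+ keeps {token_share:.0%} of tokens, {doc_share:.0%} of docs")
```

With these figures, FineMath-3+ averages roughly 1,590 tokens per document, and FineMath-4+ keeps about 28% of the tokens and 31% of the documents of FineMath-3+.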

Dataset Curation Process:

The creation of FineMath involved a multi-stage process to ensure high quality:

Initial Content Extraction and Classification: CommonCrawl pages were re-extracted, and Llama-3.1-70B-Instruct was used to generate annotations on a 3-point scale, assessing mathematical content, logical reasoning, and step-by-step solutions. A classifier was then fine-tuned on these annotations.
Recalling More Candidate Pages: Addressing limitations of previous filters, the team identified promising website domains, added URLs from OpenWebMath and InfiMM-WebMath, and recovered URLs filtered due to LaTeX notation. Content was then re-extracted using the OpenWebMath pipeline.
Refined Quality Assessment: A more granular 5-point scale was used to score the expanded corpus, and a new classifier was fine-tuned. Deduplication using MinHash-LSH was then applied to obtain FineMath-3+. The same classifier was used on the InfiMM-WebMath dataset. Both datasets were filtered to remove non-English content.
Decontamination: To prevent data leakage, samples with 13-gram overlaps against test sets from GSM8k, MATH, MMLU, and ARC were removed.
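
The decontamination step can be illustrated with a minimal sketch. This is not the FineMath team's code: the tokenization (lowercasing and splitting on word characters) and all function names are assumptions made for illustration, and only the 13-gram window comes from the article.

```python
import re
from typing import Iterable, List, Set, Tuple

NGRAM_SIZE = 13  # n-gram length stated in the article

Ngram = Tuple[str, ...]


def tokenize(text: str) -> List[str]:
    """Assumed normalization: lowercase, then split into word tokens."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens: List[str], n: int = NGRAM_SIZE) -> Set[Ngram]:
    """All contiguous n-token windows; empty if the text is shorter than n."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts: Iterable[str], n: int = NGRAM_SIZE) -> Set[Ngram]:
    """Collect every n-gram that appears in any benchmark test sample."""
    index: Set[Ngram] = set()
    for text in benchmark_texts:
        index |= ngrams(tokenize(text), n)
    return index


def is_contaminated(document: str, index: Set[Ngram], n: int = NGRAM_SIZE) -> bool:
    """True if the document shares at least one n-gram with the benchmark index."""
    return not ngrams(tokenize(document), n).isdisjoint(index)


def decontaminate(documents: Iterable[str], benchmark_texts: Iterable[str],
                  n: int = NGRAM_SIZE) -> List[str]:
    """Keep only the documents with no n-gram overlap against the benchmarks."""
    index = build_benchmark_index(benchmark_texts, n)
    return [doc for doc in documents if not is_contaminated(doc, index, n)]
```

In this sketch a document is dropped as soon as a single 13-token window matches a test sample, which errs toward removing too much rather than leaking benchmark text into training data.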

Performance and Results:

Evaluations have shown that FineMath-3+ outperforms the base InfiMM-WebMath dataset on the GSM8k and MATH benchmarks. Furthermore, FineMath-4+ surpasses both FineMath-3+ and InfiMM-WebMath-4+. Combining FineMath-3+ and InfiMM-WebMath-3+ can yield approximately 50 billion tokens with performance comparable to FineMath-3+ alone.

Dataset Schema Highlights:

The dataset includes fields such as:

url: Source page URL
text: Page content
token_count: Number of Llama tokens
char_count: Character count
metadata: Additional OpenWebMath metadata
score: Raw quality score
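
As an illustration of how these fields might be used, here is a minimal sketch of a typed record and a score filter. The field names come from the list above; the Python types (in particular treating `metadata` as an arbitrary value and `score` as a float) and the helper names are assumptions, and the idea that a "3+" or "4+" subset corresponds to a minimum score of 3 or 4 is inferred from the subset names rather than stated in the article.

```python
from typing import Any, Iterable, Iterator, TypedDict


class FineMathRecord(TypedDict):
    url: str          # source page URL
    text: str         # page content (Markdown and LaTeX)
    token_count: int  # number of Llama tokens
    char_count: int   # number of characters
    metadata: Any     # additional OpenWebMath metadata (type is an assumption)
    score: float      # raw quality score (float is an assumption)


def filter_by_score(records: Iterable[FineMathRecord],
                    min_score: float) -> Iterator[FineMathRecord]:
    """Yield only the records whose raw score reaches the threshold."""
    for record in records:
        if record["score"] >= min_score:
            yield record


def total_tokens(records: Iterable[FineMathRecord]) -> int:
    """Sum the token_count field over a collection of records."""
    return sum(record["token_count"] for record in records)
```

Because `filter_by_score` is a generator, it can be chained onto a streamed dataset without holding every record in memory.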

Considerations and Limitations:

The dataset has certain inherent biases, including a focus on English language content and an emphasis on popular educational approaches to mathematics. It may also be limited in capturing advanced mathematical content and preserving image-based notation. Users should be aware of these factors when using the dataset for training.

Getting Started:

To load the dataset, users can use the datasets library from Hugging Face. The original source provides sample code for loading both the finemath-3plus and finemath-4plus subsets.
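
A minimal loading sketch follows, assuming the `datasets` library is installed. The subset names come from the article; the repository ID `HuggingFaceTB/finemath` and the `train` split are assumptions not stated in the text, and the `load_finemath` and `preview` helpers are hypothetical names introduced here. Streaming is used so that nothing is downloaded in full before iteration starts.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List

REPO_ID = "HuggingFaceTB/finemath"  # assumed Hub repository ID, not given in the article
SUBSETS = ("finemath-3plus", "finemath-4plus")  # subset names from the article


def load_finemath(subset: str = "finemath-4plus", streaming: bool = True) -> Any:
    """Load one FineMath subset; the 'train' split is an assumption."""
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset {subset!r}; expected one of {SUBSETS}")
    # Imported lazily so the module can be imported without the library installed.
    from datasets import load_dataset  # pip install datasets
    return load_dataset(REPO_ID, subset, split="train", streaming=streaming)


def preview(records: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return the URL and token count of the first n records."""
    return [
        {"url": r["url"], "token_count": r["token_count"]}
        for r in islice(records, n)
    ]
```

Calling `preview(load_finemath("finemath-4plus"))` would then fetch just the first few records over the network, which is a cheap way to check the schema before committing to a full download.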

In conclusion, FineMath represents a significant contribution to the field of AI and mathematics education. By providing a carefully curated and accessible dataset, Hugging Face is enabling researchers and developers to build more capable and reliable AI systems for mathematical problem-solving. Despite the limitations noted above, the dataset has clear potential to advance the field.

Source: Hugging Face TB Research
