SpeechCompass: Making Group Conversations Easier to Follow with Sound Localization

Imagine being able to follow a group conversation effortlessly, even with hearing loss or in a noisy room. Google Research and Google DeepMind are moving toward that goal with SpeechCompass, an approach that improves mobile captioning using multi-microphone sound localization. The work, which recently received a “Best Paper” award at CHI 2025, addresses a major limitation of current speech-to-text applications: they make it hard to tell speakers apart in group settings. SpeechCompass aims to reduce cognitive load and improve accessibility by providing real-time speaker diarization and directional guidance.

Overcoming the Limitations of Existing Mobile Captioning

Current mobile automatic speech recognition (ASR) apps often concatenate all transcribed speech into a single stream, which makes it hard to follow who is speaking. This is a considerable obstacle for users who rely on accessibility features, language translation, note-taking, or meeting transcripts. Existing solutions, such as audio-visual speech separation and speaker embedding, are often impractical on mobile devices because they require a camera or the pre-registration of voiceprints. SpeechCompass offers a more practical and privacy-conscious alternative.

Introducing SpeechCompass: Diarization and Directional Guidance

SpeechCompass enhances mobile captioning with two key features:

Speaker diarization: Speakers are separated in the ASR transcript and marked with color-coded visual cues.
Real-time localization: Directional indicators, such as arrows, point the user to the source of the speech.

This multi-microphone approach offers several advantages:

Lower computational cost: The algorithm runs on small microcontrollers with limited memory and compute, unlike machine-learning-based approaches.
Reduced latency: Directional information is extracted from basic acoustic properties, enabling real-time operation with minimal lag.
Greater privacy preservation: The system assumes that speakers are physically separated and requires neither video nor unique personally identifying information.
Language-agnostic operation: SpeechCompass analyzes differences between audio waveforms without prior assumptions about the content.
Instant reconfiguration: Moving the phone reconfigures SpeechCompass immediately.
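Because the approach relies on speakers being physically separated, an estimated direction can stand in for a speaker identity. The sketch below illustrates that idea with a simple greedy labelling of direction estimates. It is an illustration only, not the method used in SpeechCompass: the function names and the 30-degree threshold are hypothetical choices made for this example.

```python
def angular_distance(a_deg, b_deg):
    """Smallest absolute difference between two bearings, in degrees (0 to 180)."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def assign_speakers(angles_deg, threshold_deg=30.0):
    """Label each direction estimate with a speaker index.

    An estimate is assigned to the closest already-seen direction if it lies
    within threshold_deg (an assumed value); otherwise it starts a new speaker.
    """
    centers = []  # first direction observed for each speaker
    labels = []
    for angle in angles_deg:
        best = None
        best_dist = None
        for idx, center in enumerate(centers):
            dist = angular_distance(angle, center)
            if dist <= threshold_deg and (best_dist is None or dist < best_dist):
                best, best_dist = idx, dist
        if best is None:
            centers.append(angle % 360.0)
            best = len(centers) - 1
        labels.append(best)
    return labels


if __name__ == "__main__":
    # Two talkers, one near 0 degrees and one near 195 degrees.
    print(assign_speakers([10, 15, 200, 355, 190]))
```

Note that 355 degrees is grouped with the talker near 10 degrees because the distance is measured around the circle. A moving phone would shift all bearings at once, which is why a real system would need to update the stored directions rather than keep them fixed as this sketch does.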

Implementation and Technical Details

SpeechCompass is implemented in two forms:

Phone case prototype: A custom phone case with four microphones connected to a low-power microcontroller provides 360-degree sound localization.
Software implementation: A software version for existing phones with two or more microphones (e.g., Pixel phones) provides 180-degree localization.

To cope with sound reverberation in indoor environments, the system uses a localization algorithm based on time difference of arrival (TDOA). The algorithm estimates the TDOA between microphone pairs using Generalized Cross-Correlation with Phase Transform (GCC-PHAT) and applies statistical estimation to improve precision. Using three or more microphones resolves the “front-back” ambiguity inherent in two-microphone solutions.
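The TDOA step can be sketched with NumPy as follows. This is a minimal textbook-style GCC-PHAT estimator for a single microphone pair, not the SpeechCompass implementation. The speed of sound, the far-field arcsine conversion, and the per-frame median (a stand-in for the unspecified statistical estimation) are all assumptions made for this example, as are the function names.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def gcc_phat(sig, ref, fs, max_tau=None):
    """Return the delay of sig relative to ref, in seconds, via GCC-PHAT."""
    n = len(sig) + len(ref)  # zero-pad so the correlation is linear, not circular
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    # Phase transform: discard magnitude and keep only the phase.
    cross /= np.maximum(np.abs(cross), 1e-12)
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))
    # Reorder so that index max_shift corresponds to zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / float(fs)


def median_tdoa(sig, ref, fs, frame_len=1024, max_tau=None):
    """Median of per-frame GCC-PHAT delays (assumed stand-in for the statistics)."""
    taus = [
        gcc_phat(sig[i:i + frame_len], ref[i:i + frame_len], fs, max_tau)
        for i in range(0, len(sig) - frame_len + 1, frame_len)
    ]
    return float(np.median(taus))


def tdoa_to_angle(tau, mic_distance):
    """Far-field angle from broadside, in degrees, for one microphone pair."""
    s = np.clip(SPEED_OF_SOUND * tau / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(s)))


if __name__ == "__main__":
    fs = 16000
    rng = np.random.default_rng(0)
    ref = rng.standard_normal(4096)
    sig = np.concatenate((np.zeros(5), ref[:-5]))  # sig lags ref by 5 samples
    tau = gcc_phat(sig, ref, fs)
    print(tau * fs, tdoa_to_angle(tau, mic_distance=0.15))
```

The arcsine only returns angles between -90 and +90 degrees, so a single pair cannot tell a source in front from its mirror image behind. That is the front-back ambiguity mentioned above, and it is why a single pair covers only 180 degrees.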

User Interface and Visualization Styles

The Android application displays the enhanced transcripts with several visualization styles that indicate speaker direction:

Colored text: Each speaker's text is shown in a different color.
Directional glyphs: Arrows, dials, or color highlights point to the speaker's location.
Minimap: A radar-like display shows the speaker's position.
Edge indicators: Visual cues around the screen edges highlight the speaker's direction.
Unwanted speech suppression: Users can tap the screen edges to suppress speech from specific directions, removing their own speech or irrelevant conversations from the transcript, which also improves privacy.
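Directional suppression can be pictured as filtering captions by their estimated direction of arrival. The following sketch is only an illustration of that idea: the `Caption` type, the sector representation, and the function names are hypothetical and do not describe the app's actual data model.

```python
from dataclasses import dataclass


@dataclass
class Caption:
    text: str
    angle_deg: float  # estimated direction of arrival for this caption


def in_sector(angle_deg, start_deg, end_deg):
    """True if the angle lies in the sector running from start to end (degrees)."""
    angle = angle_deg % 360.0
    start = start_deg % 360.0
    end = end_deg % 360.0
    if start <= end:
        return start <= angle <= end
    return angle >= start or angle <= end  # sector wraps through 0 degrees


def suppress(captions, muted_sectors):
    """Drop captions whose direction falls inside any muted (start, end) sector."""
    return [
        c for c in captions
        if not any(in_sector(c.angle_deg, s, e) for s, e in muted_sectors)
    ]


if __name__ == "__main__":
    captions = [Caption("hello", 10), Caption("my own voice", 180), Caption("bye", 350)]
    # Mute the sector facing the user, assumed here to be 135 to 225 degrees.
    for c in suppress(captions, [(135, 225)]):
        print(c.text)
```

Representing a muted region as a (start, end) pair keeps the wrap-around case explicit, so a sector such as 340 to 20 degrees is handled without special treatment by the caller.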

Performance and User Evaluation

Technical evaluations showed that SpeechCompass can localize sound direction accurately, with an average error of 11° to 22° at normal conversational loudness. This accuracy is comparable to human sound localization. The four-microphone configuration consistently achieved a lower diarization error rate (DER) than the three-microphone setup.
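For readers unfamiliar with the two metrics, the sketch below shows how they are commonly computed. It assumes the usual definitions: angular error measured around the circle, and DER as missed speech plus false alarms plus speaker confusion divided by total speech time. The source does not describe its exact evaluation procedure, and all numbers in the example are made up for illustration.

```python
def angular_error(estimate_deg, truth_deg):
    """Absolute angular error in degrees, taking wrap-around at 360 into account."""
    diff = abs(estimate_deg - truth_deg) % 360.0
    return min(diff, 360.0 - diff)


def mean_angular_error(estimates_deg, truths_deg):
    """Average angular error over paired estimates and ground-truth bearings."""
    errors = [angular_error(e, t) for e, t in zip(estimates_deg, truths_deg)]
    return sum(errors) / len(errors)


def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER under the common definition; all arguments are durations in seconds."""
    return (missed + false_alarm + confusion) / total_speech


if __name__ == "__main__":
    # Hypothetical estimates and ground truth, in degrees.
    print(mean_angular_error([350, 10, 95], [5, 0, 90]))
    # Hypothetical durations, in seconds.
    print(diarization_error_rate(2.0, 1.0, 3.0, 60.0))
```

Measuring the error around the circle matters for a 360-degree system: an estimate of 350° for a source at 5° is off by 15°, not 345°.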

User surveys revealed that current mobile captioning technology struggles to distinguish speakers in group conversations. User feedback on the SpeechCompass prototype highlighted the value of directional guidance, with colored text and directional arrows being the most preferred visualization styles.

Future Directions

SpeechCompass has a wide range of potential applications, including:

Classroom settings, for following discussions.
Business meetings, interviews, and social gatherings, for tracking speaker changes.

Future development may include:

Integration with wearable form factors such as smart glasses and smartwatches.
Improved noise robustness through machine learning.
Further customization of visualization preferences.
Longitudinal studies to understand adoption and behavior in everyday scenarios.

SpeechCompass is a significant step toward making communication more accessible and inclusive, and it may inspire further innovation in this area.

Source: Google Research, Google DeepMind

