Update documenatation for AGL Voice Assistant

- Update documentation for voice assistant. Bug-AGL: SPEC-5200 Change-Id: Ib0144041b2d152c22038bfb64402c43bf5601501 Signed-off-by: Anuj Solanki <anuj603362@gmail.com> Reviewed-on: https://gerrit.automotivelinux.org/gerrit/c/AGL/documentation/+/30417 Reviewed-by: Jan-Simon Moeller <jsmoeller@linuxfoundation.org> Tested-by: Jan-Simon Moeller <jsmoeller@linuxfoundation.org>
author: Anuj Solanki <anuj603362@gmail.com> 2024-10-16 01:44:44 +0530
committer: Jan-Simon Moeller <jsmoeller@linuxfoundation.org> 2024-10-22 11:49:19 +0000
commit: c970f088816e6d1d6992743d6771bcd03c206c4a (patch)
tree: 05774ba70cb4fa8d732a85fa6f8e14dab9a9414b
parent: 022fdcb40bb169dad57f86e62b3501cf1c6f5474 (diff)
8 files changed, 88 insertions, 28 deletions
diff --git a/docs/06_Component_Documentation/10_agl_voice_agent_assistant.md b/docs/06_Component_Documentation/10_agl_voice_agent_assistant.md
index 97bdd0d..e75382e 100644
--- a/docs/06_Component_Documentation/10_agl_voice_agent_assistant.md
+++ b/docs/06_Component_Documentation/10_agl_voice_agent_assistant.md
@@ -5,10 +5,15 @@ title: AGL Voice Agent / Assistant
 # AGL Voice Agent / Assistant
 
 # Introduction
-A gRPC-based voice agent designed for Automotive Grade Linux (AGL). This service leverages GStreamer, Vosk, Snips, and RASA to seamlessly process user voice commands. It converts spoken words into text, extracts intents from these commands, and performs actions through the Kuksa interface. The voice agent is designed to be modular and extensible, allowing for the addition of new speech recognition and intent extraction models.
+A gRPC-based voice agent designed for Automotive Grade Linux (AGL). This service leverages GStreamer, Vosk, Whisper AI, and Snips to process user voice commands seamlessly. It converts spoken words into text, extracts intents from these commands, and performs actions through the Kuksa interface. The voice agent is modular and extensible, allowing for the addition of new speech recognition and intent extraction models. Whisper AI is available for both online and offline modes, offering more accurate transcription compared to Vosk, though it may run slower in offline mode.
+
+Note: RASA NLU is currently not available as it is not supported by Python 3.12, the version used in AGL.
+
 
 # Installation and Usage
-Before we dive into the detailed components documentation, let's first take a look at how to install and use the voice agent service. All of the features of the voice agent service are encapsulated in the `meta-offline-voice-agent` sub-layer which can be found under `meta-agl-devel` layer. These features are currently part of the `master` branch only. This sub-layer can be built into the final image by using following commands:
+Before diving into the detailed components documentation, let's first look at how to install and use the voice agent service. All of the features of the voice agent service are encapsulated in the `meta-offline-voice-agent` sub-layer, which can be found under the `meta-agl-devel layer`. These features are currently available in the `master` branch only. You can build this sub-layer into the final image using the following commands:
+
+### Building for QEMU x86-64
 
 ```shell
 $ source master/meta-agl/scripts/aglsetup.sh -m qemux86-64 -b build-master agl-demo agl-devel agl-offline-voice-agent
@@ -16,36 +21,73 @@ $ source agl-init-build-env
 $ bitbake agl-ivi-demo-flutter
 ```
 
-After the build is complete, you can run the final image using QEMU. Once the image is running, you can start the voice agent service by running the following command:
+### Building for Raspberry Pi 5
+
+Whisper AI does not work on the emulator; therefore, building and running on a physical device like the Raspberry Pi 5 is recommended. You can set up the build environment specifically for Raspberry Pi 5 using:
+
+```shell
+$ source master/meta-agl/scripts/aglsetup.sh -m raspberrypi5 -b build-master agl-demo agl-devel agl-offline-voice-agent
+$ source agl-init-build-env
+$ bitbake agl-ivi-demo-flutter
+```
+
+<br>
+**Note**: The voice assistant client is already integrated into the `flutter-ics-homescreen`, which is the default homescreen app for `agl-ivi-demo-flutter`. However, if you wish to build any other image, you can use the `flutter-voiceassistant` app, which is a standalone flutter-based client for the voice assistant. You can add `flutter-voiceassistant` to your build by adding the following line to the `conf/local.conf` file:
+
 ```shell
-$ voiceagent-service run-server --default
+FEATURE_PACKAGES_agl-offline-voice-agent:append = " \
+    flutter-voiceassistant \
+"
 ```
 
-The `--default` flag loads the voice agent service with default configuration. The default configuration file looks like this:
+The voice agent service will automatically start on startup with the default configuration located at 
+`/etc/default/voice-agent-config.ini.`
+
+
+The default configuration file looks like this:
 ```ini
 [General]
 base_audio_dir = /usr/share/nlu/commands/
-stt_model_path = /usr/share/vosk/vosk-model-small-en-us-0.15/
+vosk_model_path = /usr/share/vosk/vosk-model-small-en-us-0.15/
+whisper_model_path = /usr/share/whisper/tiny.pt
+whisper_cpp_path = /usr/bin/whisper-cpp
+whisper_cpp_model_path = /usr/share/whisper-cpp/models/tiny.en.bin
 wake_word_model_path = /usr/share/vosk/vosk-model-small-en-us-0.15/
 snips_model_path = /usr/share/nlu/snips/model/
 channels = 1
 sample_rate = 16000
 bits_per_sample = 16
-wake_word = hello auto
+wake_word = hey automotive
 server_port = 51053
 server_address = 127.0.0.1
 rasa_model_path = /usr/share/nlu/rasa/models/
 rasa_server_port = 51054
-rasa_detached_mode = 0
+rasa_detached_mode = 1
 base_log_dir = /usr/share/nlu/logs/
 store_voice_commands = 0
+online_mode = 1
+online_mode_address = online-whisper-asr-service-address
+online_mode_port = online-whisper-asr-service-port
+online_mode_timeout = 15
+mpd_ip = 127.0.0.1
+mpd_port = 6600
 
 [Kuksa]
 ip = 127.0.0.1
-port = 8090
-protocol = ws
-insecure = True
-token = /usr/lib/python3.10/site-packages/kuksa_certificates/jwt/super-admin.json.token
+port = 55555
+protocol = grpc
+insecure = 0
+token =  /usr/lib/python3.12/site-packages/kuksa_certificates/jwt/super-admin.json.token
+tls_server_name = Server
+
+[VSS]
+hostname = localhost
+port = 55555
+protocol = grpc
+insecure = 0
+token_filename = /etc/xdg/AGL/agl-vss-helper/agl-vss-helper.token
+ca_cert_filename = /etc/kuksa-val/CA.pem
+tls_server_name = Server
 
 [Mapper]
 intents_vss_map = /usr/share/nlu/mappings/intents_vss_map.json
@@ -56,8 +98,6 @@ Most of the above configuration variable are self explanatory, however, I'll div
 
 - **`store_voice_commands`**: This variable is used to enable/disable the storage of voice commands. If this variable is set to `1`, then the voice commands will be stored in the `base_audio_dir` directory. The voice commands are stored in the following format: `base_audio_dir/<timestamp>.wav`. The `timestamp` is the time at which the voice command was received by the voice agent service.
 
-- **`rasa_detached_mode`**: This variable is used to enable/disable the detached mode for the RASA NLU engine. If this variable is set to `1`, then the RASA NLU engine will be run in detached mode, i.e. the voice agent service won't run it and will assume that RASA is already running. This is useful when you want to run the RASA NLU engine on a separate machine. If this variable is set to `0`, then the RASA NLU engine will be run as a sub process of the voice agent service.
-
 - **`intents_vss_map`**: This is the path to the file that actually maps the intent output from our intent engine to the VSS signal specification. This file is in JSON format and contains the mapping for all the intents that we want to support. The default file looks like this:
 
     ```json
@@ -70,18 +110,18 @@ Most of the above configuration variable are self explanatory, however, I'll div
         },
         "HVACFanSpeed": {
             "signals": [
-                "Vehicle.Cabin.HVAC.Station.Row1.Left.FanSpeed",
-                "Vehicle.Cabin.HVAC.Station.Row1.Right.FanSpeed",
-                "Vehicle.Cabin.HVAC.Station.Row2.Left.FanSpeed",
-                "Vehicle.Cabin.HVAC.Station.Row2.Right.FanSpeed"
+                "Vehicle.Cabin.HVAC.Station.Row1.Driver.FanSpeed",
+                "Vehicle.Cabin.HVAC.Station.Row1.Passenger.FanSpeed",
+                "Vehicle.Cabin.HVAC.Station.Row2.Driver.FanSpeed",
+                "Vehicle.Cabin.HVAC.Station.Row2.Passenger.FanSpeed"
             ]
         },
         "HVACTemperature": {
             "signals": [
-                "Vehicle.Cabin.HVAC.Station.Row1.Left.Temperature",
-                "Vehicle.Cabin.HVAC.Station.Row1.Right.Temperature",
-                "Vehicle.Cabin.HVAC.Station.Row2.Left.Temperature",
-                "Vehicle.Cabin.HVAC.Station.Row2.Right.Temperature"
+                "Vehicle.Cabin.HVAC.Station.Row1.Driver.Temperature",
+                "Vehicle.Cabin.HVAC.Station.Row1.Passenger.Temperature",
+                "Vehicle.Cabin.HVAC.Station.Row2.Driver.Temperature",
+                "Vehicle.Cabin.HVAC.Station.Row2.Passenger.Temperature"
             ]
         }
     }
@@ -142,21 +182,27 @@ $ voiceagent-service run-server --config path/to/config.ini
 One thing to note here is that all the directory paths in the configuration file should be absolute and always end with a `/`.
 
 # High Level Architecture
-![Voice_Agent_Architecture](images/agl-voice-agent/AGL_Offline_VoiceAgent_(High_Level_Architecture).png)
+![Voice_Agent_Architecture](images/agl-voice-agent/AGL_VoiceAgent_updated_architecture.png)
 
 # Components
 - Voice Agent Service
+    - Whisper AI
     - Vosk Kaldi
     - Snips
     - RASA
 - Voice Assistant App
 
 # Voice Agent Service
-The voice agent service is a gRPC-based service that is responsible for converting spoken words into text, extracting intents from these commands, and performing actions through the Kuksa interface. The service is composed of three main components: Vosk Kaldi, RASA, and Snips.
+The voice agent service is a gRPC-based service that is responsible for converting spoken words into text, extracting intents from these commands, and performing actions through the Kuksa interface. The service is composed of three main components: Whisper AI, Vosk Kaldi, RASA, and Snips.
+
+## Whisper AI
+Whisper AI is a versatile speech recognition model developed by OpenAI. It is designed to convert spoken words into text with high accuracy and supports multiple languages out of the box. Whisper AI offers various pre-trained models, each optimized for different hardware capabilities and performance requirements. The current voice agent service uses Whisper AI for **speech recognition**. In offline mode, Whisper AI processes speech locally on the device, providing an accurate transcription of user commands. While Whisper AI may be slightly slower than other models like Vosk, it offers higher accuracy, especially for complex and noisy inputs. It does not currently support wake word detection, so it is used alongside other tools like Vosk for a comprehensive voice assistant solution.
 
 ## Vosk Kaldi
 Vosk Kaldi is a speech recognition toolkit that is based on Kaldi and Vosk. It is used to convert spoken words into text. It provides us with some official pre-trained models for various popular languages. We can also train our own models using the Vosk Kaldi toolkit. The current voice agent service requires two different models to run, one for **wake-word detection** and one for **speech recognition**. The wake word detection model is used to detect when the user says the wake word, which is "Hey Automotive" by default, we can easily change the default wake word by modifying the config file. The speech recognition model is used to convert the user's spoken words into text.
 
+**Note:** The current **wake word** is set to **"Hey Automotive,"** but you can change it to any wake word of your choice. However, if you plan to use a more specific wake word, you may need to train the Vosk wake word model yourself, as the pre-trained models may not be optimized for your specific wake word.
+
 ## Snips
 Snips NLU (Natural Language Understanding) is a Python based Intent Engine that allows to extract structured information from sentences written in natural language. The NLU engine first detects what the intention of the user is (a.k.a. intent), then extracts the parameters (called slots) of the query. The developer can then use this to determine the appropriate action or response. Our voice agent service uses either Snips or RASA to extract intents from the user's spoken commands.
 
@@ -203,6 +249,9 @@ To set up and run the Snips NLU Intent Engine, follow these steps:
 - The Snips NLU engine is not very accurate as compared to RASA, however, its extremely lightweight and really fast.
 
 ## RASA
+
+**Note:** : RASA is currently not included into the build as it is not supported by Python 3.12, the version used in AGL.
+
 RASA is an open-source machine learning framework for building contextual AI assistants and chatbots. It is based on Python and TensorFlow. It is used to extract intents from the user's spoken commands. The RASA NLU engine is trained on a dataset that contains intents, entities, and sample utterances. The RASA NLU engine is used to parse user queries and extract the intent and relevant entities for further processing.
 
 It is recommended to take a brief look at [RASA Official Documentation](https://rasa.com/docs/rasa/) to get a better understanding of how RASA works.
@@ -279,8 +328,19 @@ To set up and run the RASA NLU Intent Engine, follow these steps:
 - The underlying AI arhictecture of the RASA NLU is extensible and changeable thanks to the TensorFlow backend.
 - The RASA NLU engine is very accurate as compared to Snips, however, its heavy and slightly slow.
 
-# Voice Assistant App
-The voice assistant app is a flutter based application made for Automotive Grade Linux (AGL). It is responsible for interacting with the voice agent service for user voice command recognition, intent extraction, and command execution. It also receives the response from the voice agent service and displays it on the screen. Some app UI screenshots are attached below.
+# Voice Assistant Client
+
+## flutter-ics-homescreen Integration
+The voice assistant client is integrated into the `flutter-ics-homescreen` app, which can interact with the voice agent service to process user voice commands.
+
+| ![Flutter-ics-homescreen_1](images/agl-voice-agent/flutter-ics-homescreen-voice-assistant-1.png) | ![Flutter-ics-homescreen_2](images/agl-voice-agent/flutter-ics-homescreen-voice-assistant-2.png) |
+|:--:|:--:|
+
+| ![Flutter-ics-homescreen_1](images/agl-voice-agent/flutter-ics-homescreen-voice-assistant-3.png) | ![Flutter-ics-homescreen_2](images/agl-voice-agent/flutter-ics-homescreen-voice-assistant-settings.png) |
+|:--:|:--:|
+
+## Voice Assistant App
+The voice assistant app is a standalone, Flutter-based client designed for Automotive Grade Linux (AGL). It is responsible for interacting with the voice agent service for user voice command recognition, intent extraction, and command execution. It also receives the response from the voice agent service and displays it on the screen. Some app UI screenshots are attached below.
 
-![Voice_Agent_App_1](images/agl-voice-agent/voice-assistant-flutter-1.png)
-![Voice_Agent_App_2](images/agl-voice-agent/voice-assistant-flutter-2.png)
+| ![Voice_Agent_App_1](images/agl-voice-agent/voice-assistant-updated-1.png) | ![Voice_Agent_App_2](images/agl-voice-agent/voice-assistant-updated-2.png) |
+|:--:|:--:|
+\ No newline at end of file
diff --git a/docs/06_Component_Documentation/images/agl-voice-agent/AGL_VoiceAgent_updated_architecture.png b/docs/06_Component_Documentation/images/agl-voice-agent/AGL_VoiceAgent_updated_architecture.png
new file mode 100644
index 0000000..b302a45
--- /dev/null
+++ b/docs/06_Component_Documentation/images/agl-voice-agent/AGL_VoiceAgent_updated_architecture.png
diff --git a/docs/06_Component_Documentation/images/agl-voice-agent/flutter-ics-homescreen-voice-assistant-1.png b/docs/06_Component_Documentation/images/agl-voice-agent/flutter-ics-homescreen-voice-assistant-1.png
new file mode 100644
index 0000000..ecd5c04
--- /dev/null
+++ b/docs/06_Component_Documentation/images/agl-voice-agent/flutter-ics-homescreen-voice-assistant-1.png
diff --git a/docs/06_Component_Documentation/images/agl-voice-agent/flutter-ics-homescreen-voice-assistant-2.png b/docs/06_Component_Documentation/images/agl-voice-agent/flutter-ics-homescreen-voice-assistant-2.png
new file mode 100644
index 0000000..7a12045
--- /dev/null
+++ b/docs/06_Component_Documentation/images/agl-voice-agent/flutter-ics-homescreen-voice-assistant-2.png
diff --git a/docs/06_Component_Documentation/images/agl-voice-agent/flutter-ics-homescreen-voice-assistant-3.png b/docs/06_Component_Documentation/images/agl-voice-agent/flutter-ics-homescreen-voice-assistant-3.png
new file mode 100644
index 0000000..2e2b5d1
--- /dev/null
+++ b/docs/06_Component_Documentation/images/agl-voice-agent/flutter-ics-homescreen-voice-assistant-3.png
diff --git a/docs/06_Component_Documentation/images/agl-voice-agent/flutter-ics-homescreen-voice-assistant-settings.png b/docs/06_Component_Documentation/images/agl-voice-agent/flutter-ics-homescreen-voice-assistant-settings.png
new file mode 100644
index 0000000..ebc5007
--- /dev/null
+++ b/docs/06_Component_Documentation/images/agl-voice-agent/flutter-ics-homescreen-voice-assistant-settings.png
diff --git a/docs/06_Component_Documentation/images/agl-voice-agent/voice-assistant-updated-1.png b/docs/06_Component_Documentation/images/agl-voice-agent/voice-assistant-updated-1.png
new file mode 100644
index 0000000..c32b395
--- /dev/null
+++ b/docs/06_Component_Documentation/images/agl-voice-agent/voice-assistant-updated-1.png
diff --git a/docs/06_Component_Documentation/images/agl-voice-agent/voice-assistant-updated-2.png b/docs/06_Component_Documentation/images/agl-voice-agent/voice-assistant-updated-2.png
new file mode 100644
index 0000000..1d70e6c
--- /dev/null
+++ b/docs/06_Component_Documentation/images/agl-voice-agent/voice-assistant-updated-2.png
author	Anuj Solanki <anuj603362@gmail.com>	2024-10-16 01:44:44 +0530
committer	Jan-Simon Moeller <jsmoeller@linuxfoundation.org>	2024-10-22 11:49:19 +0000
commit	c970f088816e6d1d6992743d6771bcd03c206c4a (patch)
tree	05774ba70cb4fa8d732a85fa6f8e14dab9a9414b
parent	022fdcb40bb169dad57f86e62b3501cf1c6f5474 (diff)