{"id":160,"date":"2019-02-04T23:53:25","date_gmt":"2019-02-04T23:53:25","guid":{"rendered":"https:\/\/projects-old.etc.cmu.edu\/ako\/?p=160"},"modified":"2019-02-13T01:32:28","modified_gmt":"2019-02-13T01:32:28","slug":"week-3-lets-look-at-the-tech","status":"publish","type":"post","link":"https:\/\/projects-old.etc.cmu.edu\/ako\/index.php\/2019\/02\/04\/week-3-lets-look-at-the-tech\/","title":{"rendered":"Week 3: Let&#8217;s look at the tech"},"content":{"rendered":"<p>As mentioned in the last post, Ako explored both passthrough AR and speech processing technologies last week. We did some prototyping and vetting and, finally, settled on design decisions that fit the tech. This lets us spend the coming weeks focusing on the core of our project, which is to create a smooth conversational experience for the player, rather than worrying about implementing the technology itself.<\/p>\n<p>Here, we focus on the two main technical pillars on which this project is built.<\/p>\n<h2>The AR Part<\/h2>\n<p>We use AR so that the player stays aware of the outside world and can interact with the physical game objects while also viewing virtual content. Our experience involves one main virtual character &#8211; the agent &#8211; and other help content that depends on the game pieces on the table.<\/p>\n<p>Thus, we are now looking at a feature that could\u00a0<em>track game states and pieces\u00a0<\/em>throughout the experience to provide interactive advice and guidance. The headset we use is expected to take in the input video feed and process the items in the scene. 
We tried the following combinations:<\/p>\n<p><strong>HoloLens + Vuforia:<\/strong><\/p>\n<p>Pros:<\/p>\n<ul style=\"list-style-type: square;\">\n<li>Official Vuforia support<\/li>\n<li>Wireless<\/li>\n<\/ul>\n<p>Cons:<\/p>\n<ul style=\"list-style-type: square;\">\n<li>Small field of view<\/li>\n<li>Low battery life<\/li>\n<li>Cannot live stream\u00a0when using Vuforia<\/li>\n<\/ul>\n<p><strong>Zed Camera + Vuforia:<\/strong><strong><br \/>\n<\/strong>Not supported due to the stereo nature of the video feed from the Zed Camera.<\/p>\n<p><strong>Zed Camera + ArUco Markers<\/strong><\/p>\n<p>Pros:<\/p>\n<ul style=\"list-style-type: square;\">\n<li>Relatively smooth tracking<\/li>\n<li>Proven compatibility<\/li>\n<li>Versatile for design<\/li>\n<\/ul>\n<p>Cons:<\/p>\n<ul style=\"list-style-type: square;\">\n<li>Need to think about players outside of the headset<\/li>\n<li>Relies on the stability of the augmented content<\/li>\n<\/ul>\n<p>Thus, we have decided to move forward with the Zed Camera attached to an Oculus Rift and use ArUco markers to track the game pieces.<\/p>\n<p><strong>Planned AR Pipeline<\/strong><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-166\" src=\"https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/02\/Steps-1-300x48.png\" alt=\"\" width=\"300\" height=\"48\" srcset=\"https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/02\/Steps-1-300x48.png 300w, https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/02\/Steps-1.png 453w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium\" src=\"https:\/\/i.imgur.com\/YtCUog8.gif\" width=\"600\" height=\"338\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-167 size-medium\" src=\"https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/02\/Steps-2-e1549316561564-300x45.png\" alt=\"\" 
width=\"300\" height=\"45\" srcset=\"https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/02\/Steps-2-e1549316561564-300x45.png 300w, https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/02\/Steps-2-e1549316561564.png 452w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium\" src=\"https:\/\/i.imgur.com\/yAY9Q9B.gif\" width=\"600\" height=\"338\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-168\" src=\"https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/02\/Steps-3-e1549324086444-300x43.png\" alt=\"\" width=\"300\" height=\"43\" srcset=\"https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/02\/Steps-3-e1549324086444-300x43.png 300w, https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/02\/Steps-3-e1549324086444.png 451w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium\" src=\"https:\/\/i.imgur.com\/myl7W04.gif\" width=\"600\" height=\"338\" \/><\/p>\n<h2>The Conversation Part<\/h2>\n<p>Conversation with the AR character is our<em> most important mode of interaction\u00a0<\/em>between the player and the application. 
Therefore, speech recognition and processing form our other core feature.<\/p>\n<p>Here is a portion of the interaction chart from an earlier week&#8217;s blog post, which defines the interaction between the agent and the player.<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\"alignnone  wp-image-48\" src=\"https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/01\/IMG_20190122_165543693-300x271.jpg\" alt=\"\" width=\"601\" height=\"543\" srcset=\"https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/01\/IMG_20190122_165543693-300x271.jpg 300w, https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/01\/IMG_20190122_165543693-768x693.jpg 768w, https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/01\/IMG_20190122_165543693-1024x924.jpg 1024w, https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/01\/IMG_20190122_165543693-1196x1080.jpg 1196w\" sizes=\"(max-width: 601px) 100vw, 601px\" \/><br \/>\nAs we can see, the\u00a0agent both\u00a0<em>gets asked questions and answers questions.<\/em><\/p>\n<p>Thus, we need a 3-step system that can convert speech to text, process it, and convert the processed data back into speech. We make use of multiple available APIs to achieve this.<\/p>\n<p><strong>Step 1: Google Cloud Speech &#8211; Speech to Text<\/strong><\/p>\n<p>We are strongly inclined to use the <a href=\"https:\/\/cloud.google.com\/speech-to-text\/\">Google Cloud\u00a0Speech API<\/a> to convert our users&#8217; spoken input into text, which can later be processed by the &#8220;brain&#8221; of our program to give a corresponding response. 
We use the <a href=\"https:\/\/github.com\/steelejay\/LowkeySpeech\">Unity Lowkey Parser<\/a> project to integrate this into a Unity project.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-174\" src=\"https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/02\/google-sound-api-1.png\" alt=\"\" width=\"564\" height=\"207\" srcset=\"https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/02\/google-sound-api-1.png 564w, https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/02\/google-sound-api-1-300x110.png 300w\" sizes=\"(max-width: 564px) 100vw, 564px\" \/><\/p>\n<p>We loved it for many reasons, one being that it was able to understand multiple English accents. It is a cloud service built on neural network models, which Google continues to improve.<\/p>\n<p><strong>Step 2: Dialogflow for response interpretation<\/strong><\/p>\n<p>Now that we have a text form of the spoken input, we need to process it and give back an appropriate response. This\u00a0<em>processing\u00a0<\/em>is managed by a Google service called <a href=\"https:\/\/cloud.google.com\/dialogflow-enterprise\/\">Dialogflow<\/a>. Dialogflow\u00a0contains a Language Understanding Model that is capable of taking in the given text dialog and interpreting it based on how it is configured. 
The configuration also contains the set responses to return in each case.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-172\" src=\"https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/02\/dialogflow.png\" alt=\"\" width=\"561\" height=\"200\" srcset=\"https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/02\/dialogflow.png 561w, https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/02\/dialogflow-300x107.png 300w\" sizes=\"(max-width: 561px) 100vw, 561px\" \/><\/p>\n<p><strong>Step 3: Convert Response to Speech<\/strong><\/p>\n<p>Finally, the character needs to speak the interpreted response to the player. We first tried IBM Watson to convert the text into speech. The text response, wrapped as JSON in the previous step, is unwrapped and converted to SSML (Speech Synthesis Markup Language), a standard markup language for synthetic speech applications. IBM Watson is capable of reading this markup and converting it to an audio clip, which gets streamed into Unity.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-176\" src=\"https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/02\/ibm-watson.png\" alt=\"\" width=\"568\" height=\"209\" srcset=\"https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/02\/ibm-watson.png 568w, https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/02\/ibm-watson-300x110.png 300w\" sizes=\"(max-width: 568px) 100vw, 568px\" \/><\/p>\n<p>One major caveat with IBM Watson has been that it contains only one voice &#8211; Allison, which replicates the tone and accent of a female American speaker. We feel that this voice wouldn&#8217;t be as dynamic or as well suited to the kind of character design we are aiming for. 
Therefore, we are planning to try a much newer (and hence, less documented) text-to-speech API from Amazon, called <a href=\"https:\/\/aws.amazon.com\/polly\/\">Amazon Polly<\/a>, which runs on Amazon&#8217;s AWS cloud services.<\/p>\n<h2>Preparing for Quarters<\/h2>\n<p>We are one week away from Quarters Open House. We have been preparing ideas, plans and demos to show during Quarters, and we are finalizing our branding materials (posters, half-sheets, and logos) this week!<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone  wp-image-146\" src=\"https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/02\/ako-poster-small.jpg\" alt=\"\" width=\"452\" height=\"599\" srcset=\"https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/02\/ako-poster-small.jpg 453w, https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/02\/ako-poster-small-227x300.jpg 227w\" sizes=\"(max-width: 452px) 100vw, 452px\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone  wp-image-186\" src=\"https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/02\/ako-halfsheet-back0.png\" alt=\"\" width=\"452\" height=\"669\" srcset=\"https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/02\/ako-halfsheet-back0.png 563w, https:\/\/projects-old.etc.cmu.edu\/ako\/wp-content\/uploads\/2019\/02\/ako-halfsheet-back0-203x300.png 203w\" sizes=\"(max-width: 452px) 100vw, 452px\" \/><\/p>\n","protected":false},"excerpt":{"rendered":"<p>As mentioned in the last post, Ako explored both the passthrough AR and speech processing technologies last week. We did some prototyping, some vetting and, finally, settled on the design decisions to fit the tech. 
This was to make sure that we spend our next weeks focussing on the core essence of our project, which&hellip; <br \/> <a class=\"read-more\" href=\"https:\/\/projects-old.etc.cmu.edu\/ako\/index.php\/2019\/02\/04\/week-3-lets-look-at-the-tech\/\">Read more<\/a><\/p>\n","protected":false},"author":2,"featured_media":188,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-160","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/projects-old.etc.cmu.edu\/ako\/index.php\/wp-json\/wp\/v2\/posts\/160"}],"collection":[{"href":"https:\/\/projects-old.etc.cmu.edu\/ako\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/projects-old.etc.cmu.edu\/ako\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/projects-old.etc.cmu.edu\/ako\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/projects-old.etc.cmu.edu\/ako\/index.php\/wp-json\/wp\/v2\/comments?post=160"}],"version-history":[{"count":14,"href":"https:\/\/projects-old.etc.cmu.edu\/ako\/index.php\/wp-json\/wp\/v2\/posts\/160\/revisions"}],"predecessor-version":[{"id":236,"href":"https:\/\/projects-old.etc.cmu.edu\/ako\/index.php\/wp-json\/wp\/v2\/posts\/160\/revisions\/236"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/projects-old.etc.cmu.edu\/ako\/index.php\/wp-json\/wp\/v2\/media\/188"}],"wp:attachment":[{"href":"https:\/\/projects-old.etc.cmu.edu\/ako\/index.php\/wp-json\/wp\/v2\/media?parent=160"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/projects-old.etc.cmu.edu\/ako\/index.php\/wp-json\/wp\/v2\/categories?post=160"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/projects-old.etc.cmu.edu\/ako\/index.php\/wp-json\/wp\/v2\/tags?post=160"}],"curies":[{"name":"wp","href":"
https:\/\/api.w.org\/{rel}","templated":true}]}}