A robot that holds a conversation

A research robot that needed to hold a conversation, not run a script. Working demo into the NSF grant package, on time.

Python · NAOqi · Azure OpenAI · GPT-3.5 Turbo · Whisper · SSH

Professor Keith Green¹, the father of architectural robotics, walked into the lab with a NAO v3 humanoid² under one arm and a deadline. The NSF grant package needed something it did not have yet: a robot that could actually talk with a person. Not pattern-match on "hello" and "goodbye", not page through a scripted dialog tree. Hold a conversation. Across turns. With context.

Stock NAO ships with a graphical authoring tool that points you at a keyword recognizer and a fixed vocabulary. Useful for a museum-floor demo. Not useful for what we needed. The vendor docs talk in marketing terms; the authoring tool is slow and points away from the API surface underneath. So I plugged in, got the IP off the head, and SSH'd into the robot.

Underneath the chrome was a custom Gentoo Linux build and naoqi³ running as the root process. naoqi is SoftBank's robot middleware; it exposes every onboard subsystem (motors, microphones, LEDs, speech, memory) as a service you can call over TCP from any language with a binding. The Python binding is the qi module. naoqi was first cut for NAO v3 around 2008. The model it would eventually orchestrate shipped in 2022. Fourteen years sit between the two ends of the conversation loop.
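To make the service model concrete, here is a minimal sketch of calling one of those services from Python. It assumes the qi binding's session.service lookup and the ALTextToSpeech service; the greet helper is a hypothetical name for illustration, not code from the project.

```python
def greet(session, text="Hello"):
    """Look up an on-robot service by name and call one of its methods.

    `session` is expected to be a connected qi.Session. Any object with a
    compatible .service(name) method works, which keeps the sketch testable
    without a robot on the network.
    """
    tts = session.service("ALTextToSpeech")  # proxy to the speech service on the robot
    tts.say(text)                            # the synthesis runs on the robot, not the caller
    return text
```

With the SDK installed, the session would come from qi.Session() followed by session.connect("tcp://<robot-ip>:9559"). Motors, LEDs and memory would be reached the same way, each under its own service name.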

The obvious next move was to run everything on the robot. That fell over inside the first hour. NAO's onboard CPU is an Atom-class chip from the Obama administration; it can drive the motors and stream audio, but it cannot also run speech recognition and an LLM at conversational latency. So the robot keeps doing what it is good at (microphones, speakers, motors, the gesture timeline) and a MacBook on the same Wi-Fi runs the heavy work. The qi session is a long-lived TCP connection that the laptop's Python process opens to the robot.
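The listening path shows that split in miniature: the robot only records, and everything expensive happens off the robot. The sketch below is one plausible shape for it, not the project's code. It assumes ALAudioRecorder's startMicrophonesRecording and stopMicrophonesRecording calls; the fetch and transcribe callables, the fixed recording length and the remote file path are all hypothetical.

```python
import time

def listen_and_recognize_sketch(session, fetch, transcribe, seconds=5.0,
                                remote_path="/home/nao/turn.wav"):
    """Record one turn on the robot, then transcribe it off the robot.

    fetch(remote_path) -> local_path : hypothetical copy step (e.g. scp over SSH)
    transcribe(local_path) -> str    : hypothetical wrapper around a Whisper call
    remote_path                      : assumed writable location on the robot
    """
    recorder = session.service("ALAudioRecorder")
    # 16 kHz WAV; the tuple holds one 0/1 flag per head microphone.
    recorder.startMicrophonesRecording(remote_path, "wav", 16000, (0, 0, 1, 0))
    try:
        time.sleep(seconds)          # fixed-length turn, no voice-activity detection
    finally:
        recorder.stopMicrophonesRecording()
    local_path = fetch(remote_path)  # the audio leaves the robot here
    return transcribe(local_path).strip()
```

Passing the copy and transcription steps in as callables keeps the robot-side part small and lets the laptop-side part change (a different transfer method, a different speech model) without touching it.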

Topology

NAO v3 (naoqi · :9559 · mics · motors · LEDs)
    ── light: say · listen · gesture ──
MacBook (main.py · qi client · conversation loop)
    ══ heavy: audio · history · tokens ══   ↑ bottleneck lives here, not on the robot
Azure OpenAI (GPT-3.5 Turbo + Whisper)

The thin line is gestural speech RPC: a few hundred bytes of text per turn over Wi-Fi to the robot's qi session on port 9559. The thick line is the inference round trip to Cornell's hosted Azure OpenAI endpoint.
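A rough back-of-envelope comparison shows why the two lines differ in weight. The numbers are assumptions for illustration (a 5-second utterance, 16 kHz mono 16-bit audio, a one-sentence reply), not measurements from the demo, and payload size is only one part of the round trip.

```python
def pcm_bytes(seconds, sample_rate=16000, channels=1, bytes_per_sample=2):
    """Size of uncompressed PCM audio, ignoring the small WAV header."""
    return int(seconds * sample_rate * channels * bytes_per_sample)

# Hypothetical reply text; the robot only ever receives something like this.
reply = "Sure, I can tell you about that. What would you like to know first?"
text_payload = len(reply.encode("utf-8"))   # bytes sent to the robot to speak
audio_payload = pcm_bytes(5)                # bytes in one recorded utterance

print(f"text to robot: {text_payload} B, audio to Whisper: {audio_payload} B")
```

Under these assumptions the audio clip alone is orders of magnitude larger than the text the robot is asked to speak, before counting the chat history that rides along with every GPT request.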

The conversation loop is small enough to fit on screen. Connect to the robot, load a system prompt, then bounce audio through Whisper, the chat history through GPT, and the response through animated speech, in a loop until somebody says goodbye:

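import qi  # NAOqi Python binding

# ip is the robot's address; the three helper functions are defined elsewhere.
session = qi.Session()
session.connect(f"tcp://{ip}:9559")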

history = [{"role": "system", "content": open("prompt.txt").read().strip()}]

while True:
    user_input = listen_and_recognize(session)        # ALAudioRecorder → Whisper
    history.append({"role": "user", "content": user_input})

    response = get_gpt_response(history)              # Azure GPT-3.5 Turbo
    history.append({"role": "assistant", "content": response})

    nao_speak_with_animations(session, response)      # ALAnimatedSpeech, contextual

    if "goodbye" in user_input.lower():
        break
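The loop leans on get_gpt_response, which the listing does not show. A minimal sketch of what it could look like follows, assuming the openai Python package (version 1.0 or later) and its AzureOpenAI client. The explicit client parameter, the deployment name and the history cap are assumptions added here; the original function takes only the history.

```python
def trim_history(history, max_messages=20):
    """Keep the system prompt (first entry) plus the most recent messages.

    The cap is an addition for illustration: the history grows every turn,
    and the model's context window does not.
    """
    system, rest = history[:1], history[1:]
    recent = rest[-max_messages:] if max_messages > 0 else []
    return system + recent

def get_gpt_response(history, client, deployment="gpt-35-turbo"):
    """Send the running history to a chat deployment and return the reply text.

    `client` is assumed to be an openai.AzureOpenAI instance; `deployment`
    is a placeholder for whatever the Azure deployment is actually called.
    """
    completion = client.chat.completions.create(
        model=deployment,
        messages=trim_history(history),
    )
    content = completion.choices[0].message.content
    return (content or "").strip()   # content can be None on some responses
```

Because the full message list is sent on every call, the model itself stays stateless; holding context across turns is entirely the job of the history list on the laptop.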

The interesting line is nao_speak_with_animations. Setting bodyLanguageMode: "contextual" on ALAnimatedSpeech tells the robot to pick gestures from its library that match the meaning of what it is saying, so the words and the body move together.
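A sketch of what that helper might contain, assuming ALAnimatedSpeech's say(text, configuration) call reached through the qi session. The mode parameter and its default are conveniences added here, not part of the original signature.

```python
def nao_speak_with_animations(session, text, mode="contextual"):
    """Speak `text` on the robot with body language chosen by `mode`.

    `session` is expected to be a connected qi.Session.
    """
    animated_speech = session.service("ALAnimatedSpeech")
    # The second argument is a per-call configuration map; "contextual" asks
    # the robot to pick gestures that fit the words being spoken.
    animated_speech.say(text, {"bodyLanguageMode": mode})
```

Passing the mode per call, rather than setting it once globally, makes it easy to fall back to plain speech for a single utterance without changing the robot's default behavior.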

The demo went into the NSF grant package on time. The robot answered open questions, held context across multi-turn exchanges, and gestured along with what it was saying.

  1. Keith Evan Green is a professor at Cornell whose book Architectural Robotics: Ecosystems of Bits, Bytes, and Biology (MIT Press, 2016) coined the term and frames the field. mitpress.mit.edu/9780262035065.

  2. NAO v3 is a 58 cm humanoid by Aldebaran Robotics (later SoftBank Robotics, now part of United Robotics Group). Onboard CPU is an Intel Atom Z530 at 1.6 GHz with 1 GB RAM, four-microphone head array, 25 degrees of freedom. aldebaran.com/en/nao.

  3. NAOqi is the robot's middleware. Versions 2.5+ ship a Python 3 binding (the qi module); earlier releases were Python 2.7 only, which is the gotcha that catches everyone reaching for the SDK in 2024. SDK reference: doc.aldebaran.com/2-5/naoqi.