Show HN: openai-realtime-embedded-SDK Build AI assistants on microcontrollers

github.com

63 points by Sean-Der 7 days ago

Hi HN! This is an SDK for ESP32s (microcontrollers) that runs against OpenAI's new WebRTC service [0]. My hope is that people can easily add AI to lots of 'real' devices: wearable devices, speakers around the house, toys, etc. You don't have to write any code, just buy a device and set some env variables.

If you have any feedback or questions, I would love to hear them! I hope this kicks off a generation of new, interesting devices. If you aren't familiar with WebRTC, it can do some magical things. Check out WebRTC for the Curious [1]; I would love to talk about all the cool things it can do as well.

[0] https://platform.openai.com/docs/guides/realtime-webrtc

[1] https://webrtcforthecurious.com

kaycebasques 4 days ago

Took a bit of poking to figure out what the use case is. Doesn't seem to be mentioned in the README (usage section is empty) or the intro above. Looks like the main use case is speech-to-speech. Which makes sense since we're talking about embedded products, and text-to-speech (for example) wouldn't usually be relevant (because most embedded products don't have a keyboard interface). Congrats on the launch! Cool to see WebRTC applied to embedded space. Streaming speech-to-speech with WebRTC could make a lot of sense.

  • Sean-Der 4 days ago

    Sorry I forgot to put use cases in! Here are the ones I am excited about.

    * Making a toy. I have had a lot of fun putting a silly/sarcastic voice in toys. My 4 year old thinks it is VERY funny.

    * Smart Speaker/Assistant. I want to put one in each room. If I am in the kitchen it has a prompt to assist with recipes.

    I have A LOT more I want to do in the future. The microcontrollers I was using can't do video yet, BUT there are newer ESP32s that can. When I pull that in I can do smart cameras, then it gets really fun :)

    • kaycebasques 4 days ago

      "Use case" perhaps wasn't the right word for me to use. Maybe "applications" would have been a better word. What this enables is speech-to-speech applications in embedded devices. (From my quick scan) it doesn't seem to do anything around other ML applications that OpenAI could potentially be involved in, such as speech-to-text, text-to-speech, or computer vision.

      But yeah, once I figured out that this enables streaming speech-to-speech applications on embedded devices, then it's easy to think up use cases.

      • swatcoder 4 days ago

        It doesn't help that this was posted to HN with the "Usages" section of the README left blank. That alone would probably have addressed your question. The submission is just a little premature.

        Beyond that, while it does seem like its primary vision is speech-to-speech interfaces, it could easily be stretched to do things like send a templatized text prompt constructed from toggle states, sensor readings, etc. and (optimistically) ask for a structured response that could control lights or servos or whatever.

        Generally, this looks like a very early stage in a hobby project (the code practices fall short of my expectations for good embedded work, being presented as a library would be better than as an application, the README needs lots of work, etc), but something more sophisticated isn't too far out of reach.
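A minimal sketch of the templated-prompt idea described in the comment above, assuming the model is asked to reply with JSON. It is not taken from the SDK: the template wording, the field names (`light`, `servo_deg`), and both helper functions are hypothetical, and the network call to the model is left out entirely. The point is only that the request can be optimistic while the parsing stays strict.

```python
import json
from string import Template

# Hypothetical prompt template; wording and field names are assumptions.
PROMPT = Template(
    "Sensor readings: $readings\n"
    "Switch states: $toggles\n"
    'Reply with JSON only, e.g. {"light": "on", "servo_deg": 90}.'
)

ALLOWED_LIGHT = {"on", "off"}


def build_prompt(readings: dict, toggles: dict) -> str:
    """Fill the template with the current device state."""
    return PROMPT.substitute(
        readings=json.dumps(readings, sort_keys=True),
        toggles=json.dumps(toggles, sort_keys=True),
    )


def parse_actions(reply: str) -> dict:
    """Validate the model's reply and keep only fields that pass the checks."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return {}  # reply was not JSON at all: do nothing
    if not isinstance(data, dict):
        return {}
    actions = {}
    light = data.get("light")
    if isinstance(light, str) and light in ALLOWED_LIGHT:
        actions["light"] = light
    angle = data.get("servo_deg")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(angle, (int, float)) and not isinstance(angle, bool):
        if 0 <= angle <= 180:
            actions["servo_deg"] = int(angle)
    return actions
```

Anything the model returns outside the allowed values is dropped rather than passed to hardware, so a malformed or surprising reply results in no action.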

jonathan-adly 4 days ago

Here is a nice use-case. Put this in a pharmacy - have people hit a button, and ask questions about over-the-counter medications.

Really - in any physical place where people are easily overwhelmed, having something like that would be really nice.

With some work - you can probably even run RAG on the questions and answer esoteric things like where the food court is in an airport or where the ATM is in a hotel.
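As a rough illustration of the retrieval step in RAG mentioned above, here is a toy sketch. It ranks a handful of venue facts by plain word overlap with the question, which is a deliberately simple stand-in for the embedding search a real system would likely use. The function names and the sample facts are hypothetical.

```python
import re


def tokens(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, facts: list, k: int = 1) -> list:
    """Return the k facts sharing the most words with the question."""
    q = tokens(question)
    ranked = sorted(facts, key=lambda f: len(q & tokens(f)), reverse=True)
    return ranked[:k]


def build_rag_prompt(question: str, facts: list, k: int = 1) -> str:
    """Put the retrieved facts in front of the question as context."""
    context = "\n".join(retrieve(question, facts, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical venue facts, for illustration only.
FACTS = [
    "The food court is on level 2 near gate B.",
    "The ATM is in the hotel lobby next to the elevators.",
]
```

Calling `build_rag_prompt("Where is the ATM?", FACTS)` would place the lobby fact ahead of the question, so the model answers from local knowledge rather than guessing.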

  • swatcoder 4 days ago

    > Put this in a pharmacy - have people hit a button, and ask questions about over-the-counter medications.

    Even if you trust OpenAI's models more than your trained, certified, and insured pharmacist -- the pharmacists, their regulators, and their insurers sure won't!

    They've got a century of sunk costs to consider (and maybe even some valid concern over the answers a model might give on their behalf...)

    Don't expect anything like that in a traditional regulated medical setting any time soon.

    • dymk 4 days ago

      At the last few doctor’s appointments I’ve had, the clinician used a service to record and summarize the visit. It was using some sort of TTS and LLM to do so. It’s already in medical settings.

      • swatcoder 4 days ago

        Transcription and summary is a vastly different thing than providing medical advice to patients.

  • pixelsort 4 days ago

    Thanks for digging that out. Yes, that makes sense to me as someone who made a fully local speech-to-speech prototype with Electron, including VAD and AEC. It was responsive but taxing. I had to use a mix of specialty models over ONNX/WASM in the renderer and llama.cpp in the main process. One day, a multimodal model will just do it all.

roland35 4 days ago

Favorited and starred! I wonder if the real power of this could be in integrating large, low-cost sensor networks? I think with things like video and audio it might make more sense to bump up to a single-board Linux computer - but maybe the AI could help parse or create notifications based on sensor readings, and push events back to the real world (lights, solenoids, etc.)

I think it would help to either have a FreeRTOS example, or if you want to go real crazy, create a Zephyr integration! It would be a lot of fun to work on an AI and microcontroller combination - what a cool niche!

  • Sean-Der 4 days ago

    I’m very curious about what an LLM could deduce if you sent in lots of sensor data.

    I love my Airthings. I don’t know if it’s actionable, but it would be cool to see what conclusions would come up from sending CO2 and radon readings in. Could make understanding your home a lot easier.
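One way the sensor idea above could be tried, sketched under assumptions: rather than sending every raw sample, each sensor's series is reduced to min/mean/max before being placed in a text prompt. The sensor names, units, and the closing question are hypothetical, and nothing here talks to a real device or to the model.

```python
from statistics import mean


def summarize(readings: dict) -> str:
    """Reduce each sensor's samples to min/mean/max, one line per sensor."""
    lines = []
    for name, values in sorted(readings.items()):
        if not values:
            continue  # skip sensors that reported no samples
        lines.append(
            f"{name}: min={min(values):.1f} mean={mean(values):.1f} "
            f"max={max(values):.1f} n={len(values)}"
        )
    return "\n".join(lines)


def build_question(readings: dict) -> str:
    """Wrap the summary in a short question for the model."""
    return (
        "Here are summarized home air-quality readings:\n"
        + summarize(readings)
        + "\nWhat, if anything, stands out?"
    )
```

Summarizing first keeps the prompt short on a memory-constrained device, at the cost of hiding short spikes that a min/mean/max triple only partly captures.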

johanam 4 days ago

Love this! Excited to give it a try.

  • Sean-Der 4 days ago

    Thank you! If you run into problems shoot me a message. I really want to make this easy enough for everyone to build with it.

    I have talked with incredibly creative developers that are hampered by domain knowledge requirements. I hope to see an explosion of cool projects if we get this right :)