
This is HUGE in my opinion. Prior to this, to get near state-of-the-art speech recognition in your system or application, you either had to have (or hire) the expertise to build your own, or pay Nuance a significant amount of money to use theirs. Nuance has always been a "big bad" company in my mind. If I recall correctly, they've sued many of their smaller competitors out of existence and only do expensive enterprise deals. I'm glad their near-monopoly is coming to an end.

I think Google's API will usher in a lot of new innovative applications.



Other "state-of-the-art" speech recognition solutions already exist. For example, Microsoft has been offering it through its Project Oxford service. https://www.projectoxford.ai/speech


Also, CMUSphinx and Julius:

http://cmusphinx.sourceforge.net/

http://julius.osdn.jp/en_index.php

It is amazingly easy to create speech recognition without going out to any API these days.


I first learned about CMUSphinx from the [Jasper Project](https://jasperproject.github.io/). While Jasper provided an image for the Pi, I decided to go ahead and make a scripted install of CMUSphinx. I spent something like 2 frustrating days attempting to get it installed by hand in a repeatable fashion before giving up.

This was 2 years ago, so maybe it's simple now, but I didn't find it "amazingly easy" back then.

I do have a number of projects where I could definitely use a local speech recognition library. I have used [Python SpeechRecognition](https://github.com/Uberi/speech_recognition/blob/master/exam...) to essentially record and transcribe from a scanner. I wanted to take it further, but Google at the time limited the number of requests per day. Today's announcement seems to indicate they will be expanding their free usage, but a local setup would be much better. I'd like to deploy this in a place that might not have reliable Internet.
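
For anyone curious, the basic record-and-transcribe loop with that library only takes a few lines. A minimal sketch (recognize_google hits Google's free web API, so the request limits I mentioned apply, and Microphone needs PyAudio installed):

    import speech_recognition as sr

    r = sr.Recognizer()
    with sr.Microphone() as source:          # live capture; needs PyAudio
        r.adjust_for_ambient_noise(source)   # calibrate for background noise
        audio = r.listen(source)             # record until silence

    try:
        print(r.recognize_google(audio))     # sends audio to Google's free web API
    except sr.UnknownValueError:
        print("couldn't understand the audio")
    except sr.RequestError as e:
        print("request failed: {}".format(e))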


In my experience, the issues with building CMU Sphinx are mainly unspecified dependencies, undocumented version requirements, and forgetting to sacrifice the goat when the MSVC redistributable installer pops up.

We've written detailed, up-to-date instructions [1] for installing CMU Sphinx, and now also provide prebuilt binaries [2]!

If you're interested in not sending your audio to Google, CMU Sphinx and other libraries (like Kaldi and Julius) are definitely worth a second look.

[1] https://github.com/Uberi/speech_recognition/blob/master/refe... [2] https://github.com/Uberi/speech_recognition/tree/master/thir...
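
And as a taste of the offline route: once PocketSphinx is installed, transcription without any network access is just as short. A minimal sketch ("test.wav" is a placeholder file):

    import speech_recognition as sr

    r = sr.Recognizer()
    with sr.AudioFile("test.wav") as source:  # placeholder WAV file
        audio = r.record(source)              # read the entire file

    # recognize_sphinx decodes locally via PocketSphinx --
    # nothing ever leaves your machine
    print(r.recognize_sphinx(audio))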


Yeah I'm gonna leave a reply here just in case I need to find this again (already opened tabs, but you never know). This might be big for a stalled project at work. If this can un-stall that, I'll sure owe you a beer ;)


Would you mind submitting this documentation to CMU? I get the feeling they'd love to at least host a link to it, or something similar, to enhance their own documentation.


Thanks for providing this. Will definitely give it a fresh look.


That sounds like my experience with it from about 5 years ago. I gave up on it too. It also didn't help that CMUSphinx has had more than one version in development in different languages.


I would note that as a positive... But yeah, 5 years ago things were much, much rougher (which is partly why I don't think it got much press).

But these days, if you go all the way through their tutorial and give it a proper read, it's very doable to set up.


Unfortunately, the situation hasn't improved much. Besides, even if you get it set up, the quality of the recognition isn't even close to Google's.


As someone who's worked with a lot of these engines, I'd say Nuance and IBM are the only really high-quality players in the space. CMUSphinx and Julius are fine for low-volume operations where you don't need really high accuracy, but if you want high accuracy, neither comes close in my experience.


Right, but they do offer you a fantastic starting point. If Nuance is 100%, I'd say CMUSphinx is at least 40%.

Also, they give you the tools and knowledge to build better models (and explain the theory), which is where most of the competitive advantage is IMHO.


As someone who has actually done objective tests, Google are by far the best, Nuance are a clear second. IBM Watson is awful though. Actually the worst I've tested.


Do you have a report of your tests? I'm interested in using speech recognition, but there are so many start-ups and big players that it would be quite time-consuming to do a quality/price analysis myself.


For the "dialect" of Spanish that we speak in Argentina, Watson misses every single word. So, to me, CMUSphinx is valuable in that it allows me to tweak it, while IBM miserably fails at every word. It must've been trained on "neutral" Spanish from Spain or Mexico.

Google's engine also works fine (I've been trying it on phones), but the pricing may or may not be a deal-breaker.


Is Julius really state-of-the-art? It looks like they use n-gram language models and HMMs, which were the methods that achieved SotA 5+ years ago. My understanding is that Google and Microsoft are using end-to-end (or nearly end-to-end) neural network models; these outperformed the older methods a few years ago. Not sure how CMUSphinx works under the hood.


They might not be considered state-of-the-art (if you consider both approaches in the same category), but they are definitely a valid approach to voice recognition, one that works surprisingly well.

CMUSphinx is not a neural-network-based system; it uses traditional acoustic and language modeling.


Check out https://github.com/yajiemiao/eesen for an LSTM- and CTC-based library instead of HMMs.


CMUSphinx is really easy to set up, and then being able to train it for one's specific domain probably beats state of the art with one-size-fits-all training.


> It is amazingly easy to create speech recognition without going out to any API these days.

Not really. The hard part is not the algorithm; it's the millions of training samples that have gone into Google's system. They pretty much have every accent and way of speaking covered, which is what allows them to deliver such a high-accuracy, speaker-independent system.

CMUSphinx is remarkable as an academic milestone, but in all honesty it's basically unusable from a product standpoint. If your speech recognition is only 95% accurate, you're going to have a lot of very unhappy users. Average Joes are used to things like microwave ovens, which work 99.99% of the time, and expect new technology to "just work".

CMUSphinx is also an old algorithm; AFAIK Google is neural-network based.


Eesen looks promising; it uses LSTMs and CTC rather than older tech.

https://github.com/yajiemiao/eesen

Baidu open sourced their CTC implementation

https://github.com/baidu-research/warp-ctc

I think within 1-2 years we will have an easy-to-install OSS speech recognition library, with accurate pretrained networks not far off from Google/Alexa/Baidu, running locally rather than in the cloud. Can't wait.
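
For anyone wondering what CTC actually buys you: it sums, with a dynamic program, the probabilities of every frame-level alignment that collapses to the target labelling, so the network never needs pre-segmented training data. A toy numpy sketch of the forward recursion (not warp-ctc's API, and it skips the log-space scaling a real implementation needs):

    import numpy as np

    def ctc_loss(probs, labels, blank=0):
        # probs: (T, K) per-frame softmax outputs; labels: target sequence
        T, K = probs.shape
        ext = [blank]                 # interleave blanks: [_, l1, _, l2, _]
        for l in labels:
            ext += [l, blank]
        S = len(ext)

        alpha = np.zeros((T, S))      # alpha[t, s]: prob of all valid prefixes
        alpha[0, 0] = probs[0, ext[0]]
        if S > 1:
            alpha[0, 1] = probs[0, ext[1]]

        for t in range(1, T):
            for s in range(S):
                a = alpha[t - 1, s]
                if s > 0:
                    a += alpha[t - 1, s - 1]
                # skip transition, unless blank or a repeated label
                if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                    a += alpha[t - 1, s - 2]
                alpha[t, s] = a * probs[t, ext[s]]

        # valid paths end on the last label or the trailing blank
        p = alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)
        return -np.log(p)

    # 4 frames, alphabet {blank, 'a', 'b'}, target "ab"
    frames = np.full((4, 3), 1.0 / 3.0)
    print(ctc_loss(frames, [1, 2]))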


From the Microsoft Project Oxford Speech API link:

Speech Intent Recognition

... the server returns structured information about the incoming speech so that apps can easily parse the intent of the speaker, and subsequently drive further action. Models trained by the Project Oxford LUIS service are used to generate the intent.

Do others offer something like this?


Microsoft LUIS is almost identical to the intent classification and entity extraction in the Alexa Skills Kit, but it's easier to use because you can pipe in your own text from any source instead of having to use a specific speech recognition engine. LUIS also has a pretty nice web interface that prompts you to label utterances it's seen but had trouble with.
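
To give a flavour of the "pipe in your own text" part: querying LUIS is just a GET that returns ranked intents and extracted entities as JSON. A sketch from memory of the v1 API (the endpoint and response fields are my recollection and may have changed; the app ID and key are placeholders):

    import requests

    # Placeholder credentials -- substitute your own LUIS app ID and key
    APP_ID = "your-luis-app-id"
    KEY = "your-subscription-key"

    resp = requests.get(
        "https://api.projectoxford.ai/luis/v1/application",
        params={"id": APP_ID, "subscription-key": KEY,
                "q": "turn off the kitchen lights"},
    )
    result = resp.json()
    print(result["intents"][0]["intent"])  # highest-scoring intent
    print(result["entities"])              # extracted entities, if any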


But Google's is by far the best.


Exactly, this is what people miss when they try to compare Google's speech recognition to other services. Google uses deep neural networks to continuously train and improve the quality of their speech recognition, and they get their training data from the hundreds of millions of Android users around the world using speech-to-text every day. No other company has a comparable amount of training data, continuously being expanded. http://googleresearch.blogspot.ca/2015/09/google-voice-searc...


Kaldi is probably the best option now. https://github.com/kaldi-asr/kaldi


Interesting - I saw this as a defensive response to the rising number of developers using Amazon's Alexa APIs, rather than anything related to Nuance.


It's probably been on their roadmap for a while, since before Alexa came out. Re: Alexa/Echo - I think there's an opportunity for someone to manufacture cheap USB array mics for far-field capture.

Still, having this paid and cloud-based puts a limit on the types of things you'd use it for. I will use it in my own apps for now, but will swap to an OSS speech recognition library running locally as soon as one emerges that is good enough.


You're right - this could lead to a lot of new innovation as a bunch of developers who wouldn't have bothered before can now start hacking away to see what they can do.

I've been thinking a lot lately about where the next major areas of technology-driven disruption might be in terms of employment impact, and things like this make me wonder how long it will be before call centers stacked wall to wall with customer service reps become a relic of the past...


If it's anything like Google's other APIs, people will build applications on top of it, and then Google will decide to shut down the API with no notice.

Fun to play with, but don't expect it to last...


That's incorrect. This is a Google Cloud Platform service, and when it reaches General Availability (GA) it will be subject to our Deprecation Policy, just like Compute Engine, Cloud Storage, etc. That policy requires us to give at least a one-year heads-up.

Disclosure: I work on Compute Engine.


It's nice there's a policy around that, but I can understand the fears of someone considering using this to start a product - or even worse, a business.

Google has a history of shutting down useful products; why should people trust this one for long-term integration?


Because we don't have a history of violating promises like this when they're made in writing? Seriously, I'd love to call us just "Cloud Platform" so you don't have to think "oh yeah, those guys cancelled Reader on me", but if you look at the Cloud products, we don't play games with this (partly because we hold ourselves to our binding Deprecation Policy, but mostly because we really care).


The Google Search API, Autocomplete, Finance, and Voice were all closed with tons of active users. I'm not blaming Google; they were acting in their best interest, but the consequence is less enthusiasm for building software that depends on their APIs.

IMO a better option for Google, when considering closing an API, is to enforce payment and hike the price enough to justify maintaining it. Then shut it down for good only if enough users drop out at the higher price.


Even outside of services with a formal deprecation policy, Google rarely shuts anything down with no notice (their frequently cited shutdowns had long notice).


Has Google sued many of their smaller competitors out of business?


No, and they don't need to. They have too many other advantages -- low customer acquisition cost via already-present cloud customers, economies of scale, ease of hiring the best talent, natural integrations via their Android platform... who needs nastiness when you have all these amazing benefits!


Isn't Google's strategy typically to just start offering the same services as their smaller competitors but for free and then let them starve? ... Kind of like what's probably happening here? Sounds like this is terrible news for Nuance, for example.


No



