For basic use, Pocketsphinx.js is easy to set up. The included demo has everything you need to capture audio from the browser and parse it for keywords. Pocketsphinx comes with an overwhelming set of options. I found myself glossing over most of them and copying the defaults from the demo.
The most important configuration step is setting up your dictionary. The dictionary tells Pocketsphinx which sound combinations represent each keyword you’re looking for. Sphinx uses the CMU Pronouncing Dictionary, which uses alphabetic codes to represent the 39 sounds (phonemes) your mouth makes to produce words. For example, “EY” represents a long A sound and “T” represents a T sound, so “EY T” represents the sounds that make up the word “ate”. The dictionary website has a search engine for looking up the representations of words, or you can write your own. There is a lot of room for conflicts between homophones: the previous example, “EY T”, could mean “eight” as well as “ate”. You need to be wary of this in your programming. Recognition isn’t perfect either. At one point I had to remove the word “right” from my dictionary because 90% of the time it was recognized as “flight”.
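To make the homophone problem concrete, here is a minimal sketch that checks a word list for entries sharing the same pronunciation before you hand it to the recognizer. The word list and the `findHomophones` helper are hypothetical and written for illustration only. The `[word, pronunciation]` pair layout is an assumption about how you store your dictionary, so adjust it to match your own setup. No Pocketsphinx.js calls are made here.

```javascript
// Hypothetical dictionary: [word, pronunciation] pairs using CMU phoneme codes.
const wordList = [
  ["ATE", "EY T"],
  ["EIGHT", "EY T"],
  ["RIGHT", "R AY T"],
  ["FLIGHT", "F L AY T"],
];

// Group words by pronunciation and return only the groups that contain
// more than one word, i.e. the words the recognizer cannot tell apart.
function findHomophones(words) {
  const byPronunciation = new Map();
  for (const [word, pronunciation] of words) {
    // Normalize whitespace so "EY  T" and "EY T" count as the same key.
    const key = pronunciation.trim().split(/\s+/).join(" ");
    if (!byPronunciation.has(key)) {
      byPronunciation.set(key, []);
    }
    byPronunciation.get(key).push(word);
  }
  return [...byPronunciation.entries()].filter(([, group]) => group.length > 1);
}

for (const [pronunciation, group] of findHomophones(wordList)) {
  console.log(`"${pronunciation}" is shared by: ${group.join(", ")}`);
}
```

A check like this only catches exact duplicates such as “ate” and “eight”. Words that merely sound similar, like “right” and “flight”, have different phoneme sequences, so the only way I know to find those is to test the recognizer and see what it confuses.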