Very interesting comments. Yes, we absolutely want to do comparisons to ELMo. It's a little tricky to do so on our datasets, since ELMo isn't really a complete method on its own, but more an addendum to existing methods. In the future we hope to do seq2seq and sequence labeling studies, and we can then ensure we pick datasets that the ELMo paper covered.
Using char tokens can definitely be helpful, as can sub-words. It's something we've been working on too, and hope to show results of this in the future.
I mainly disagree with your view of end-to-end training. In computer vision we pretty much gave up on trying to re-use hyper-columns without fine-tuning, because the fine-tuning just helps so much. It's really no trouble doing the fine-tuning - in fact the consistency of using a single model form across so many different datasets is really convenient and helpful for doing additional levels of transfer learning.
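To give a sense of how little extra machinery the fine-tuning needs, here's a rough Python sketch of layer-wise (discriminative) learning rates. This isn't our actual fastai code, just an illustration, and the group names and decay factor are only there for the example:

```python
from torch import optim

def discriminative_optimizer(layer_groups, base_lr=1e-3, decay=2.6):
    """Give each layer group its own learning rate, smallest for the earliest
    (most general) layers. `layer_groups` is an ordered list of nn.Modules,
    from the first encoder layer up to the classifier head."""
    n = len(layer_groups)
    param_groups = [
        {"params": g.parameters(), "lr": base_lr / decay ** (n - 1 - i)}
        for i, g in enumerate(layer_groups)
    ]
    return optim.SGD(param_groups, momentum=0.9)
```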
Thanks for the note about table 7 - it's actually an error (it should be 5.57, not 4.57; Sebastian is in the process of uploading a corrected version).
Perhaps even more interesting than a comparison would be modifications to ULMFiT to incorporate good ideas from the AllenNLP ELMo paper.
The learned weighting of representation layers seems like a decent candidate, as does giving the model the flexibility to use something other than a concatenated [mean / max / last state] representation of the final LSTM output layer (as is the case in some of ELMo's task models). I'm personally curious about using an attention mechanism in conjunction with something like ELMo's gamma task parameter (regularizer) for learning a weighted combination of outputs, but I haven't been able to get that to work well in practice.
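Roughly what I have in mind, as a hand-wavy PyTorch sketch (the module and attribute names here are mine, not code from either paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScalarMixConcatPool(nn.Module):
    """Learned scalar mix over LM layer outputs (ELMo-style weights plus a
    task-specific gamma), followed by ULMFiT-style concat pooling of
    [last state, max pool, mean pool]."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))  # softmax-normalised mixing weights
        self.gamma = nn.Parameter(torch.ones(()))              # task-specific scaling

    def forward(self, layer_outputs):
        # layer_outputs: list of num_layers tensors, each (batch, seq_len, hidden)
        weights = F.softmax(self.scalars, dim=0)
        mixed = self.gamma * sum(w * h for w, h in zip(weights, layer_outputs))
        last = mixed[:, -1]                  # final time step
        max_pool = mixed.max(dim=1).values   # max over time
        mean_pool = mixed.mean(dim=1)        # mean over time
        return torch.cat([last, max_pool, mean_pool], dim=1)  # (batch, 3 * hidden)
```

The softmax keeps the layer mixture normalized, and gamma gives the task model some freedom to rescale the mixed representation before pooling.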
The dataset the ELMo model is trained on might also be preferable to WikiText-103 for practical English tasks, although you lose the nice multilingual benefits you get from working with WikiText-103.
In general it seems like the setup described in the ELMo paper is simply not designed to work at very low N, because the weights of the (often complex) task models used in ELMo's benchmarks are learned entirely from scratch. That's not possible without a decent amount of labeled training data.
Anyhow, I thought the paper was very well put together, and definitely an enjoyable read. Hope you and Sebastian collaborate on future papers, as good things certainly came of this one!
I just wanted to clear up my comments on fine-tuning. These LMs are huge. The ELMo paper has 300-dimensional embeddings, yours has 400 (which, btw, should probably be controlled in a comparison). As an engineer, I don't really want to deploy a fine-tuned LM for every task I have. Especially on smartphones, I can barely deploy one of these.
The obvious answer is that I should just train a single joint model.
That's great, but when you retrain a model, even if you get similar accuracy, your actual predictions change. It's basically why same-model ensembles help.
So if I'm trying to improve predictions for a single task but I have a joint model, then I have to deal with a whole pile of churn that I wouldn't have if I had separate models.
This doesn't show up in academic metrics, but people care when things that used to work stop working for no real reason, even if an equal number of new things start working.
So, I'm not saying we shouldn't fine-tune things; it's that I have a set of engineering challenges that make fine-tuning less ideal, and I'm curious how much we can get away with sharing. There are plenty of CV papers which indicate that the very first layers basically don't benefit from fine-tuning because they are so general. Is that true for NLP as well, or are word embeddings already quite domain-specific?
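Concretely, the kind of experiment I'd love to see is just freezing the lowest layers and fine-tuning the rest, something like the sketch below (the attribute names are made up; adapt them to the actual encoder):

```python
import torch.nn as nn

def freeze_lower_layers(encoder: nn.Module, n_frozen: int):
    """Freeze the embedding layer plus the first n_frozen RNN layers, leaving
    the upper layers and task head trainable. Assumes `encoder.embedding` and
    `encoder.rnns` exist; rename to match the real model."""
    for p in encoder.embedding.parameters():
        p.requires_grad = False
    for layer in encoder.rnns[:n_frozen]:
        for p in layer.parameters():
            p.requires_grad = False
    # Then pass only trainable params to the optimizer, e.g.
    # optim.Adam(p for p in encoder.parameters() if p.requires_grad)
```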