Very interesting comments. Yes, we absolutely want to do comparisons to ELMo. It's a little tricky to do so on our datasets, since ELMo isn't really a complete method on its own, but more an addendum to existing methods. In the future we hope to do seq2seq and sequence labeling studies, and we can then ensure we pick datasets that the ELMo paper covered.
Using char tokens can definitely be helpful, as can sub-words. It's something we've been working on too, and hope to show results of this in the future.
I mainly disagree with your view of end-to-end training. In computer vision we pretty much gave up on trying to re-use hyper-columns without fine-tuning, because the fine-tuning just helps so much. It's really no trouble doing the fine-tuning - in fact the consistency of using a single model form across so many different datasets is really convenient and helpful for doing additional levels of transfer learning.
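To give a sense of how little extra machinery the fine-tuning needs, here's a rough Python sketch of layer-wise (discriminative) learning rates. This isn't our actual fastai code, just an illustration, and the group names and decay factor are only there for the example:

```python
from torch import optim

def discriminative_optimizer(layer_groups, base_lr=1e-3, decay=2.6):
    """Give each layer group its own learning rate, smallest for the earliest
    (most general) layers. `layer_groups` is an ordered list of nn.Modules,
    from the first encoder layer up to the classifier head."""
    n = len(layer_groups)
    param_groups = [
        {"params": g.parameters(), "lr": base_lr / decay ** (n - 1 - i)}
        for i, g in enumerate(layer_groups)
    ]
    return optim.SGD(param_groups, momentum=0.9)
```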
Thanks for the note about table 7 - it's actually an error (it should be 5.57, not 4.57; Sebastian is in the process of uploading a corrected version).
Perhaps even more interesting than a comparison would be modifications to ULMFiT to incorporate good ideas from the AllenNLP ELMo paper.
The learned weighting of representation layers seems like a decent candidate, as does giving the model the flexibility to use something other than a concatenated [mean / max / last state] representation of the final LSTM output layer (as is the case in some of ELMo's task models). I'm personally curious about using an attention mechanism in conjunction with something like ELMo's gamma task parameter (regularizer) for learning a weighted combination of outputs, but I haven't been able to get that to work well in practice.
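Roughly what I have in mind, as a hand-wavy PyTorch sketch (the module and attribute names here are mine, not code from either paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScalarMixConcatPool(nn.Module):
    """Learned scalar mix over LM layer outputs (ELMo-style weights plus a
    task-specific gamma), followed by ULMFiT-style concat pooling of
    [last state, max pool, mean pool]."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))  # softmax-normalised mixing weights
        self.gamma = nn.Parameter(torch.ones(()))              # task-specific scaling

    def forward(self, layer_outputs):
        # layer_outputs: list of num_layers tensors, each (batch, seq_len, hidden)
        weights = F.softmax(self.scalars, dim=0)
        mixed = self.gamma * sum(w * h for w, h in zip(weights, layer_outputs))
        last = mixed[:, -1]                  # final time step
        max_pool = mixed.max(dim=1).values   # max over time
        mean_pool = mixed.mean(dim=1)        # mean over time
        return torch.cat([last, max_pool, mean_pool], dim=1)  # (batch, 3 * hidden)
```

The softmax keeps the layer mixture normalized, and gamma gives the task model some freedom to rescale the mixed representation before pooling.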
The dataset the ELMo model is trained on might also be preferable to WikiText-103 for practical English tasks, although you lose the nice multilingual benefits you get from working with WikiText-103.
In general it seems like the setup described in the ELMo paper is simply not designed to work at very low N, because the weights of the (often complex) task models used in ELMo's benchmarks are learned entirely from scratch. That's not possible without a decent amount of labeled training data.
Anyhow, I thought the paper was very well put together, and definitely an enjoyable read. Hope you and Sebastian collaborate on future papers, as good things certainly came of this one!
I just wanted to clear up my comments on fine-tuning. These LMs are huge. The ELMo paper has 300-dimensional embeddings, yours has 400 (which, btw, should probably be controlled in a comparison). As an engineer, I don't really want to deploy a fine-tuned LM for every task I have. Especially on smartphones, I can barely deploy one of these.
The obvious answer is that I should just train a single joint model.
That's great, but when you retrain a model, even if you get similar accuracy, your actual predictions change. It's basically why same-model ensembles help.
So if I'm trying to improve predictions for a single task but I have a joint model, then I have to deal with a whole pile of churn that I wouldn't have if I had separate models.
This doesn't show up in academic metrics, but people care when things that used to work stop working for no real reason, even if an equal number of new things start working.
So, I'm not saying we shouldn't fine-tune things; it's that I have a set of engineering challenges that make fine-tuning less ideal, and I'm curious how much we can get away with sharing. There are plenty of CV papers which indicate that the very first layers basically don't benefit from fine-tuning because they are so general. Is that true for NLP as well, or are word embeddings already quite domain-specific?
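Concretely, the kind of experiment I'd love to see is just freezing the lowest layers and fine-tuning the rest, something like the sketch below (the attribute names are made up; adapt them to the actual encoder):

```python
import torch.nn as nn

def freeze_lower_layers(encoder: nn.Module, n_frozen: int):
    """Freeze the embedding layer plus the first n_frozen RNN layers, leaving
    the upper layers and task head trainable. Assumes `encoder.embedding` and
    `encoder.rnns` exist; rename to match the real model."""
    for p in encoder.embedding.parameters():
        p.requires_grad = False
    for layer in encoder.rnns[:n_frozen]:
        for p in layer.parameters():
            p.requires_grad = False
    # Then pass only trainable params to the optimizer, e.g.
    # optim.Adam(p for p in encoder.parameters() if p.requires_grad)
```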