I think you continue to miss both of my points. If you're measuring apples falling to the ground, of course you wouldn't expect the effect size to diminish with higher N. But social/psychological studies are not like a physics study: you have far less control over the variables. This is especially true with a large N, where the environment is noisier and more variables coalesce; typically, you get less precision in what you can say about any individual datapoint.
I wouldn't argue that a larger effect size here wouldn't be more impressive; of course it would. I'm just saying that a small effect size does not diminish the meaningfulness of a study of this kind, and that it's to be expected for these kinds of studies. There's an effect that we're very confident is real, works in both directions, and has real-world implications.
A noisier environment doesn't mean you expect smaller effects. It means your measurement is unstable. This problem also occurs in physics: using the standard approximation of gravity, 32 feet per second per second, the extra time required to fall 9 feet instead of 4 feet is exactly 1/4 second. Should you actually try the experiment, you'll quickly notice that your measured time varies from attempt to attempt. There is a government office that (among other duties) measures the weight of a coin (the same physical coin) every day and records the result. There's variation every day, and some days are anomalous.
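To make the arithmetic concrete, here's a quick Python sketch; the only inputs are the 32 ft/s^2 approximation and the kinematics t = sqrt(2h/g):

```python
import math

G = 32.0  # ft/s^2, the standard approximation used above

def fall_time(height_ft):
    """Time to fall height_ft feet from rest: t = sqrt(2h / g)."""
    return math.sqrt(2.0 * height_ft / G)

print(fall_time(4.0))                    # 0.5 s
print(fall_time(9.0))                    # 0.75 s
print(fall_time(9.0) - fall_time(4.0))   # exactly 0.25 s
```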
What larger N does is enable you to see past the noise. With a large sample, the effect of the noise in your measurements averages out toward zero, letting you estimate the effect you're looking for more accurately. So over 200,000 apple drops, I should see an average fall-time discrepancy very close to 0.25 seconds; whereas with 2 apple drops, I might for whatever reason measure the discrepancy as 2/3 of a second. That 2/3-second estimate is way off because of the small N.
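A minimal simulation of that, assuming (arbitrarily) a 0.3-second timing error on each measured drop:

```python
import random

random.seed(1)

TRUE_GAP = 0.25   # true fall-time difference, in seconds
NOISE_SD = 0.3    # assumed per-measurement timing error (made up for illustration)

def measured_gap(n_drops):
    """Average of n noisy measurements of the true 0.25 s gap."""
    samples = [random.gauss(TRUE_GAP, NOISE_SD) for _ in range(n_drops)]
    return sum(samples) / n_drops

print(measured_gap(2))        # can be wildly off the true value
print(measured_gap(200_000))  # lands very close to 0.25
```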
If, as you work with larger and larger sample sizes, the effect you're measuring recedes steadily to zero, the obvious conclusion is that it's all noise.
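And that's easy to see in the same toy setup: if the true effect is exactly zero, the apparent "effect" is pure measurement noise and shrinks roughly like 1/sqrt(N):

```python
import random

random.seed(2)

def estimated_effect(n):
    """Mean of n noisy measurements when the true effect is exactly zero."""
    return sum(random.gauss(0.0, 0.3) for _ in range(n)) / n

for n in (10, 1_000, 100_000):
    print(n, estimated_effect(n))  # the estimate recedes toward zero as n grows
```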
However! We started this by talking about a different thing entirely. You say this:
> There's an effect that we're very confident is real, works in both directions, and has real-world implications.
This study has immense statistical power and a minuscule effect size. The immense statistical power means, yes, "that we're very confident [the effect] is real". That's measured (from a traditional perspective) by the p-value.
The effect size measures the real-world implications. A very small effect size means that the real-world implications are likewise very small.
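A toy illustration of that split, with made-up numbers: a hypothetical 0.02-standard-deviation shift between two groups of 200,000 people each. The test comes back emphatically significant, while the effect itself stays practically invisible:

```python
import math
import random

random.seed(0)

N = 200_000          # per group
TRUE_EFFECT = 0.02   # hypothetical shift, in standard-deviation units

# Control group centered at 0; treatment group shifted by a minuscule amount.
control   = [random.gauss(0.0, 1.0) for _ in range(N)]
treatment = [random.gauss(TRUE_EFFECT, 1.0) for _ in range(N)]

diff = sum(treatment) / N - sum(control) / N

# Two-sample z-test; both variances are 1 by construction.
se = math.sqrt(1.0 / N + 1.0 / N)
z = diff / se
p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value

print(f"estimated effect: {diff:.4f} SD")  # around 0.02: negligible in practice
print(f"p-value: {p:.1e}")                 # vanishingly small: 'the effect is real'
```

The p-value answers "is there an effect at all?"; the effect size answers "is it big enough to matter?". With enough N you can drive the first to near-certainty while the second stays trivial.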
As a toy example, suppose I do a study finding that feeding children between the ages of 4 and 7 meat with bones in it vs meat without bones increases their height as adults by three feet (p < 0.9). The real-world implications are huge. Our confidence in the study is low.
The physics measurements are interesting for different reasons. They measure very objective things; even when the measurements vary, they vary in ways we can conceivably calculate. Some physicists even raise the idea that certain constants aren't so constant.
But appreciating statistical data from those kinds of experiments is still vastly different from appreciating data from studies that touch on psychology and social effects. Your height/nutrition example is convenient because we can all appreciate an effect expressed in objective units, such as centimeters, that we can see with our own eyes. It's much harder to weigh the effect that, say, emotional states have in pure numbers.
I could continue this discussion endlessly, but it's probably not going to get anywhere.