I think most people don't realise that "10 million" records is small, for a computer.
(That said, I have had to deal with code that included an O(n^2) de-duplication where the test data had n ~= 20,000, causing app startup to take 20 minutes; the other developer insisted there was no possible way to speed this up, later that day I found the problem, asked the CTO if there was a business reason for that de-duplication, removed the de-duplication, and the following morning's stand-up was "you know that 20 minute startup you said couldn't possibly be sped up? Yeah, well, I sped it up and now it takes 200ms")
I thought you were going to say to reduced O(n^2) to O(n*log(n)), but you just deleted the operation. Normally I'd say that's great, but just how much duplicate data is being left around now? Is that OK?
Each element was about, oh I can't remember exactly, perhaps 50 bytes? It wasn't a constant value, there could in theory be a string in there, but those needed to be added manually and when you have 20,000 of them, nobody would.
Also, it was overwhelmingly likely that none of the elements were duplicates in the first place, and the few exceptions were probably exactly one duplicate.
I'm kind of surprised no one just searched for "deduplication algorithm". If it was absolutely necessary to get this 1MB dataset to be smaller (when was this? Did it need to fit in L2 on a pentium 3 something?), then it could probably have been deduped + loaded in 300-400ms.
Most engineers that I've worked with that die on a premature optimization molehill like you describe also make that molehill as complicated as possible. Replacing the inside of the nested loop with a hashtable probe certainly fits the stereotype.
> I'm kind of surprised no one just searched for "deduplication algorithm".
Fair.
To set the scene a bit: the other developer at this point was arrogant, not at all up to date with even the developments of his preferred language, did not listen to or take advice from anyone.
I think a full quarter of my time there was just fire-fighting yet another weird thing he'd done.
> If it was absolutely necessary to get this 1MB dataset to be smaller
It was not, which is why my conversation with the CTO to check on if it was still needed was approximately one or two sentences from each of us. It's possible this might have been important on a previous pivot of the thing, at least one platform shift before I got there, but not when I got to it.
I think most people don't realise that "10 million" records is small, for a computer.
(That said, I have had to deal with code that included an O(n^2) de-duplication where the test data had n ~= 20,000, causing app startup to take 20 minutes; the other developer insisted there was no possible way to speed this up, later that day I found the problem, asked the CTO if there was a business reason for that de-duplication, removed the de-duplication, and the following morning's stand-up was "you know that 20 minute startup you said couldn't possibly be sped up? Yeah, well, I sped it up and now it takes 200ms")