Unless the firmware programmer is completely incompetent, using layers would be just one or a few indirect lookups. Yes, transparency would add more, but that would still be maybe only a dozen more CPU instructions. The added latency would be measured in tens of nanoseconds, not milliseconds.
Compare that to debouncing, which is at least five milliseconds and often more to account for the odd bad switch.
You'd be surprised! I know I was when I discovered how slow layer lookups can be. You need to compare the keycode, a 16-bit integer, typically on an 8-bit MCU, so that's going to take more than one instruction. To do this, you first have to look up the key, which is in PROGMEM, so it's a little more involved than just reading from memory (1 more cycle, as far as I remember). Then you do this for every layer, top to bottom, until you find a non-transparent one. So all of the previous, times 32 in the worst case. For every single key.
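To make that concrete, here is a minimal sketch of the naive top-to-bottom lookup. The names (`keymap`, `naive_lookup`, `KEY_TRANSPARENT`) and the active-layer bitmask are made up for illustration and not taken from any particular firmware. A plain array stands in for PROGMEM so the sketch compiles anywhere; on AVR the read would go through `pgm_read_word()` instead.

```c
#include <stdint.h>

#define NUM_LAYERS 32
#define NUM_KEYS 87
#define KEY_TRANSPARENT 0xFFFFu /* hypothetical "fall through" marker */
#define KEY_NONE 0x0000u

/* Stand-in for a PROGMEM keymap (assumption: plain RAM for portability). */
static uint16_t keymap[NUM_LAYERS][NUM_KEYS];

/* Counts keymap reads so the cost of a lookup can be observed. */
static uint32_t lookup_count;

/* Helper for the sketch: mark every key on every layer as transparent. */
void keymap_fill_transparent(void) {
    for (int layer = 0; layer < NUM_LAYERS; layer++) {
        for (int key = 0; key < NUM_KEYS; key++) {
            keymap[layer][key] = KEY_TRANSPARENT;
        }
    }
}

/* Walk the active layers from the top down until a non-transparent
 * keycode is found. Worst case: one read and one 16-bit compare per layer. */
uint16_t naive_lookup(uint32_t active_layers, uint8_t key) {
    for (int layer = NUM_LAYERS - 1; layer >= 0; layer--) {
        if (!(active_layers & (UINT32_C(1) << layer))) {
            continue; /* layer is switched off */
        }
        lookup_count++;
        uint16_t code = keymap[layer][key];
        if (code != KEY_TRANSPARENT) {
            return code;
        }
    }
    return KEY_NONE;
}
```

With all 32 layers active and only layer 0 defining the key, a single call reads the keymap 32 times before it returns.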
For comparison, if we take a lookup from PROGMEM as the baseline, and assume a keyboard with 87 keys, a naive implementation of layering will do this lookup 31 more times per key than without layers. Even if this were a single instruction, we would still be looking at 31 more of them, and they are much more expensive than that. Luckily, almost no firmware I know of does this, but few manage to do it fast. Just looking up from PROGMEM, we are looking at almost 2700 extra lookups (87 × 31 = 2697), so at least that many extra cycles. Add the comparison, the loop increment, the stop-condition check, and we are easily looking at ~10k extra cycles, which is getting into millisecond territory. A single key pressed may not take all that much time, but when you keep hammering on the keyboard, it adds up. Especially if you want to handle not just press & release, but all states, including being held and continuing to be released, because in this case, you can't cheat by only doing the lookup for keys that were pressed or released - you have to do it for all of them.
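The arithmetic can be checked with a couple of helper functions. This is only a sketch: the function names are made up, and the 16 MHz clock used in the note below is my assumption (a common speed for AVR-based keyboards), not a number from any specific board.

```c
#include <stdint.h>

/* Extra keymap reads per full scan compared to a single-layer keymap:
 * every key needs (layers - 1) additional reads in the worst case. */
uint32_t extra_lookups_per_scan(uint32_t keys, uint32_t layers) {
    return keys * (layers - 1);
}

/* Convert a cycle count to whole microseconds at a given clock speed. */
uint32_t cycles_to_microseconds(uint32_t cycles, uint32_t cpu_hz) {
    return (uint32_t)(((uint64_t)cycles * 1000000u) / cpu_hz);
}
```

For 87 keys and 32 layers this gives 2697 extra reads. At an assumed 16 MHz, 10,000 cycles come out to 625 microseconds, and at 8 MHz to 1250 microseconds.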
For example, QMK (and if I remember correctly, TMK too) will do a full scan over all 32 layers on each keypress. They'll cache the results until the next press of the same key, so releasing the key, or looking it up while held for whatever reason will be a much cheaper operation. But on keypress, it still goes through every layer, with all the penalties applied. Even for a single key press, we are looking at well over a hundred instructions, considerably more than a few dozen. Pressing more keys, or wanting to handle not just keys that changed state, but those that remained at their previous one will quickly push the cost up high.
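Roughly, the cache-on-press idea looks like the sketch below. To be clear, this is not QMK's or TMK's actual code. Every name here (`pressed_cache`, `keycode_for_event`, `scan_layers`) is hypothetical, and a plain array again stands in for PROGMEM.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_LAYERS 32
#define NUM_KEYS 87
#define KEY_TRANSPARENT 0xFFFFu /* hypothetical "fall through" marker */
#define KEY_NONE 0x0000u

static uint16_t keymap[NUM_LAYERS][NUM_KEYS]; /* stand-in for PROGMEM */
static uint16_t pressed_cache[NUM_KEYS];      /* keycode resolved at press time */
static uint32_t full_scans;                   /* counts the expensive path */

/* The expensive part: walk all active layers from the top down. */
static uint16_t scan_layers(uint32_t active_layers, uint8_t key) {
    full_scans++;
    for (int layer = NUM_LAYERS - 1; layer >= 0; layer--) {
        if (!(active_layers & (UINT32_C(1) << layer))) {
            continue;
        }
        uint16_t code = keymap[layer][key];
        if (code != KEY_TRANSPARENT) {
            return code;
        }
    }
    return KEY_NONE;
}

/* Full scan only on a fresh press; held and released keys reuse the
 * cached result until the same key is pressed again. */
uint16_t keycode_for_event(uint32_t active_layers, uint8_t key, bool just_pressed) {
    if (just_pressed) {
        pressed_cache[key] = scan_layers(active_layers, key);
    }
    return pressed_cache[key];
}
```

In this sketch, a release also resolves to the same keycode as the press, even if the layer state changed while the key was held.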
In Kaleidoscope, we take this a few steps further, and we do more aggressive caching (not going into details here, poke me if interested), so key lookup becomes a single load from SRAM, regardless of how many layers one has. Going from naive to single load from SRAM got us down from >10ms main loop cycles (matrix scan & event handling, debounce not included) to ~2.7ms loops.
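One way to get a lookup down to a single SRAM load is to keep a flattened "effective" keymap and rebuild it only when the layer state changes. The sketch below is an illustration of that general idea under my own assumptions, not Kaleidoscope's actual implementation. The names `effective`, `rebuild_effective_keymap` and `lookup_key` are hypothetical.

```c
#include <stdint.h>

#define NUM_LAYERS 32
#define NUM_KEYS 87
#define KEY_TRANSPARENT 0xFFFFu /* hypothetical "fall through" marker */
#define KEY_NONE 0x0000u

static uint16_t keymap[NUM_LAYERS][NUM_KEYS]; /* stand-in for PROGMEM */
static uint16_t effective[NUM_KEYS];          /* resolved keymap, lives in SRAM */

/* Pay the full layer walk once per layer change, not once per key event. */
void rebuild_effective_keymap(uint32_t active_layers) {
    for (uint8_t key = 0; key < NUM_KEYS; key++) {
        effective[key] = KEY_NONE;
        for (int layer = NUM_LAYERS - 1; layer >= 0; layer--) {
            if (!(active_layers & (UINT32_C(1) << layer))) {
                continue;
            }
            uint16_t code = keymap[layer][key];
            if (code != KEY_TRANSPARENT) {
                effective[key] = code;
                break;
            }
        }
    }
}

/* The hot path: one array read, independent of the number of layers. */
uint16_t lookup_key(uint8_t key) {
    return effective[key];
}
```

The trade-off is SRAM: 87 keys at 2 bytes each is 174 bytes for the cache, and layer switches become more expensive, but they happen far less often than key lookups.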
In short, looking up keys on a layer has non-trivial costs. It may be a few dozen instructions per key, but when you have to do that for every key on the keyboard, each cycle, there will be a noticeable impact. If you add some special flavour (oneshot keys, macros, and so on), you are looking at even higher costs, all of which the article doesn't care about, thus making the comparisons a lot less useful.