How about a lookup table instead of scaling via multiply? Even with an Uno running 240 pixels I can certainly afford a 256-byte table. This allows arbitrary remapping of input to output values, so it can mimic the current scaling, the video scaling, the add-input option, the add-255 option, and the add-254 option we’ve discussed. More importantly, it also allows real gamma correction, not just scaled squaring.
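A minimal sketch of the idea, assuming a plain power-law gamma curve. The names `buildGammaTable` and `applyTable` are hypothetical and not part of any existing library; the point is only that the per-byte cost at refresh time becomes a single indexed read.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: fill a 256-entry table with a power-law gamma curve,
// out = round(255 * (in / 255)^gamma).
void buildGammaTable(uint8_t table[256], float gamma) {
    for (int i = 0; i < 256; ++i) {
        float normalized = i / 255.0f;
        table[i] = static_cast<uint8_t>(std::pow(normalized, gamma) * 255.0f + 0.5f);
    }
}

// Hypothetical helper: remap every byte of a pixel buffer through the table,
// in place. One table read per channel byte, no multiply.
void applyTable(uint8_t *data, size_t len, const uint8_t table[256]) {
    for (size_t i = 0; i < len; ++i) {
        data[i] = table[data[i]];
    }
}
```

With a gamma of 1.0 the table is the identity mapping, which is a handy sanity check before trying something like 2.2.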

There would be a few ways to use this. One could supply a ready-made table to the library, which could come from PROGMEM, arrive over a communications link, or be the result of custom calculations. There could also be a utility function to generate the table from a scaling factor (using any of the above variants). If you want to rapidly rescale a pixel string, regenerating the table for each refresh (at a new scaling factor) would add overhead, but in most cases that should be acceptable.
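Here is a rough sketch of such a utility function. The formulas are my reading of the variants mentioned above and should be treated as assumptions: plain is `(in * scale) >> 8`, video adds 1 whenever both input and scale are nonzero, add-input is `(in * scale + in) >> 8`, and add-constant is `(in * scale + c) >> 8` with `c` set to 255 or 254. `ScaleMode` and `buildScaleTable` are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical names for the scaling variants discussed in this thread.
enum class ScaleMode { Plain, Video, AddInput, AddConstant };

// Fill a 256-entry table so that table[in] is the scaled output for 'in'.
// 'constant' is only used by ScaleMode::AddConstant (e.g. 255 or 254).
void buildScaleTable(uint8_t table[256], uint8_t scale, ScaleMode mode,
                     uint8_t constant = 0) {
    for (uint16_t i = 0; i < 256; ++i) {
        // 255 * 255 = 65025, and adding at most 255 still fits in 16 bits.
        uint16_t product = static_cast<uint16_t>(i * scale);
        uint8_t out;
        switch (mode) {
            case ScaleMode::Video:
                // Never scale a nonzero input to zero unless scale is zero.
                out = static_cast<uint8_t>((product >> 8) + ((i && scale) ? 1 : 0));
                break;
            case ScaleMode::AddInput:
                out = static_cast<uint8_t>((product + i) >> 8);
                break;
            case ScaleMode::AddConstant:
                out = static_cast<uint8_t>((product + constant) >> 8);
                break;
            case ScaleMode::Plain:
            default:
                out = static_cast<uint8_t>(product >> 8);
                break;
        }
        table[i] = out;
    }
}
```

Regenerating the table costs 256 iterations of this loop, independent of how many pixels are on the string, so the per-refresh overhead is fixed.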

Can you create a zero-time-cost lookup table?

One complication might be that if you have to squeeze out every cycle, it could be very useful to align the lookup table on a 256-byte boundary, so that the input value can be used directly as the low byte of the table address. That could be a little tricky in terms of heap space allocation.
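One way to sidestep the heap question, sketched here as an assumption rather than a recommendation, is to allocate the table statically and ask the compiler for the alignment. `remapTable` and `remap` are hypothetical names; `alignas(256)` is standard C++11, though support for such a large alignment is implementation-defined, and on some toolchains the GCC attribute `__attribute__((aligned(256)))` may be the more reliable spelling.

```cpp
#include <cstdint>

// Statically allocated table, aligned so its address has a zero low byte.
// This trades 256 bytes of RAM (plus possible padding) for a fixed location.
alignas(256) static uint8_t remapTable[256];

// With the table aligned, the index is simply the low byte of the address,
// which gives the compiler (or hand-written assembly) the chance to skip
// the 16-bit address addition.
inline uint8_t remap(uint8_t in) {
    return remapTable[in];
}
```

Whether the compiler actually exploits the alignment would need to be checked in the generated assembly; the sketch only guarantees the placement.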