How about a lookup table instead of scaling via multiply?

Even with an Uno running 240 pixels, I can certainly afford a 256-byte table. A table allows arbitrary remapping of input values to output values, so it can mimic the current scaling, the video scaling, and the add-input, add-255, and add-254 options we’ve discussed. More importantly, it also allows real gamma correction, not just scaled squaring.

There would be a few ways to use this. One could supply a ready-made table to the library, whether loaded from PROGMEM, received over a communications link, or produced by custom calculations. There could also be a utility function to generate the table from a scaling factor (using any of the above variants). If you want to change the scaling of a pixel string rapidly, regenerating the table for each refresh (at a new scaling factor) adds overhead, but in most cases that could be acceptable.
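A minimal sketch of such table generators, in portable C++. All function names here are hypothetical and not part of any existing library; the "video" variant is one possible reading of video scaling (nonzero input with nonzero scale never maps to zero), and the gamma variant assumes a simple power curve.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Plain scaling: out = (in * scale) / 256, i.e. the high byte of an 8x8 multiply.
void fillScaleTable(uint8_t table[256], uint8_t scale) {
    for (int i = 0; i < 256; ++i) {
        table[i] = static_cast<uint8_t>((i * scale) >> 8);
    }
}

// Assumed "video" variant: add 1 when both input and scale are nonzero,
// so a lit pixel is never scaled all the way down to black.
void fillVideoScaleTable(uint8_t table[256], uint8_t scale) {
    for (int i = 0; i < 256; ++i) {
        int v = (i * scale) >> 8;
        if (i != 0 && scale != 0) v += 1;  // maximum is 254 + 1 = 255, no overflow
        table[i] = static_cast<uint8_t>(v);
    }
}

// Gamma correction combined with scaling: out = (in / 255)^gamma * maxOut, rounded.
void fillGammaTable(uint8_t table[256], float gamma, uint8_t maxOut) {
    for (int i = 0; i < 256; ++i) {
        float normalized = static_cast<float>(i) / 255.0f;
        float corrected = std::pow(normalized, gamma) * static_cast<float>(maxOut);
        table[i] = static_cast<uint8_t>(corrected + 0.5f);
    }
}

// Remap a buffer of channel bytes in place through the table.
void applyTable(const uint8_t table[256], uint8_t* data, size_t count) {
    for (size_t i = 0; i < count; ++i) {
        data[i] = table[data[i]];
    }
}
```

Whichever generator fills the table, the per-byte cost at output time is the same single lookup; only the regeneration cost differs, and the gamma version with floating-point `pow` would be the slowest to rebuild on an AVR.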

Can you create a zero-time-cost lookup table?

One complication: if you have to squeeze out every cycle, it could be very useful to align the lookup table on a 256-byte boundary, which could be a little tricky in terms of heap allocation.

There are a couple of good ideas in this neighborhood: arbitrary ‘grayscale’/‘gamma’ mapping curves, and (my favorite) “indexed color”.

In the “indexed color” scenario, you’d make a color lookup table (CLUT) of up to 256 entries, where each entry is an RGB color. Then you could specify your LED pixels with just one byte each: the value would be an index into the CLUT.
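As a rough illustration of the indexed-color idea, here is a sketch in portable C++. The `RGB` struct and `expandIndexed` function are hypothetical names, and this version expands the indices into a separate RGB buffer, whereas a zero-overhead implementation would presumably do the lookup inside the output loop instead.

```cpp
#include <cstddef>
#include <cstdint>

// One palette entry. Hypothetical type, for illustration only.
struct RGB {
    uint8_t r, g, b;
};

// Expand one-byte-per-pixel indices into RGB values via the palette.
// The caller must ensure every index is smaller than the palette size.
void expandIndexed(const RGB* palette, const uint8_t* indices,
                   size_t count, RGB* out) {
    for (size_t i = 0; i < count; ++i) {
        out[i] = palette[indices[i]];
    }
}
```

The pixel data shrinks to one byte per pixel, at the cost of three bytes per palette entry, so a 16-entry palette would take 48 bytes.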

We’re going to be playing around with these ideas, and see which of them we can implement with “zero runtime overhead” – but that work is going to come in July.

Sounds interesting. I’m rooting for the 8-bit to 8-bit lookup table option (for gamma and scaling), whether or not you do the color lookup table as well.

@Kasper_Kamperman is also a fan of using a proper gamma table, and it’s hard to argue with the ‘correctness’ of it. On the other hand, I’m pretty sure we don’t have the cycles to do three (R, G, B) table lookups per pixel for brightness scaling with zero runtime overhead. Anyway, we’ll see if any new ideas surface before we get back into that part of the code next month.

If the LUT is 256-byte aligned, with the high byte of its address in R27, what about:

LD R26, Y+    ; fetch the next data byte into the low byte of X (R27:R26)
…             ; intersperse something else if you need to
LD R26, X     ; replace the byte with its lookup table entry
…             ; use the value

In other words, can you replace the two cycle multiply of each data byte with a two cycle lookup table fetch for each byte? (Your code may differ from the above sketch because of your RGB ordering flexibility)

That’s the idea, yep.

Speak for yourself about affording a 256-byte lookup table - I push 500-600 LEDs with an Uno at times, and that’s pushing my RAM limits as it is :slight_smile:

The problem with 3 table lookups isn’t just cycles, it is registers. There are only 3 pointer registers on AVR that can be used for memory lookups - and one is busy iterating over your data :slight_smile: And not all load operations are available with all three registers (thank you Atmel, by which I mean, die in a fire :slight_smile:)

If we can get the register juggling right, this is an interesting option, though (also, randomly, keep in mind that a PROGMEM read costs an extra cycle). However, I’m not sure how to force something to be 256-byte aligned with avr-gcc.
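One possible way to request that alignment is the C++11 `alignas` specifier (GCC also accepts `__attribute__((aligned(256)))`) on a statically allocated table. The sketch below is portable C++ with hypothetical names; whether a particular avr-gcc version and linker script honor the request is an assumption worth confirming in the map file. It also shows why alignment helps: with the low address byte at zero, the entry address is just the base with the data value dropped into the low byte.

```cpp
#include <cstdint>

// Statically allocated table, requested to start on a 256-byte boundary.
alignas(256) static uint8_t gLut[256];

// True when the low byte of the table's address is zero.
bool lutIsPageAligned() {
    return (reinterpret_cast<uintptr_t>(gLut) & 0xFF) == 0;
}

// With an aligned base, OR-ing the value into the address equals adding it.
// This mirrors loading the data byte into the low half of a pointer register.
uint8_t lookupViaLowByte(uint8_t value) {
    uintptr_t base = reinterpret_cast<uintptr_t>(gLut);
    const uint8_t* entry = reinterpret_cast<const uint8_t*>(base | value);
    return *entry;
}
```

A static table sidesteps the heap entirely, though it does commit the 256 bytes of RAM for the life of the program.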

Also, as it is, I’m running into register limitations on ARM trying to get multi-WS2811 output working - I’m not sure I have a register to spare for the brightness lookup table. (And I want to be cautious about getting into “you can use this function with these chipsets but not with those”.) Some things may be better done outside the show loop.

I understand. I don’t expect to get everything I want :slight_smile:

Running 600 pixels (1800 channels) on an Uno is certainly stretching it and would not leave room for a lookup table. I’m not suggesting that the LUT version is the best for every situation. But a Mega clone with 8K RAM is only $15 delivered now, so typically a system with $400 of pixels, power supplies and cabling can afford to upgrade to one.

Just to clarify - my preferred option needs only one lookup table - it’s the 8 to 24 bit CLUT that takes three.

PROGMEM would be one of several sources for loading a RAM-based lookup table; I’m not suggesting using it during the display loop. Often the lookup table only needs to be loaded once for gamma, or only occasionally.

I do understand that every new function (like the option of using a gamma lookup table in place of the multiply) makes your multi-platform development more complex. I hope that the ARM has enough spare cycles to reload registers from memory more often than one can afford to do on an AVR - but y’all are the experts on that so I’m just guessing/hoping.

As for allocating an aligned lookup table (if the cycles are needed), that would take some experimentation. For the Arduino, for example, I don’t know the memory manager well enough to resolve it offhand. But for initial testing one could allocate a 512-byte buffer, and then use the 256 bytes within it that start on a 256-byte address boundary.
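The over-allocation approach described above can be sketched as follows, in portable C++ with hypothetical names. Rounding the buffer's start address up to the next multiple of 256 skips at most 255 bytes, so a full 256-byte table always fits inside the 512-byte buffer.

```cpp
#include <cstdint>

// Oversized scratch buffer; the aligned table lives somewhere inside it.
static uint8_t gRawBuffer[512];

// Return the first address at or after `raw` that is a multiple of 256.
uint8_t* alignedTableIn(uint8_t* raw) {
    uintptr_t addr = reinterpret_cast<uintptr_t>(raw);
    uintptr_t aligned = (addr + 255u) & ~static_cast<uintptr_t>(255u);
    return reinterpret_cast<uint8_t*>(aligned);
}
```

Calling `alignedTableIn(gRawBuffer)` yields the table pointer; the price is up to 256 wasted bytes, which is why this is only suggested for initial testing.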

Later I am thinking there could be approaches which allocate and free blocks to get just the desired size without too much fragmentation.

Anyway - this is all for July or later (if ever).

As @Mark_Kriegsman mentioned, I used a LUT for the gamma correction. I’ve implemented it in my little MoodLight library (which is just a LUT and HSV2RGB conversion code). I think you can easily combine it with the FAST_SPI 1/2 library. Of course, none of this is optimized for speed, but it is sufficient for a lot of cases.