So I recently upgraded to a Bowden setup on my delta and have been trying to push the boundaries of my print speed. Unfortunately I seem to have hit an issue with Redeem's ability to feed G-code to the path planner fast enough to keep up with the desired print speed. I have played around with the planning buffer settings but haven't had much luck. The symptoms are stop-start printing while the data transfer rate (shown in OctoPrint) stays approximately constant. I'm trying to run at around 150 mm/s. Has anyone else had this? Should we be looking at further optimizations of the Python G-code pre-processor, or was this already done to satisfaction by Anthony Clay? Or am I looking in the wrong direction?
PS: I can move this discussion to the Redeem repo if you like.
Daryl, this is interesting! First of all, there is profiling built into Redeem: start it from the command line with the argument "profile" and you should get a list of what uses time when you stop the program with ^C. The actual reception of the G-codes over Ethernet can probably be optimized, for the "Pipe" class at least. Also, the Path class is a bit bloated; ideally it should work entirely without dictionaries, relying on lists/tuples wherever possible. Note that path segments are always 1 mm for deltas; this is in contrast with less powerful firmwares that cut the path into as many segments as possible given the time available for the calculations. The delta calculations themselves are done in C and are pretty well optimized, I think, so that might not be the first place to look.
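(Not Redeem's actual "profile" flag, but in case it helps: a generic sketch of getting that kind of report from the stdlib profiler. All names here are illustrative stand-ins.)

```python
import cProfile
import io
import pstats

def process_gcode_lines(lines):
    """Stand-in for the real G-code processing loop."""
    total = 0
    for line in lines:
        total += len(line.split())
    return total

def profile_call(fn, *args):
    """Run fn under cProfile and return a text report of the top hits."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        fn(*args)
    finally:
        profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
    return out.getvalue()

print(profile_call(process_gcode_lines, ["G1 X10 Y10"] * 10000))
```

The same report is what you want to read bottom-up: anything with large cumulative time in a tight loop is a candidate for moving out of Python.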
It's always good to profile before optimising blindly, so thanks for the pointer! I may also try feeding the G-code directly to Redeem without going through OctoPrint, to see whether that is the bottleneck.
Running a snippet of the offending G-code reproduces the same symptoms without OctoPrint. Profiling with yappi gives the output at the following link:
http://pastebin.com/raw/gknwKBaJ
It appears that the Delta calculations are chewing up a fair bit of time. Maybe a port of the Delta class to C++ is in order?
So I have implemented all the delta calculations in a C++ module. Testing via a Python script (a modification of test_redeem.py) indicated that the problem was solved. Before making a pull request I packaged my changes into a .deb and installed it. I then tested through OctoPrint and the stop-start symptoms returned! More investigation is required before it gets back into the main repo, I think…
I am thinking that for high speed processing with a delta printer we may need to wrap the entire motion control system into C++. Is this something that would be acceptable? I know that for most people this is probably unnecessary and having it in python is significantly easier to maintain and develop. What do you think Elias?
Daryl, I think it is a good idea to be able to put anything that requires optimization into C++, and Python has a lot of ways of doing that. The delta calculations are one thing I think would be ideal for it, so that patch is very welcome. It is a big change though, so I've merged develop into master: that way we can play around with it in the develop branch and still be able to patch stuff in master. Just submit a pull request and I can merge it into develop.
@Elias_Bakken, @Daryl_Bond, maybe a stupid idea, but… what if we built a "reslicer" that could parse the G-code before it was run and transform the G1 moves from head coordinates into motor (sort of Cartesian) coordinates? It might take a bit longer, but it could run on the fly and buffer things for a while during print-head and bed warm-up. Then each motor would be driven with its absolute position in Cartesian mode. Maybe a silly idea.
It's not a bad idea, but wouldn't that just add the time spent before the print starts instead of spending it while printing? I think the right thing to do is to find the culprit that is causing the delay. I mean, there is a 1 GHz CPU with an FPU here; it should be just a matter of tweaking a small part of the source code.
If a 16 MHz 8-bit AVR can do it, how hard could it be?
Elias, maybe. But a 16 MHz 8-bit AVR can only do it so fast, and ARM chips also have an inherent architectural slowdown when it comes to floating-point calculations, which is exactly what the delta transformations are. I agree it could be a problem if it needs to scan the whole file ahead of time… but does it? What if we just had a buffer of the "corrected" commands that gets filled on the fly by a separate thread that starts as soon as you hit print? The buffer would fill during the warm-up of the print bed and hotend and keep filling while the print is ongoing. With the buffer keeping the printer busy, on-the-fly compute speed becomes less critical, as long as there's enough data in the buffer.
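A minimal sketch of that producer/consumer idea in Python, assuming a stand-in `transform` function (names are illustrative, not Redeem's API):

```python
import queue
import threading

def transform(cmd):
    """Stand-in for the expensive delta transformation of one move."""
    return cmd.upper()

def run_buffered(commands, maxsize=1000):
    """Fill a buffer of pre-transformed commands from a worker thread,
    then drain it, simulating the consumer that feeds the printer."""
    buf = queue.Queue(maxsize=maxsize)  # fills during bed/hotend warm-up

    def producer():
        for cmd in commands:
            buf.put(transform(cmd))  # blocks if the buffer is full
        buf.put(None)  # sentinel: no more moves

    worker = threading.Thread(target=producer)
    worker.start()
    results = []
    while (item := buf.get()) is not None:
        results.append(item)
    worker.join()
    return results

print(run_buffered(["g1 x10", "g1 y10", "g1 z5"]))
```

The bounded queue is the point: `put` blocks when the buffer is full, so the producer naturally throttles itself instead of pre-processing the whole file.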
The order of operations in my C++ port (still a WIP), which follows the python code pretty closely, is as follows:
1. Import a move command. This is in world space.
2. Apply soft limits and bed compensation.
3. If required, split the command up into segments.
4. Transform the x, y, z components of the move into joint coordinates.
5. Figure out the discrete stepper motor movement.
6. Transform back into real-world coordinates, but keep a copy of the stepper motor moves.
7. Update our machine state (its position) with the real-world coordinates and feed the stepper moves to the PRU queue for eventual dispatch to our motors.
So, if I am interpreting your proposal correctly, what you are talking about splitting out is steps 1 to 6. Steps 4 to 6 are the expensive part, but only really expensive on the delta platform.
We are already buffering to some degree before the print starts, but the bigger the buffer the longer the delay between user input and response. Inserting another buffer, say after step 6, would probably allow for faster execution with shorter delay (have a massive first buffer but only a small second buffer). You would also have to allow for a means of bypassing the first buffer to allow interrupts. Every time you interrupted you would have to re-calculate the first buffer as all the acceleration planning would be off.
Overall I think that it is probably not worth the extra complexity. The existing python code works very well up to reasonable speeds. I am trying to print at over 150mm/s on a delta printer, which gives the heaviest load, and I am only just hitting the limit of what the python code can do. So porting all of the steps I listed above to C++ should speed things up tremendously and push that processing limit out beyond what anyone would expect to be able to use.
Don’t get me wrong, I think that your idea is a good one and quite a valid suggestion! I just don’t think it is worth the effort when a C++ implementation of the existing method should be way more than enough. These could be famous last words so stay tuned!
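For concreteness, here is a rough sketch of the segmentation and step-quantisation stages, with made-up numbers for segment length and steps-per-metre:

```python
import math

def split_into_segments(start, end, seg_len=0.001):
    """Split a move into segments no longer than seg_len metres
    (1 mm segments, as Redeem uses for deltas). Positions are
    (x, y, z) tuples in metres."""
    dist = math.dist(start, end)
    n = max(1, math.ceil(dist / seg_len))
    return [
        tuple(s + (e - s) * (i + 1) / n for s, e in zip(start, end))
        for i in range(n)
    ]

def to_steps(pos_m, steps_per_m=100000):
    """Quantise a position to whole stepper steps; steps_per_m is a
    made-up illustrative value."""
    return tuple(round(p * steps_per_m) for p in pos_m)

segments = split_into_segments((0.0, 0.0, 0.0), (0.01, 0.0, 0.0))
print(len(segments))           # a 10 mm move becomes 10 segments
print(to_steps(segments[-1]))
```

The rounding in `to_steps` is exactly where the stepper-resolution discussion below comes in: past that quantisation, any extra floating-point precision is wasted.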
Daryl, great summary! I'll copy-paste that explanation into the wiki under "Advanced" as a starting point for anyone else who wants to contribute to the delta. Have you narrowed down the expensive parts so you can start optimizing the most expensive first? I would imagine just getting the code in "Delta.py" strongly typed would make for a big speed-up, but it's up to you how you want to do it!
Thanks Elias. My first attempt was a port of the guts of Delta.py into C++ which helped a lot. It didn’t seem to be enough, however, so now I am moving all the heavy lifting from Path.py into NativePathPlanner.
My idea is that anything that requires actual number crunching should be inside NativePathPlanner (the C++ module). Anything that is just organising move commands, like homing, should be in the python scripts.
Thanks for the summary, Daryl. It makes sense; I guess I'm just afraid of how expensive those floating-point calculations still are on an ARM architecture. I'll freely admit I only know there was a change in ARMv6 or v7 to address this, and I don't know how well it works or even whether the BBB has it. One thing I'd do to test this before optimizing would be to simply convert from metres to, say, micrometres or even smaller, perform the computation there with rounded (integer) values, and see whether the computation speed jumps up or not. This might be better done in C++ though.
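A rough sketch of what I mean, with made-up scale factors; whether the integer version actually wins depends on the CPU and would need measuring on the BBB itself:

```python
import timeit

def float_mm(a, b):
    """Toy computation in floating-point millimetres."""
    return a * 0.5 + b * 0.25

def int_um(a, b):
    """Same toy computation in integer micrometres: inputs are
    pre-scaled by 1000, coefficients by 1000, result scaled back."""
    return (a * 500 + b * 250) // 1000

# Crude micro-benchmark; on a desktop x86 the float path may well win,
# the interesting comparison is on the ARM board.
t_float = timeit.timeit(lambda: float_mm(12.345, 67.891), number=100000)
t_int = timeit.timeit(lambda: int_um(12345, 67891), number=100000)
print(f"float: {t_float:.4f}s  int: {t_int:.4f}s")
```

The round-trip check is the important part: the integer result should match the float result to within one micrometre, which is far below what the steppers can resolve.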
I got a little curious, and Google delivered an interesting benchmark: https://learn.adafruit.com/embedded-linux-board-comparison/performance I think in our use case, though, we ought to consider just how precise the stepper motor can actually get, and round to integers at a level just below that.
Thanks for the link Jon, I hadn’t realised the BBB FP performance was on par with the RasPi.
Yeah, this was really interesting! @Daryl_Bond I think pushing the whole delta segment slicing into C would be a great speed-up, i.e. simply pushing all segments into the path planner. The Path.py class is quite big and uses dicts for input, which I expect adds a lot of overhead. Is this what you are doing?
I will see if I can explain the process that I am implementing by tracking what happens when we make a G0 call.
- Call G0
- Construct a Path instance. This path is just a container for ideal start and end positions and relevant options. There is no modification of the passed in values done in python.
- Call PathPlanner.add_path() on the path we just made. This calls native_planner.queueMove(…) which takes the start and end position plus a bunch of option flags which are in the path instance.
- Perform all the steps I described earlier in this thread in C++.
- Return.
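A minimal Python sketch of that flow, with a stand-in for the native module (class and method names mirror the description above but are illustrative, not the actual Redeem code):

```python
class Path:
    """Plain container for a move: ideal start/end plus option flags.
    No modification of the values happens on the Python side."""
    def __init__(self, start, end, speed, cancelable=False):
        self.start = start
        self.end = end
        self.speed = speed
        self.cancelable = cancelable

class FakeNativePlanner:
    """Stand-in for the C++ NativePathPlanner module."""
    def __init__(self):
        self.queued = []

    def queueMove(self, start, end, speed, cancelable):
        # The real module would segment, transform and emit steps here.
        self.queued.append((start, end, speed, cancelable))

class PathPlanner:
    def __init__(self, native_planner):
        self.native_planner = native_planner

    def add_path(self, path):
        # Hand the raw numbers straight to the native side.
        self.native_planner.queueMove(
            path.start, path.end, path.speed, path.cancelable)

planner = PathPlanner(FakeNativePlanner())
planner.add_path(Path((0, 0, 0), (10, 0, 0), speed=0.15))
print(planner.native_planner.queued)
```

The point of the design is visible here: Python only shuffles containers around, and every number crosses into C++ exactly once.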
The Path.py class is now simply a useful container for handling the un-split paths before passing them off to be implemented by the path planner. Python is really good at doing this sort of organisational job and, as pointed out by Elias, it is done with dicts at the moment. As long as we aren’t manipulating these dicts too frequently then there shouldn’t be too much of a performance hit and we retain flexibility.
The printer state (all the axis positions) is now held within the NativePathPlanner instance. There is a function for querying this so we can get it back into python. A G92 call now just sends a setState(…) call to NativePathPlanner.
Delta.py is now just a container for parameters that have been passed in via the cfg. Delta.cpp contains all the necessary transformation functions; NativePathPlanner holds an instance of it, initialized at the same time as native_planner in PathPlanner.py.
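For reference, the textbook linear-delta inverse kinematics look roughly like this (illustrative only; not necessarily the exact math in Delta.cpp, and the rod length and radius here are made up):

```python
import math

# Made-up geometry: towers on a circle of radius R (metres),
# diagonal rods of length L (metres).
R = 0.144
L = 0.304
TOWER_ANGLES = (90.0, 210.0, 330.0)
TOWERS = [
    (R * math.cos(math.radians(a)), R * math.sin(math.radians(a)))
    for a in TOWER_ANGLES
]

def inverse_kinematics(x, y, z):
    """World (x, y, z) -> carriage heights on the three towers."""
    return tuple(
        z + math.sqrt(L * L - (x - tx) ** 2 - (y - ty) ** 2)
        for tx, ty in TOWERS
    )

# At the centre, all three carriages sit at the same height.
print(inverse_kinematics(0.0, 0.0, 0.0))
```

One square root and a couple of multiplies per tower per segment: cheap in C, but called thousands of times a second at 150 mm/s with 1 mm segments, which is why it shows up in the profiles.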
NOTE: This is just what I have at the moment. I haven’t tested it yet so things may change drastically!
I am working in the following repository if you want to take a look. I’ll get it working before moving a cleaner version over to the develop branch and raising a pull request.
https://bitbucket.org/daryl_bond/redeem/branch/bushbash
If you are wondering about the branch name, I am Australian and to “bush-bash” means to push through heavy foliage. I realised after I named it that for non-Aussies it may seem a bit inappropriate!
Especially since "bash" in Norwegian literally means poop.
So the path planning now takes less time to execute than the reading and dispatch of the G-code, i.e. G0.execute() now takes up more time than PathPlanner.add_path(). I am able to run my test code without hiccups with 'max_buffered_move_time' at less than half a second, as long as it runs through test_redeem.py. If I run the same G-code through OctoPrint or Pronterface I get the dreaded hiccups again. I think this is because the test G-code has sections made up of many VERY small moves, which overloads OctoPrint's ability to push it to Redeem. So it's not a Redeem problem and is therefore out of my scope.
I think a good project to fix this would be a G-code pre-processor plugin for OctoPrint. It could process any G-code and combine very small moves according to a user-defined tolerance. This should reduce the number of moves that have to be communicated and so avoid the problem.
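A toy sketch of the merging idea, treating a run of moves as a polyline and dropping intermediate points closer than the tolerance (a real plugin would also have to redistribute extrusion and handle feedrate changes):

```python
import math

def simplify(points, tol=0.0005):
    """Drop intermediate points closer than tol (metres) to the last
    kept point, so several tiny G1 moves collapse into one longer move.
    Endpoints are always preserved."""
    kept = [points[0]]
    for p in points[1:-1]:
        if math.dist(kept[-1], p) >= tol:
            kept.append(p)
    kept.append(points[-1])
    return kept

tiny = [(0.0, 0.0), (0.0001, 0.0), (0.0002, 0.0), (0.001, 0.0)]
print(simplify(tiny))  # three sub-half-millimetre moves become one
```

With a 0.5 mm tolerance, runs of 0.1 mm curve segments collapse by roughly a factor of five, which is exactly the reduction in line traffic OctoPrint would need here.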
Overall, however, I think I am in a position to clean up the code and raise a pull request very soon!