I have been working on accessing the #beagleboneblack #SPI from userspace by memory-mapping the SPI registers, and I wanted to make a post about it to see if anyone else has been working on something similar so that we can compare notes. This work comes from my observation that, when using the spidev driver, the ioctl() call to TX/RX a SPI message (SPI_IOC_MESSAGE()) takes far longer than the actual time spent performing the TX/RX. I have always mmap()'d the GPIO registers to interact with the GPIO pins, so why not do the same thing with the SPI registers?

I’m still using the spidev driver, but not for the actual transmission. I set up the device tree overlay to mux for SPI0, and I open and configure the SPI0 channel via ioctl() calls to “/dev/spidev1.0”. I also mmap() the SPI registers (MCSPI_CH0CONF/STAT/CTRL, MCSPI_TX0/RX0, etc.) by opening “/dev/mem” and mapping a single 4K page at the base address of 0x48030000. For each message transmission, I set up MCSPI_CH0CONF for the various transmission settings and to lower the CS signal, poll the CHSTAT RX and TX status bits in MCSPI_CH0STAT, and write/read the MCSPI_TX0/RX0 registers to handle the data transmission. After the transmission is done, the CS signal is raised again via MCSPI_CH0CONF.
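To make the setup concrete, here is a minimal sketch of the mapping step. The register offsets are the ones I pulled from the AM335x TRM's McSPI chapter (verify them against your TRM revision), and opening "/dev/mem" requires root:

```c
/* Sketch: map the AM335x McSPI0 register block from userspace.
 * Offsets are from the AM335x TRM McSPI chapter; double-check them
 * against your TRM revision. Requires root to open /dev/mem. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define MCSPI0_BASE    0x48030000u  /* McSPI0 module base address */
#define MCSPI_CH0CONF  0x12Cu       /* channel 0 configuration */
#define MCSPI_CH0STAT  0x130u       /* channel 0 status (RXS/TXS/EOT) */
#define MCSPI_CH0CTRL  0x134u       /* channel 0 enable */
#define MCSPI_TX0      0x138u       /* channel 0 TX register */
#define MCSPI_RX0      0x13Cu       /* channel 0 RX register */

static volatile uint32_t *spi_regs;

/* Map a single 4K page of McSPI0 registers; returns 0 on success. */
static int map_spi0(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0)
        return -1;
    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, MCSPI0_BASE);
    close(fd);
    if (p == MAP_FAILED)
        return -1;
    spi_regs = (volatile uint32_t *)p;
    return 0;
}

/* Access a register by its byte offset within the mapped page. */
static inline volatile uint32_t *spi_reg(uint32_t off)
{
    return spi_regs + off / 4;
}
```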

Overall, it works well. I am able to quickly change transmission modes (POL/PHA), speeds (48MHz clock divider), word size, etc. without the ioctl() calls, and I can gate GPIO pins with the CS to talk to multiple SPI slave devices. Since I can control the speed and configuration on-the-fly quickly for each message, I can talk to many different slave devices at different speeds on different messages without the overhead of the ioctl() calls to change the channel settings.
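For the extra chip selects, the idea is just to drive an mmap()'d GPIO pin low around the transfer. A rough sketch, assuming GPIO1 (base and SET/CLEAR offsets from the AM335x TRM; the bank and pin number are placeholders for whatever you have wired up), and assuming `gpio_regs` has been mmap()'d over "/dev/mem" the same way as the SPI block:

```c
/* Sketch: use an mmap()'d GPIO pin as an additional active-low chip
 * select. GPIO1 base and SET/CLEAR register offsets are from the
 * AM335x TRM; the pin number is a placeholder. gpio_regs must already
 * point at the mmap()'d GPIO1 page. */
#include <stdint.h>

#define GPIO1_BASE          0x4804C000u
#define GPIO_CLEARDATAOUT   0x190u   /* write 1s here to drive pins low */
#define GPIO_SETDATAOUT     0x194u   /* write 1s here to drive pins high */

static volatile uint32_t *gpio_regs;  /* mmap()'d GPIO1 register page */

static inline void cs_assert(unsigned pin)   /* active-low CS */
{
    gpio_regs[GPIO_CLEARDATAOUT / 4] = 1u << pin;
}

static inline void cs_release(unsigned pin)
{
    gpio_regs[GPIO_SETDATAOUT / 4] = 1u << pin;
}
```

On real hardware the SET/CLEAR registers are write-only triggers, so each write only affects the bits that are set in the written value.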

Looking at my scope’s output, I can see that the signal looks good (even at the full 48MHz clock speed). I’ve found that if you slam the MCSPI_CH0STAT by polling it over and over, you will receive a bus error. But, if you poll only 500 or 1000 times before timing out, it works out better. Also, you can probably busy-wait a bit more between checks. You’ll want to usleep() immediately prior to sending your message(s) to ensure that you’re sending your SPI data from userspace at the start of your current timeslice. You could use sched_setscheduler() to boost the priority of your SPI comm thread to real-time priority, but I didn’t see the benefit in this. If you need real-time guarantees, you can always use the PRU to talk to the SPI registers.
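The bounded polling I described looks roughly like this. The status bit positions (RXS=0, TXS=1, EOT=2) are from the AM335x TRM, and `MCSPI_CH0STAT_REG` is assumed to be a pointer into the mmap()'d register page:

```c
/* Sketch: bounded poll of MCSPI_CH0STAT instead of spinning forever.
 * Bit positions (RXS=0, TXS=1, EOT=2) are from the AM335x TRM.
 * MCSPI_CH0STAT_REG is assumed to point into the mmap()'d page. */
#include <stdint.h>

#define CH0STAT_RXS  (1u << 0)  /* RX register full */
#define CH0STAT_TXS  (1u << 1)  /* TX register empty */
#define CH0STAT_EOT  (1u << 2)  /* end of transfer */
#define POLL_LIMIT   1000       /* ~500-1000 iterations works well */

static volatile uint32_t *MCSPI_CH0STAT_REG;

/* Returns 0 once 'mask' is set in CH0STAT, or -1 on timeout. */
static int wait_status(uint32_t mask)
{
    for (int i = 0; i < POLL_LIMIT; i++) {
        if (*MCSPI_CH0STAT_REG & mask)
            return 0;
        /* optionally busy-wait briefly here to ease bus pressure */
    }
    return -1;
}
```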

The spidev driver will outperform the userspace approach for larger transmissions. In my measurements, the break-even point for an 8-bit word message sent at 12MHz is about 250 bytes. At 50 bytes, it takes 90us in userspace and 140us in the kernel (36% faster). At 100 bytes, this becomes 167us/231us (28% faster). At 160 bytes, DMA turns on inside the kernel driver and the time becomes 262us/321us (18% faster). At 300 bytes, the timing is 475us/456us (4% slower). But, userspace will still allow you to implement multiple chip select lines via GPIOs and quickly change transmission parameters between messages.

How very non-Linux-y of you. I’m curious how the SPI driver writer folks look at this. Do you feel the kernel driver could be better optimized or that this is a don’t-care for most people who demand to use the kernel driver?

BTW, have you shown this to @Alex_Hiam for inclusion in PyBBIO?

Very cool, I love a good mmap hack.

@Jason_Kridner I actually just ditched mmap for GPIO from PyBBIO in favor of Kernel drivers to keep it nice and Kernel friendly. (http://www.alexanderhiam.com/blog/pybbio-update-version-0-8-5/)

Perhaps you could help improve the spidev driver with what you’ve figured out?

spidev is a generic interface for implementing a SPI driver in userspace. This is a clean and flexible approach, though it carries the penalty overhead of the ioctl() calls. Any system call carries the penalty of the userspace/kernel context switch, which is one of the reasons why I/O and memory management carry such heavy performance penalties. When you remove the ioctl() overhead, SPI becomes far more efficient. But, you lose that clean interface in the process and must deal with the registers directly.

The functionality that I am using mirrors the PIO path in the driver implementation in drivers/spi/spi-omap2-mcspi.c in the kernel, so the kernel driver devs won’t see much that is new or exciting in my approach. Right now, I’m not using the “turbo” mode for RX, there is no DMA, etc., so the kernel driver will out-perform the userspace approach in some cases. The benefits to what I am doing are the elimination of the ioctl() overhead and using GPIOs to implement additional CS lines. For short messages, the time savings start to stack up.

If engineers are using the BBB as an eval platform for testing the AM3359 as an embedded controller, especially if they are interested in using the PRU for talking SPI without implementing SPI via bitbanging GPIOs, this approach will definitely be of interest.

So, if there are engineers on the BeagleBoard group complaining about SPI performance, this mmap() approach is a possible solution. It can keep up with SPI at 48MHz, and you could queue up a bunch of messages at different speeds to different SPI slaves and fire them out quickly.

This is really the wrong way to go about doing things. If you want higher speed, write a driver for your device. It’s like exclaiming how good you’ve gotten at pounding in nails with your head.

Or, if you are convinced that you need to be using spidev for some reason, use the read/write interface instead of the ioctl one. I think you should be able to queue up many smaller operations in one system call.

And of course a quick look at the spidev documentation shows you can issue as many requests with a single ioctl as you want. Hey look, a hammer :slight_smile:
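For reference, batching looks something like this with the stock spidev interface: `SPI_IOC_MESSAGE(N)` takes an array of `struct spi_ioc_transfer`, and `delay_usecs` inserts a pause after each transfer. This is just a sketch with placeholder lengths, speed, and delay values:

```c
/* Sketch: submit several transfers in one SPI_IOC_MESSAGE() ioctl via
 * spidev. Lengths, speed, and delay values are placeholders. */
#include <fcntl.h>
#include <linux/spi/spidev.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Send n 4-byte transfers in a single system call. */
int send_batch(int fd, uint8_t (*tx)[4], uint8_t (*rx)[4], int n)
{
    struct spi_ioc_transfer xfers[8];

    if (n > 8)
        return -1;
    memset(xfers, 0, sizeof(xfers));
    for (int i = 0; i < n; i++) {
        xfers[i].tx_buf = (uintptr_t)tx[i];
        xfers[i].rx_buf = (uintptr_t)rx[i];
        xfers[i].len = 4;
        xfers[i].speed_hz = 12000000;
        xfers[i].bits_per_word = 8;
        xfers[i].delay_usecs = 10;   /* pause after each transfer */
        xfers[i].cs_change = 1;      /* toggle CS between transfers */
    }
    /* One user/kernel transition submits all n transfers. */
    return ioctl(fd, SPI_IOC_MESSAGE(n), xfers);
}
```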

I need to send and receive one short SPI message once per millisecond or so between an application and an external SPI device. I have a variety of other tasks that must also occur during that millisecond within the application. I can’t queue the SPI messages because they occur spaced out over time. If I use my own driver, I am still using either read/write or an ioctl to flip between user space and kernel space to get the data that I need from that driver. My goal was to eliminate that user-to-kernel space context switch, because that represents the majority of the overhead (50us or so, about 5% of the 1ms time budget). I’m not knocking spidev. The TX/RX speed is fine.

If you need it every 1ms, you can specify a delay in the ioctl ‘delay_usecs’.

I don’t know what the contents of the message will be ahead of the time that I need to send it and get the response. While the length remains the same each time, both the TX and RX data will differ on each message. The next TX depends upon the response of the previous RX, and so on. I can’t queue them up. I have to generate a new message each time, which means no piggy-backing multiple messages on one kernel context switch.

@Andrew_Henderson the “Linux-y” way might be to have the driver do the calculation of the next contents, rather than putting that in userspace. You can parameterize your driver as needed.

If it is a low-latency real-time task, you’ll likely get less ire from the Linux community by moving the same task to the PRUs.

I don’t think there is anything wrong with doing a once-off of an mmap() solution. Where things start to go wrong is when others blindly copy your approach.

@Jason_Kridner , thanks. Sorry to have raised a ruckus.

What’s the acceptable jitter for the task?

It is a pretty small window, since it is interacting with a custom controller board for a mechanical coil. I can’t tell the difference, myself, but sliding maybe 2 or 3 us either way within each 1ms soft realtime event cycle is probably good enough.

The event cycle is 1ms long, with the SPI comms occurring right at the start of the cycle (less likely to be interrupted by a context switch at that point). My concern is that the activity within the cycle will exceed the 1ms time budget. It is pretty close at the moment. There is a lot of processing that occurs during the cycle, some interaction with the framebuffer and ALSA for multimedia feedback, and checking and setting some GPIOs. gprof shows that I’m CPU-bound in areas where I can’t cut corners for most of the cycle’s time.

Any time left after all of that goes to a usleep() until the start of the next 1ms cycle. The problem that prompted all of this fiddling is that there are so many factors for a user space process that occasionally I’d blow past the limit of the 1ms cycle and never even usleep(). Every time that the cycle gets blown, everything slides further behind. If I could knock out 5% (50us) of the cycle’s 1ms limit by eliminating the one SPI ioctl() call, that was well worth the mmap() experimentation.

I’ve tuned most of the other aspects quite a bit for performance using gprof before I started working with the mmap’d SPI. My process is the init process for the system, but I’ve also run the system with the standard init setup as a process nice’d to -20 as well to try and smooth its performance out.

You’d have a much better time of meeting your jitter budget in-kernel, but you still may occasionally miss windows. As far as scheduling things to happen in user space, usleep() is certainly the wrong call to be using.

But yes, this is a very strange hardware setup to have a spi device with a hard window. It may actually make sense here to use the PRU and bitbang (not use the SPI controller), depending on how things are wired up.

you might consider ‘chrt’ http://linux.die.net/man/1/chrt

@Russ_Dill what about assigning the SPI peripheral to the PRU?

@Russ_Dill , I think that I’m stuck with the jitter problem unless I take serious measures to lock down the timing. For now, I think that I can live with some jitter if I can keep everything on-budget with the timing. If I oversleep on one cycle, I undersleep on the next, so it all sloshes around the 1ms mark. Should I be using nanosleep() for this?

@Jason_Kridner, I actually did look at chrt, but by using the sched_setscheduler() call to do the same thing programmatically. I decided against it for now because I am just not sure how much I will unintentionally impact other aspects of the system by doing so. I’ll have to experiment with it once the system is “working” to see how I can better reduce the variances from cycle to cycle. I’m always afraid that I’ll lock the system hard by starving something out…

Again, I urge you to consider putting this in the kernel, but hey, there are options within userspace to improve timing. Use clock_gettime(CLOCK_MONOTONIC, …) with nanosleep(). Use sched_setscheduler() to set an RT priority. Use mlockall() to keep your text pages from being dropped from memory. However, even on a fast, multi-CPU system, this will give you jitter on the order of 20us.
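Put together, those suggestions look roughly like the sketch below: lock memory, request an RT priority, and pace the 1ms cycle with clock_nanosleep() against an absolute CLOCK_MONOTONIC deadline so that oversleeping in one cycle doesn’t accumulate. The priority value is arbitrary, and setting it needs root (or CAP_SYS_NICE):

```c
/* Sketch: userspace timing setup per the advice above. Lock pages in
 * RAM, ask for SCHED_FIFO (may fail without root; ignored here), and
 * sleep to an absolute monotonic deadline each 1 ms cycle. */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/mman.h>
#include <time.h>

#define CYCLE_NS 1000000L  /* 1 ms cycle */

/* Advance a timespec by ns, normalizing the nanosecond field. */
static void timespec_add_ns(struct timespec *t, long ns)
{
    t->tv_nsec += ns;
    while (t->tv_nsec >= 1000000000L) {
        t->tv_nsec -= 1000000000L;
        t->tv_sec += 1;
    }
}

static void run_cycles(void (*do_cycle)(void), int count)
{
    struct timespec next;

    mlockall(MCL_CURRENT | MCL_FUTURE);        /* pin pages in RAM */
    struct sched_param sp = { .sched_priority = 50 };
    sched_setscheduler(0, SCHED_FIFO, &sp);    /* may fail without root */

    clock_gettime(CLOCK_MONOTONIC, &next);
    for (int i = 0; i < count; i++) {
        do_cycle();                            /* SPI comms + processing */
        timespec_add_ns(&next, CYCLE_NS);
        /* Absolute deadline: oversleep doesn't accumulate cycle to cycle. */
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
}
```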

Thank you very much to everyone for your suggestions and information!