
Reddit mentions of Digital Computer Electronics

Sentiment score: 1
Reddit mentions: 1

We found 1 Reddit mentions of Digital Computer Electronics. Here are the top ones.

Digital Computer Electronics
Specs:
  • Height: 27.99207 inches
  • Length: 21.69287 inches
  • Weight: 2.51 pounds
  • Width: 2.00787 inches


Found 1 comment on Digital Computer Electronics:

u/csp256 · 1 point · r/computervision

Now you've doubled your memory usage. That is not necessary.

Computers do things at a set speed. If you want fast code, you have to tell the computer to do fewer things. That means you need to be efficient: get a lot done with fewer instructions and less waste. If you just minimize the wasted effort you'll end up in a good spot.

That's going to be hard to do if you don't know what instructions actually are and don't understand basic computer architecture.

It's also why "threading will make it go faster" is so fallacious. Threading doesn't make things faster per se; it attempts to minimize latency by using more resources, often introducing overhead that decreases total system throughput. Using 8 cores to get a 6x speedup might seem like a good idea, until you realize you're introducing latency to every other process and your total CPU time is 33% higher. It also introduces all sorts of weird cache effects which can end up violating assumptions relevant to performance, both in your code and in other code running on the same system. Unless you need to minimize latency, focus on writing the best single-threaded code you can.

Do not write performance sensitive code, such as image processing primitives, in Python. Write it in modern C++. You can call your compiled C++ function from Python. You can reasonably expect to see an order of magnitude improvement just from doing this. It also opens all sorts of doors when it comes to performance tuning.
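As a sketch of that workflow (all names here are hypothetical, not from the original comment), a primitive can be given C linkage so Python can load it with `ctypes`:

```cpp
#include <cstdint>

// Hypothetical sketch: a trivial image primitive with C linkage so that
// Python can load it from a shared library via ctypes. Build with, e.g.:
//   g++ -O3 -shared -fPIC invert.cpp -o libinvert.so
extern "C" void invert_u8(const std::uint8_t* in, std::uint8_t* out, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = static_cast<std::uint8_t>(255 - in[i]);
}
```

From Python you would then do something like `ctypes.CDLL("./libinvert.so").invert_u8(...)`, passing NumPy buffers via `ndarray.ctypes`. (pybind11 is a more ergonomic alternative for real projects.)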

Allocating and deallocating memory (I'm not very good at Python but I think that's what you're doing with temp=[]) takes time. You know what takes zero time? Declaring an array with a certain fixed size, up front. Alternatively you can use a std::vector<> and immediately .reserve() the size you need. (Don't use .resize() unless you need it.) At the very least you don't need to clear and reappend to your buffer.
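A minimal sketch of the reserve-up-front pattern (function name is illustrative):

```cpp
#include <vector>

// Sketch: reserve() performs one allocation up front while leaving size at 0,
// so the append loop never reallocates. resize() would additionally
// value-initialize every element, which is wasted work if you are about to
// overwrite them anyway.
std::vector<int> make_buffer(int n) {
    std::vector<int> buf;
    buf.reserve(n);            // one allocation, no element construction
    for (int i = 0; i < n; ++i)
        buf.push_back(i * i);  // guaranteed no reallocation inside the loop
    return buf;
}
```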

You do not need to sort the elements within the 3x3 window. You need to find their median. nth_element() can do this. As can a heap of 5 elements (for a 3x3 filter) you conditionally insert into. (That is the solution to a classic interview question, actually.)
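The nth_element approach looks roughly like this for a 3x3 window (a sketch, not the author's code):

```cpp
#include <algorithm>
#include <cstdint>

// Sketch: median of a 3x3 window without a full sort. std::nth_element
// partially partitions the range so that index 4 (0-based) holds exactly
// the value it would have after sorting; the elements around it are left
// unordered, which is all a median filter needs.
std::uint8_t median9(std::uint8_t w[9]) {
    std::nth_element(w, w + 4, w + 9);
    return w[4];
}
```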

It is unlikely that .sort() is optimized for your specific case. With so few elements, something like insertion sort, or even bubble sort, will likely be faster.
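For reference, insertion sort on a fixed 9-element window is only a few lines (a sketch, with a hypothetical function name):

```cpp
#include <cstdint>

// Sketch: insertion sort specialized to a 9-element window. For inputs this
// small, the simple shift loop avoids the setup overhead of a general-purpose
// sort and tends to branch predictably on nearly-sorted data.
void insertion_sort9(std::uint8_t w[9]) {
    for (int i = 1; i < 9; ++i) {
        std::uint8_t v = w[i];
        int j = i - 1;
        while (j >= 0 && w[j] > v) {  // shift larger elements right
            w[j + 1] = w[j];
            --j;
        }
        w[j + 1] = v;
    }
}
```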

Your current formulation cannot take advantage of SIMD, which leaves a lot of performance on the table. Since images are often single-channel with 8-bit resolution, and common vector sizes are 128 bits, you could be leaving 16x performance on the table by not exploiting SIMD instructions. If you don't know what SIMD is, you need to go fix that.

(Nitpick: "filter_size // 2" is correct, but why not just do a bitshift? I'm not sure about Python's ability to make the conversion for you.)

You are biasing the filter near the border of the image. By inserting zeros into your buffer instead of only looking at valid pixels you are biasing your filter towards darker values at the borders. You could do some tricky things to find the median of the valid pixels only, but I would recommend just not having the filter be defined there. In computer vision maximizing reliability is often a core focus, so it is often better to just let the output be smaller than the input. Zero-bias error is a really, really nice property to have: don't accidentally lose it over something so trivial.

I'm not that savvy with Python but I'm pretty sure that in "for j in range(len(data[0])):" the len() is being evaluated in each iteration of the "i" loop around it. Compute this once and cache it.

You have multiple if statements in your inner loop. You are guaranteeing that you will get multiple branch mispredictions here. Even if you somehow avoided them, you're checking for an edge condition on every single pixel.

There are a couple of ways to avoid your boundary conditions. The most obvious is to just zero pad your data. This is what most people do, and it can be the right thing. But it makes you use more memory and can introduce an image copy. What I like to do is explicitly write the boundary conditions, then go into a loop for the bulk of the image. This increases lines of code but you don't have to compromise on performance.
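The structure described above might look like this (a sketch with a simple 3x3 box average standing in for the median window; here the border is copied through, though the author's preference is to shrink the output to the valid region):

```cpp
#include <cstdint>

// Sketch: split the boundary handling out of the hot loop. The interior is
// processed with no per-pixel edge checks; the one-pixel border is handled
// separately (here, copied through unchanged).
void filter_no_inner_branches(const std::uint8_t* in, std::uint8_t* out,
                              int w, int h) {
    // Border: copy input through unchanged (one option among several).
    for (int x = 0; x < w; ++x) {
        out[x] = in[x];
        out[(h - 1) * w + x] = in[(h - 1) * w + x];
    }
    for (int y = 0; y < h; ++y) {
        out[y * w] = in[y * w];
        out[y * w + w - 1] = in[y * w + w - 1];
    }
    // Interior: no boundary conditions, so no branches in the inner loop.
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            int sum = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    sum += in[(y + dy) * w + (x + dx)];
            out[y * w + x] = static_cast<std::uint8_t>(sum / 9);
        }
}
```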

I had to solve a similar problem recently. It was single-channel, uint16_t data from a very noisy sensor on a system with 128-bit vector width. I needed a 5x5 median filter and decided to use the median of medians approach. Median of medians gives a result whose position is guaranteed to be within 10% of the position of the median in a sorted list. That is, for a list L of size S which has been sorted to give a list K, it will return an element between K[0.4*S] and K[0.6*S]. Here is how I implemented it:

The image width size was already a multiple of the vector width. I created a buffer of size 5*row_width. I treated this as a cyclic buffer of 5 rows (such that row n would be evicted once I added row n+5 to it). I was provided a separate output buffer.
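The cyclic-buffer indexing described above can be sketched like this (names are illustrative, not from the original):

```cpp
#include <cstdint>
#include <vector>

// Sketch of a 5-row cyclic buffer: image row n lives at slot n % 5, so
// storing row n+5 automatically evicts row n without any copying.
struct RowRing {
    int width;
    std::vector<std::uint16_t> data;  // 5 * width elements
    explicit RowRing(int w) : width(w), data(5 * w) {}
    std::uint16_t* row(int n) { return data.data() + (n % 5) * width; }
};
```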

Before I tell you the next part, realize that C=A<B, where A and B are SIMD vectors, will fill each element of C with all 0 bits or all 1 bits depending on if the result is true or false. This is useful as a bit mask. Perhaps you don't have a vector min instruction and need to synthesize C=min(A,B) like so:

M = A < B;
C = (A & M) | (B & (~M));
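One lane of that trick, emulated with scalar uint16_t (a real SIMD version would compute 8 such lanes at once in a 128-bit register):

```cpp
#include <cstdint>

// One 16-bit lane of the mask trick above: the compare produces all-ones or
// all-zeros, which then selects A or B using pure bitwise ops, no branch.
std::uint16_t mask_min(std::uint16_t a, std::uint16_t b) {
    std::uint16_t m = (a < b) ? 0xFFFF : 0x0000;  // M = A < B, lane-wise
    return (a & m) | (b & static_cast<std::uint16_t>(~m));
}
```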

I first prefill the buffer with 5 rows which have had horizontal median filtering applied. Here is how I do that filtering on each row:

I used 5 overlapping vector loads (unaligned loads are performant on this system) to create 5 vectors of 8 elements each (128/16=8). I then run a parallel median finding network on each (look up "sorting networks"). StackOverflow has some example code:

template<class V>
inline V median(const V &a, const V &b, const V &c)
{
return max(min(a,b),min(c,max(a,b)));
}

template<class V>
inline V median(const V &a, const V &b, const V &c, const V &d, const V &e)
{
V f=max(min(a,b),min(c,d)); // discards lowest from first 4
V g=min(max(a,b),max(c,d)); // discards biggest from first 4
return median(e,f,g);
}

Of course if you are lucky enough to have a med3 instruction you should use that instead of the 3 argument median function.
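Because the templates above are generic, they can be sanity-checked with plain scalar ints and std::min/std::max before dropping in a SIMD vector type with its own min/max overloads (the code is reproduced here so the check is self-contained):

```cpp
#include <algorithm>
using std::min;
using std::max;

// The 3- and 5-input median networks from the comment, unchanged; with a
// SIMD vector type, min/max would resolve to vector overloads instead.
template<class V>
inline V median(const V &a, const V &b, const V &c)
{
    return max(min(a, b), min(c, max(a, b)));
}

template<class V>
inline V median(const V &a, const V &b, const V &c, const V &d, const V &e)
{
    V f = max(min(a, b), min(c, d)); // discards lowest from first 4
    V g = min(max(a, b), max(c, d)); // discards biggest from first 4
    return median(e, f, g);
}
```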

I write the result to the buffer and skip down 8 elements, repeating this process until I fill a full row into the buffer.

After the initial 5 are filled into the circular buffer, I am then ready to output a row of final results. I do this 8 at a time by loading from each of the 5 rows in the circular buffer and running that through the same median finding network. The result is written back in place to the input image. This introduces no RAW hazard because I am reading from 2 rows below it.

I then add another row to the buffer, and then immediately compute one more row of final results (as in the previous paragraph). This continues until I run out of output rows.

Of course I also tweaked how the loops were unrolled.

(Actually, I interleaved these horizontal and vertical median finding operations so I could trade some register pressure for better performance by dodging some vector loads. I only bothered because I was already used to writing ASM on this platform.)

This runs at full resolution, high frame rate on a single (<1 GHz) core while leaving plenty of time for the rest of the processing of that frame before the next one comes in. Its runtime is within a few percent of optimal. I haven't timed your code but I'd be willing to bet it is more than 100x slower.

I suggest learning at least the basics of computer architecture if you want to write performant code. Tools like Godbolt (Compiler Explorer) are indispensable. You're likely not getting within an order of magnitude of optimal if you stick with Python.