opengl architecture with greyscale video?

karl krach:

hi,
i'm currently doing a project dealing only with greyscale video, which is captured from a live input, converted to greyscale and then buffered into a 1-plane matrixset. i'm wondering whether from there it makes sense to go for an opengl architecture (which is hardware accelerated but deals with 4 planes) or whether i should rather stick to my 1-plane matrix using standard jitter. i do need to apply some effects:
- text overlays
- simple graphic overlays (sort of a moving onscreen slider and images with alpha)
- some keying with fixed alpha mattes. the matte itself will be a still image but might need soft edges; all inputs will be greyscale vids.
- some normal fading from one greyscale vid to another.
- very basic image analysis (bang when live input gets dark)

the final output will be in greyscale too, so basically i could go all the way with my 1-plane matrix. on the other hand i keep recording while rapidly shuttling around in the video, and some other stuff is happening too, so i guess i will need some cpu power.

opengl or not? that's the question...

any advice is highly appreciated!

karl krach:

for the record: as nobody answered, i tried to figure it out myself. after a little testing i ended up with opengl, since even with my one-plane matrix the cpu load was way higher doing it with normal jitter...

nesa:

Hey, I've been working on b&w stuff for the past few months, and in my case the gpu way was more efficient, but it really depends on what sort of things you're doing in the shaders.

Since the fragment shaders process all the 'planes' at once, we'd waste three channels, so I was thinking about how I could use this sort of system and I ended up doing this:
- dividing the total width of the texture by 4;
- stuffing each quarter in the corresponding plane;
- doing some basic math to process all four pixels from the four quarters at once:
plane = floor(x / block_size.x)
tc.x = x mod block_size.x
- making a shader to display only one plane (although this could be avoided with some jit.gl.sketch magic, but at the moment I'm so used to shaders that I didn't want to bother);
- arranging four videoplanes over the total width of the window, each videoplane having a shader that displays only the intended colour channel (a rough sketch of the unpack math follows below this list).
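
To make the idea concrete, here's a minimal fragment shader sketch of the unpack math above, done as a single full-width videoplane rather than four separate ones - not the actual patch code, just an illustration that assumes normalized texture coordinates, a default passthrough vertex program and a made-up uniform name for the quarter-width packed texture:

    // R, G, B and A of the packed texture each hold one horizontal quarter
    uniform sampler2D packed_tex;

    void main()
    {
        vec2 tc = gl_TexCoord[0].st;               // coordinate over the full width
        float plane = floor(tc.x * 4.0);           // which quarter: 0..3
        vec2 src = vec2(fract(tc.x * 4.0), tc.y);  // local coordinate inside that quarter
        vec4 p = texture2D(packed_tex, src);
        float grey = (plane < 0.5) ? p.r :
                     (plane < 1.5) ? p.g :
                     (plane < 2.5) ? p.b : p.a;
        gl_FragColor = vec4(vec3(grey), 1.0);
    }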

This stuff is part of the max4live a/v instrument I'm building. Tomorrow I'll do a premiere with it, and later I'd like to release it, so then you can see the code I used. But that won't happen in the next few weeks.

karl krach:

mhm, most interesting, thank you very much. since the stuff i do is quite basic, gpu performance is fine, but if i run into problems i might give that a try too...
good luck tomorrow! :)

vade:

OpenGL should in theory be able to handle 1 plane texture formats without actually expanding to RGBA on the video card. You can do Luma, or Alpha textures, or 2 plane Luma + Alpha. It should be doable. Might be a nice feature request.

nesa:

Yes, you can 'handle' the 1 plane texture formats when you're _sampling_ them, but how would you go about setting the destination texture format and forcing the fragment processor to use only one channel?

I didn't go too deep with this, but before this trick I tried to sample only one channel and then write only to one component of vec4 gl_FragColor. I couldn't notice any speed difference, while with the trick mentioned I could see at least double the framerate.
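
For reference, that earlier attempt looked roughly like this (just a sketch, not the original code) - the destination is still a full RGBA texture, so the same amount of memory moves around either way:

    uniform sampler2D tex0;

    void main()
    {
        float g = texture2D(tex0, gl_TexCoord[0].st).r;  // sample a single channel
        gl_FragColor.r = g;  // write only one component; the pixel is still stored as RGBA
    }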

I also remember reading on GPGPU forums that they had the same efficiency problem, but that was long ago - even before CUDA - I guess you're more up to date on this subject.

vade:

Well, right now in jitter you can't specify that (I don't think), which is why I noted it as a feature request. I'd imagine you would like to be able to do [jit.gl.texture 1 char 640x480] or 1 float32, etc., which would solve your issue.

Not every requested format is stored in that same internal format. On OS X there are some guaranteed 1 and 2 plane formats, but I think drivers on other platforms may not handle the specified-format-to-internal-format mappings the same way, and things just get expanded to RGBA even if you request a 1-channel texture. I suspect these inconsistencies are part of why it's not implemented in Jitter, because it's hard to guarantee how things behave behind the scenes (that's speculation), and since most folks want to work in RGBA, it's rarely an issue, so it's low on the feature list.

But to answer your question, you would set the destination texture format when you create your texture and hope the GL implementation does the right thing. From my limited understanding you can't control how your texture is actually stored, but you can check for extensions that specify it should do X, Y and Z. I suspect things were slower for you (before your shader solution) because you were still moving around 4x the data (RGBA vs just R, or A, or what have you), since the texture was specified RGBA in Jitter.

This has some info discussing a similar topic: http://lists.apple.com/archives/Mac-opengl/2010/Jan/msg00025.html , it might be at least somewhat interesting?

Joshua Kit Clayton:

As for defining textures, yes, you can already define 1 and 2 plane textures with jit.gl.texture @colormode luma, alpha, intensity, or lumalpha. AFAIK, once sampled, these *all* become a vec4 value on the graphics hardware, and all gl_FragColor memory writes are vec4 operations. That's how the hardware is typically designed under the hood, and our destination textures in jit.gl.slab should all assume 4 planes.
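
For illustration, here's roughly what that means in a shader (a sketch of the typical behaviour, with a made-up uniform name, not Jitter internals):

    uniform sampler2D luma_tex;  // source created with @colormode luma

    void main()
    {
        vec4 c = texture2D(luma_tex, gl_TexCoord[0].st);  // arrives as vec4(L, L, L, 1.0)
        gl_FragColor = c;  // and the write is a vec4 to an RGBA destination
    }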

There might be some special way to render to single plane aux buffers or other advanced (and more hardware specific) means, but I believe that CUDA and OpenCL programs often compile down appropriately parallelizable 1 or 2 plane operations to the multiplexed strategy that Nesa suggested.

So my recommendation is to use the 1 or 2 plane texture formats as sources and not worry about the shaders or destination textures until you get to the point where memory access on the graphics card within your shader chain is your bottleneck. At that point, first try to write fewer shaders with more instructions, thus rendering to texture fewer times for subsequent reads. And finally, when that's not enough, and the complexity of writing a specific interleaved-format shader like Nesa mentioned is worth the extra performance boost in your application, go for it.
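
As a concrete example of "fewer shaders with more instructions", the crossfade and the still-matte key from the original post could be done in one pass instead of two chained slabs - just a sketch with made-up uniform names:

    uniform sampler2D vid_a;  // greyscale source A
    uniform sampler2D vid_b;  // greyscale source B
    uniform sampler2D matte;  // still greyscale matte, soft edges allowed
    uniform float xfade;      // 0.0 = only A, 1.0 = only B

    void main()
    {
        vec2 tc = gl_TexCoord[0].st;
        float a = texture2D(vid_a, tc).r;
        float b = texture2D(vid_b, tc).r;
        float m = texture2D(matte, tc).r;
        float g = mix(a, b, xfade) * m;  // fade first, then apply the matte as a mask
        gl_FragColor = vec4(vec3(g), 1.0);
    }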

Oftentimes you might find that your movie decompression is a more significant bottleneck than your GPU arithmetic or memory bandwidth, so worrying about this stuff prematurely is the worst kind of optimization (the kind that doesn't help your real-world situation).

Btw, similar things apply to UYVY textures: if you check out the provided jitter shaders for interpreting UYVY data masquerading as RGBA (and vice versa for readback), you can see some of these strategies for building other colorspaces and data layouts on top of 4-plane RGBA at work.

-Joshua

vade:

Hi Joshua. Thanks for setting me straight about jit.gl.texture, that's great, and sorry I missed that. Does jit.gl.slab respect underlying texture format requests, i.e., can I have a slab that returns an intensity texture?

My understanding is that the vec4 output from gl_FragColor in an FBO that is bound to a texture attachment with a non-RGBA format should be respected, and internally only the necessary channels are written, provided you use the correct texture format mappings:

< R, RG, L, I, A > to < RGBA > mappings:
RED: < R, 0, 0, 1 >
RG: < R, G, 0, 1 >
LUMINANCE: < L, L, L, 1 >
INTENSITY: < I, I, I, I >
ALPHA: < 0, 0, 0, A >

So while you write out a vec4 in your shader, only the required components are kept, and you use less memory in the resulting rendered texture (assuming you have the right GL extensions).

Joshua Kit Clayton:

No render to format support. Only sample from. We enforce an RGBA destination texture.

To be honest, non-RGBA render target support isn't high on our priorities: with various backends (rtt/pbuffer, ctt, fbo) mixed with various hardware platforms and operating systems, etc., I'm not super excited to expose these kinds of things.

But regardless, you maximize both arithmetic and memory bandwidth by doing what Nesa suggests. Otherwise you are wasting a four-way parallel processing pipeline on a single channel, as well as the memory.

However, we have some much more exciting and generally useful things we're working on that will improve GPU pixel processing performance for cascaded operations in a way that isn't so tied to this level of implementation.

vade:

Yeah, I figured all of the backends and various implementation nuances would render some of that stuff moot or at least such a pain it would not be worth it. Thanks for clearing it up :)