a cheap [trunc~] external for windows

January 19, 2010 | 10:47 pm

yes! …and they said i was mad… inline assembly pays off! for anyone running a pc with sse3 capability, you might like to try compiling the following code with vc++. it’s based on the simplemsp~ example. this should give vastly improved performance over the standard [trunc~].

obviously some loop unrolling could be done etc., and i’ve probably missed some obvious opportunities for optimisation (not to mention the fact that the object doesn’t check to see if it’s enabled or not, it just runs…) but i get a massive speed boost on my shitty old laptop!

comments/criticisms welcomed and encouraged!

#include "ext.h"
#include "ext_obex.h"
#include "z_dsp.h"

typedef struct _td_dot_cheaptrunc_tilde {
	t_pxobject ob;
} t_td_dot_cheaptrunc_tilde;

void *td_dot_cheaptrunc_tilde_new(t_symbol *s, long argc, t_atom *argv);
void td_dot_cheaptrunc_tilde_free(t_td_dot_cheaptrunc_tilde *x);
void td_dot_cheaptrunc_tilde_dsp(t_td_dot_cheaptrunc_tilde *x, t_signal **sp, short *count);
t_int *td_dot_cheaptrunc_tilde_perform(t_int *w);

void *td_dot_cheaptrunc_tilde_class;

int main(void) {
	t_class *c;
	c = class_new("td.cheaptrunc~", (method)td_dot_cheaptrunc_tilde_new, (method)dsp_free, (long)sizeof(t_td_dot_cheaptrunc_tilde), 0L, A_GIMME, 0);
	class_addmethod(c, (method)td_dot_cheaptrunc_tilde_dsp, "dsp", A_CANT, 0);
	class_dspinit(c);
	class_register(CLASS_BOX, c);
	td_dot_cheaptrunc_tilde_class = c;
	return 0;
}

void td_dot_cheaptrunc_tilde_dsp(t_td_dot_cheaptrunc_tilde *x, t_signal **sp, short *count) {
	dsp_add(td_dot_cheaptrunc_tilde_perform, 4, x, sp[0]->s_vec, sp[1]->s_vec, sp[0]->s_n);
}

__inline __declspec(naked) t_int *td_dot_cheaptrunc_tilde_perform(t_int *w) {
	__asm {
			push	ebp				; save caller's frame pointer
			mov		ebp,	esp		; enter stack frame
			mov		eax,	[ebp+8]		; eax = *w
			sub		esp,	4		; reserve space for temp
			mov		ebx,	[eax+8]		; ebx = *in
			mov		edx,	[eax+12]	; edx = *out
			mov		ecx,	[eax+16]	; ecx = n
		loopstart:
			fld		dword ptr [ebx]		; st(0) = in[n]
			fisttp	dword ptr [ebp-4]		; temp = (truncatedint) in[n]
			add		ebx,	4		; *in ++
			fild	dword ptr [ebp-4]		; st(0) = temp
			fstp	dword ptr [edx]			; out[n] = (float) st(0)
			add		edx,	4		; *out ++
			dec		ecx			; decrement loop counter
			jnz		loopstart		; loop
			add		eax,	20		; *w += 5
			mov		esp,	ebp		; tidy up stack
			pop		ebp
			ret
	}
}

void *td_dot_cheaptrunc_tilde_new(t_symbol *s, long argc, t_atom *argv) {
	t_td_dot_cheaptrunc_tilde *x = NULL;
	if (x = (t_td_dot_cheaptrunc_tilde *)object_alloc(td_dot_cheaptrunc_tilde_class)) {
		dsp_setup((t_pxobject *)x, 1);
		outlet_new(x, "signal");
	}
	return (x);
}
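
For anyone wondering what the loop unrolling mentioned above might look like, here is a minimal sketch in plain C rather than assembly. The function name trunc_block_unrolled is made up for illustration and is not part of the external. It relies on the fact that a C float-to-int cast truncates toward zero, and it assumes every sample fits in a 32-bit int.

```c
#include <stdint.h>

/* Hypothetical helper, not part of the external above.
   Truncates n samples toward zero, four per loop iteration.
   Assumes every sample fits in a 32-bit int; converting a larger
   value is undefined behaviour in C. */
void trunc_block_unrolled(const float *in, float *out, int n)
{
    int i = 0;

    /* Main body: four independent conversions per pass. */
    for (; i + 4 <= n; i += 4) {
        out[i]     = (float)(int32_t)in[i];
        out[i + 1] = (float)(int32_t)in[i + 1];
        out[i + 2] = (float)(int32_t)in[i + 2];
        out[i + 3] = (float)(int32_t)in[i + 3];
    }

    /* Tail: covers vector sizes that are not a multiple of four (e.g. 1). */
    for (; i < n; i++) {
        out[i] = (float)(int32_t)in[i];
    }
}
```

Whether this actually beats the simple loop depends on the compiler and the CPU, so it is worth measuring before keeping it.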

January 20, 2010 | 7:13 pm

Cool! Do you have any benchmarks you can share with us?


January 20, 2010 | 10:01 pm

And I thought nobody cared :)

I must confess that as a total hobbyist programmer, I have no idea how to produce a proper benchmark as such. My system consists of running a shitload of instances in a max patch with the signal vector size set to 1, and keeping an eye on the CPU usage. Hi-tech, I know.

The problem with [trunc~] (and all the stock objects) is that it doesn’t take advantage of SSE instructions. Truncating on the x87 FPU normally means switching the rounding mode in the FPU control register from rounding to truncation and back again, and there is a huge penalty for that unless you use some trickery. SSE3 brought in the fisttp instruction, which simplifies things greatly: it pops a float off the FPU stack and stores it as an integer with truncation, without having to touch the control register.

More importantly, there are errors in the perform routine from my first post (it clobbers ebx without saving it, for one). See the revised code below for the correct version!

__inline __declspec(naked) t_int *td_dot_cheaptrunc_tilde_perform(t_int *w) {
	__asm {
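			push	ebp			; save caller's frame pointer
			mov	ebp,	esp		; set up stack frame
			sub	esp,	4		; reserve 4 bytes for temp at [ebp-4]
			push	ebx			; ebx is callee-saved, so preserve it

			mov	eax,	[ebp+8]		; eax = w
			mov	ebx,	[eax+8]		; ebx = in (w[2])
			mov	edx,	[eax+12]	; edx = out (w[3])
			mov	ecx,	[eax+16]	; ecx = n (w[4])

		loopstart:
			fld	dword ptr [ebx]		; st(0) = *in
			fisttp	dword ptr [ebp-4]	; temp = st(0) truncated to int, pops st(0)
			add	ebx,	4		; in++
			fild	dword ptr [ebp-4]	; st(0) = temp as floating point
			fstp	dword ptr [edx]		; *out = st(0), pops st(0)
			add	edx,	4		; out++
			sub	ecx,	1		; n--
			jnz	loopstart		; loop until n == 0

			add	eax,	20		; return w + 5 (5 * 4 bytes)

			pop	ebx			; restore ebx
			add	esp,	4		; release temp
			pop	ebp			; restore caller's frame pointer
			ret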
	}
}
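
For comparison, here is roughly what the routine does, written as plain C. This is only a sketch: sketch_int is a stand-in for the SDK's t_int so that it compiles without the Max headers, the function name is hypothetical, and it assumes 32-bit float signal vectors as in the assembly above.

```c
#include <stdint.h>

/* Stand-in for the Max SDK's t_int (a pointer-sized integer).
   This typedef is an assumption so the sketch builds without the SDK. */
typedef intptr_t sketch_int;

/* Hypothetical C counterpart of the assembly perform routine.
   Slot layout as set up by the dsp_add() call in the first post:
   w[1] = object, w[2] = in, w[3] = out, w[4] = n. */
sketch_int *cheaptrunc_perform_sketch(sketch_int *w)
{
    const float *in  = (const float *)w[2];
    float       *out = (float *)w[3];
    int          n   = (int)w[4];

    while (n--) {
        /* A float-to-int cast in C truncates toward zero, like fisttp.
           Samples outside the 32-bit int range are undefined here. */
        *out++ = (float)(int32_t)*in++;
    }

    /* Step past slot 0 plus the four arguments, like "add eax, 20". */
    return w + 5;
}
```

A compiler targeting SSE will usually turn that cast into a single truncating convert instruction, so the C version is a sensible baseline to measure the assembly against.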

January 21, 2010 | 1:43 pm

looks cool…

out of curiosity: how did you learn assembly? could you suggest any books, online resources, … ?

thanks!
aa


January 21, 2010 | 6:22 pm

I learned by copying bits of old demo code. Trial and error, trial and error… I think I spent a good 3 or 4 months writing a ‘setpixel’ function and making it fast! trial and error, counting the reboots!

I can’t vouch for this book but it sounds right up my street – with it, the author, Jeff Duntemann, proposes to teach assembly as a first programming language. I think it’s a brilliant idea. Once you’ve been thrown in at the deep end, nothing will scare you anymore!
http://www.duntemann.com/assembly.htm

Aside from lots of random googling, there are some things that will definitely help you: a hexadecimal calculator, lots of pens and paper, and a good supply of coffee/tobacco/whateveryerpoison. I really like to smoke a few joints while I’m coding, I find it helps me to think laterally. Plus it makes you less angry when you crash your machine for the n^nth time.

Online resources that are *worth* reading are few and far between… A lot of tutorials are good for explaining what the registers are and how they work, but most of them are from the dark ages of MS-DOS. If you see any assembly code for x86 that makes lots of "int" calls, it’s probably 16-bit.

I was bashed a couple of days ago (in the nicest possible way!) for wasting my time with assembly. It’s true, most compilers can out-code me any day. But I like to feel what my CPU is doing. Plus my C code is UGLY. I know I couldn’t write it any faster in C, but a lot of people on this board probably could with ease.

BTW, if anyone wants proof that REAL programmers can still beat compilers, check out this guy:
www.azillionmonkeys.com/qed/asm.html

Paul Hsieh, I think his name is… Anyway, he’s incredible. The man must dream in binary :D

If you really want to get to know your computer, the first thing to do is to download the Pentium manual. It’s a good place to start learning the basic instructions. There really aren’t all that many, and after a while you get very familiar with the mnemonics. The x86 architecture makes a lot more sense to me than Motorola, that’s for sure! It won’t take long before you realise just how simple your CPU really is. It. is. a. dumb. machine. :)

After you’ve got a few little console programs working, try Agner Fog’s Pentium Optimisation guide. It’s hopelessly out of date now, but it gives you an idea of what you might try optimising and why. Obviously modern processors don’t respond well to most of these old tricks, but it gets you thinking in the right way, by making you aware of what kind of things processors don’t like. Cache misses and stalls are pretty much out of the realm of hobbyist asm programmers, so don’t even worry yourself about them. This is the area in which compilers generally win hands down. But if you try to optimise your algorithms rather than the micro-timings of your code, you’ll become a better programmer in every language.

http://www.cortstratton.org/articles/OptimizingForSSE.php
- This is brilliant. If you can get your head round this Matrix-Vector multiplication routine, many important assembly concepts will just ‘click’.

http://www.flatassembler.net/
- A lovely little assembler for DOS, Windows & Linux. It’s a bit more niche than the big ones (MASM etc) but it provides a great all-in-one IDE/compiler/linker and a nice model for writing complete little apps from scratch. The forum is extremely clued-up and friendly.

win32assembly.online.fr/
- Iczelion’s Win32 Assembly homepage is great. Loads of info about how to get a Windows app up and running.

One thing I can say with confidence is this: don’t bother with Randall Hyde’s "Art of Assembly Language". He is a good writer, and he encourages good programming practice, but if you follow his teachings you won’t actually learn any real asm. He has written a translator called HLA (High Level Assembly) which converts C-style code into asm, before running it through somebody else’s assembler. Cheeky monkey.

Hope some of this helps. I’m tired and rambling, so I’ll sign off now :)

Have fun!

