a cheap [trunc~] external for windows
yes! ...and they said i was mad... inline assembly pays off! for anyone running a pc with sse3 capability, you might like to try compiling the following code with vc++. it's based on the simplemsp~ example. this should give vastly-improved performance over the standard [trunc~].
obviously some loop unrolling could be done etc., and i've probably missed some obvious opportunities for optimisation (not to mention the fact that the object doesn't check whether it's enabled or not, it just runs...) but i get a massive speed boost on my shitty old laptop!
comments/criticisms welcomed and encouraged!
#include "ext.h"
#include "ext_obex.h"
#include "z_dsp.h"
typedef struct _td_dot_cheaptrunc_tilde {
    t_pxobject ob;
} t_td_dot_cheaptrunc_tilde;
void *td_dot_cheaptrunc_tilde_new(t_symbol *s, long argc, t_atom *argv);
void td_dot_cheaptrunc_tilde_free(t_td_dot_cheaptrunc_tilde *x);
void td_dot_cheaptrunc_tilde_dsp(t_td_dot_cheaptrunc_tilde *x, t_signal **sp, short *count);
t_int *td_dot_cheaptrunc_tilde_perform(t_int *w);
void *td_dot_cheaptrunc_tilde_class;
int main(void) {
    t_class *c;
    c = class_new("td.cheaptrunc~", (method)td_dot_cheaptrunc_tilde_new, (method)dsp_free, (long)sizeof(t_td_dot_cheaptrunc_tilde), 0L, A_GIMME, 0);
    class_addmethod(c, (method)td_dot_cheaptrunc_tilde_dsp, "dsp", A_CANT, 0);
    class_dspinit(c);
    class_register(CLASS_BOX, c);
    td_dot_cheaptrunc_tilde_class = c;
    return 0;
}
void td_dot_cheaptrunc_tilde_dsp(t_td_dot_cheaptrunc_tilde *x, t_signal **sp, short *count) {
    dsp_add(td_dot_cheaptrunc_tilde_perform, 4, x, sp[0]->s_vec, sp[1]->s_vec, sp[0]->s_n);
}
__inline __declspec(naked) t_int *td_dot_cheaptrunc_tilde_perform(t_int *w) {
    __asm {
        push ebp ; save caller's frame pointer
        mov ebp, esp ; set up stack frame
        mov eax, [ebp+8] ; eax = w
        sub esp, 4 ; reserve space for temp
        mov ebx, [eax+8] ; ebx = in (w[2]); ebx is callee-saved but is not preserved here
        mov edx, [eax+12] ; edx = out (w[3])
        mov ecx, [eax+16] ; ecx = n (w[4])
    loopstart:
        fld dword ptr [ebx] ; st(0) = *in
        fisttp dword ptr [ebp-4] ; temp = (int) st(0), truncated; pops st(0)
        add ebx, 4 ; in++
        fild dword ptr [ebp-4] ; st(0) = (float) temp
        fstp dword ptr [edx] ; *out = st(0); pops st(0)
        add edx, 4 ; out++
        dec ecx ; decrement loop counter
        jnz loopstart ; loop until all n samples are done
        add eax, 20 ; return value = w + 5
        mov esp, ebp ; tidy up stack
        pop ebp
        ret
    }
}
void *td_dot_cheaptrunc_tilde_new(t_symbol *s, long argc, t_atom *argv) {
    t_td_dot_cheaptrunc_tilde *x = NULL;
    /* extra parentheses: the assignment inside the condition is intentional */
    if ((x = (t_td_dot_cheaptrunc_tilde *)object_alloc(td_dot_cheaptrunc_tilde_class))) {
        dsp_setup((t_pxobject *)x, 1);
        outlet_new(x, "signal");
    }
    return (x);
}
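For anyone who doesn't read asm, here is roughly what the perform loop computes, written as plain C. This is only a reference sketch and not part of the external: the name cheaptrunc_reference is made up, and it assumes every sample fits in a 32-bit int (outside that range the C cast is undefined, whereas fisttp stores 0x80000000).

```c
#include <stddef.h>

/* Hypothetical portable reference for the asm loop: truncate each
 * sample toward zero and write it back as a float.
 * Assumption: every sample fits in a 32-bit int. */
void cheaptrunc_reference(const float *in, float *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        int whole = (int)in[i];   /* float -> int conversion truncates toward zero */
        out[i] = (float)whole;    /* int -> float, like the fild/fstp pair */
    }
}
```

One difference worth knowing: the asm loop tests its counter at the bottom, so it always processes at least one sample, while this version does nothing when n is 0.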
Cool! Do you have any benchmarks you can share with us?
And I thought nobody cared :)
I must confess that as a total hobbyist programmer, I have no idea how to produce a proper benchmark as such. My system consists of running a shitload of instances in a max patch with the signal vector size set to 1, and keeping an eye on the CPU usage. Hi-tech, I know.
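If someone wants numbers that are a bit less hand-wavy than watching the CPU meter, something like the following could be a starting point. It's a rough sketch outside of Max: seconds_per_block and trunc_cast_block are names invented for this example, the routine under test is just a stand-in cast loop, and clock() is coarse, so you'd need a lot of repeats for the result to mean anything.

```c
#include <stddef.h>
#include <time.h>

/* Signature assumed for any block routine we want to time. */
typedef void (*block_fn)(const float *in, float *out, size_t n);

/* Stand-in routine under test (hypothetical): cast-based truncation. */
void trunc_cast_block(const float *in, float *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = (float)(int)in[i];
}

/* Average CPU seconds per call of fn, measured over `repeats` calls.
 * Returns 0.0 if repeats is 0. */
double seconds_per_block(block_fn fn, const float *in, float *out,
                         size_t n, unsigned long repeats)
{
    unsigned long r;
    clock_t start;

    if (repeats == 0)
        return 0.0;

    start = clock();
    for (r = 0; r < repeats; r++)
        fn(in, out, n);

    return (double)(clock() - start) / CLOCKS_PER_SEC / (double)repeats;
}
```

The idea would be to call seconds_per_block once per routine with the same buffers (say 64 samples and a few million repeats) and compare the two figures. An optimising compiler may rearrange a loop this simple, so treat the numbers as a rough guide only.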
The problem with [trunc~] (and all the stock objects) is that it doesn't take advantage of SSE instructions. On the x87 FPU, a float-to-int store rounds according to the control word, so truncating normally means switching the control word to round-toward-zero and back again. There is a huge penalty in doing that, unless you use some trickery. SSE3 brought in the fisttp instruction, simplifying things greatly: it stores st(0) as a truncated integer and pops it off the FPU stack, without having to touch the control word.
More importantly, there are errors in the code in my first post. Most notably it overwrote ebx without saving it, and the caller expects that register to be preserved. See the revised code below for the correct version!
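__inline __declspec(naked) t_int *td_dot_cheaptrunc_tilde_perform(t_int *w) {
    __asm {
        push ebp ; save caller's frame pointer
        mov ebp, esp ; set up stack frame
        sub esp, 4 ; reserve temp at [ebp-4]
        push ebx ; ebx is callee-saved, so preserve it
        mov eax, [ebp+8] ; eax = w
        mov ebx, [eax+8] ; ebx = in (w[2])
        mov edx, [eax+12] ; edx = out (w[3])
        mov ecx, [eax+16] ; ecx = n (w[4])
    loopstart:
        fld dword ptr [ebx] ; st(0) = *in
        fisttp dword ptr [ebp-4] ; temp = (int) st(0), truncated; pops st(0)
        add ebx, 4 ; in++
        fild dword ptr [ebp-4] ; st(0) = (float) temp
        fstp dword ptr [edx] ; *out = st(0); pops st(0)
        add edx, 4 ; out++
        sub ecx, 1 ; decrement loop counter
        jnz loopstart ; loop until all n samples are done
        add eax, 20 ; return value = w + 5
        pop ebx ; restore ebx
        add esp, 4 ; release temp
        pop ebp
        ret
    }
}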
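For comparison, the same truncation can also be sketched with SSE2 intrinsics instead of x87 inline asm. To be clear, this is a different technique from the fisttp routine above: it uses the packed truncating conversion (cvttps2dq, via _mm_cvttps_epi32) to handle four samples at a time, and it also never touches the FPU control word. The function name trunc_block_sse2 and the HAVE_SSE2_TRUNC macro are made up for this example, unaligned loads are used because nothing here guarantees the alignment of the signal vectors, and samples are assumed to fit in a 32-bit int. I haven't measured it against the asm version.

```c
#include <stddef.h>

/* Use SSE2 where the compiler says it is available, otherwise fall back
 * to the scalar loop so the sketch still builds elsewhere. */
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2_TRUNC 1
#else
#define HAVE_SSE2_TRUNC 0
#endif

/* Hypothetical block truncation: out[i] = (float)(int)in[i]. */
void trunc_block_sse2(const float *in, float *out, size_t n)
{
    size_t i = 0;

#if HAVE_SSE2_TRUNC
    /* Four samples per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128  v = _mm_loadu_ps(in + i);    /* load 4 floats, any alignment */
        __m128i t = _mm_cvttps_epi32(v);     /* float -> int32, truncating */
        _mm_storeu_ps(out + i, _mm_cvtepi32_ps(t)); /* int32 -> float, store */
    }
#endif

    /* Scalar tail (or the whole block when SSE2 is not available). */
    for (; i < n; i++)
        out[i] = (float)(int)in[i];
}
```

Since MSP signal vectors are normally a multiple of four samples long, the scalar tail would rarely run inside a perform routine, but it keeps the function correct for any n.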
looks cool...
out of curiosity: how did you learn assembly? could you suggest me any books, online resources, ... ?
thanks!
aa
I learned by copying bits of old demo code. Trial and error, trial and error... I think I spent a good 3 or 4 months writing a 'setpixel' function and making it fast! trial and error, counting the reboots!
I can't vouch for this book but it sounds right up my street - with it, the author, Jeff Duntemann, proposes to teach assembly as a first programming language. I think it's a brilliant idea. Once you've been thrown in at the deep end, nothing will scare you anymore!
http://www [dot] duntemann [dot] com/assembly [dot] htm
Aside from lots of random googling, there are some things that will definitely help you: a hexadecimal calculator, lots of pens and paper, and a good supply of coffee/tobacco/whateveryerpoison. I really like to smoke a few joints while I'm coding, I find it helps me to think laterally. Plus it makes you less angry when you crash your machine for the n^nth time.
Online resources that are *worth* reading are few and far between... A lot of tutorials are good for explaining what the registers are and how they work, but most of them are from the dark ages of MS-DOS. If you see any assembly code for x86 that makes lots of "int" calls, it's probably 16-bit.
I was bashed a couple of days ago (in the nicest possible way!) for wasting my time with assembly. It's true, most compilers can out-code me any day. But I like to feel what my CPU is doing. Plus my C code is UGLY. I know I couldn't write it any faster in C, but a lot of people on this board probably could with ease.
BTW, if anyone wants proof that REAL programmers can still beat compilers, check out this guy:
www [dot] azillionmonkeys [dot] com/qed/asm [dot] html
Paul Hsieh, I think his name is... Anyway, he's incredible. The man must dream in binary :D
If you really want to get to know your computer, the first thing to do is to download the Pentium manual. It's a good place to start learning the basic instructions. There really aren't all that many, and after a while you get very familiar with the mnemonics. The x86 architecture makes a lot more sense to me than Motorola, that's for sure! It won't take long before you realise just how simple your CPU really is. It. is. a. dumb. machine. :)
After you've got a few little console programs working, try Agner Fog's Pentium Optimisation guide. It's hopelessly out of date by now, but it gives you an idea of what you might try optimising and why. Obviously modern processors don't respond well to most of these old tricks, but it gets you thinking in the right way, by making you aware of what kind of things processors don't like. Cache misses and stalls are pretty much out of the realm of hobbyist asm programmers, so don't even worry yourself about them. This is the area in which compilers generally win hands down. But if you try to optimise your algorithms rather than the micro-timings of your code, you'll become a better programmer in every language.
http://www [dot] cortstratton [dot] org/articles/OptimizingForSSE [dot] php
- This is brilliant. If you can get your head round this Matrix-Vector multiplication routine, many important assembly concepts will just 'click'.
http://www [dot] flatassembler [dot] net/
- A lovely little assembler for dos, windows & linux. It's a bit more niche than the big ones (MASM etc) but it provides a great all-in-one IDE/compiler/linker and a nice model for writing complete little apps from scratch. The forum is extremely clued-up and friendly.
win32assembly [dot] online [dot] fr/
- Iczelion's Win32 Assembly homepage is great. Loads of info about how to get a Windows app up and running
One thing I can say with confidence is this: don't bother with Randall Hyde's "Art of Assembly Language". He is a good writer, and he encourages good programming practice, but if you follow his teachings you won't actually learn any real asm. He has written a translator called HLA (High Level Assembly) which converts C-style code into asm, before running it through somebody else's assembler. Cheeky monkey.
Hope some of this helps. I'm tired and rambling, so I'll sign off now :)
Have fun!