Jun 29 2007 | 11:00 am

I'm experimenting with some pattern matching/clustering/sorting stuff, and for some of it I've been using Euclidean distance. I'm wondering what the best approach is when dealing with different dimensionalities. That is, when 2 arrays are different lengths, is it better to pad the shorter one (presumably with 0.0s), or should I find some way of truncating the longer one? Truncating seems like it would be somewhat arbitrary in removing information from the longer array, but padding the shorter one is also kind of arbitrary, in a way... If I think of it just in terms of 2D and 3D spaces, then padding seems reasonable, as it would be like imagining the 2D point to be at 0.0 on the z-axis of a 3D space, which is certainly arbitrary, but not particularly disagreeable.

Any thoughts? Or is there some better overall method I should be considering, in cases where the arrays to be compared are of different lengths?

Thanks in advance.

J.

- Jun 29 2007 | 11:41 am
- Jun 29 2007 | 3:16 pm

Actually, if anyone has anything more to add, I'd still appreciate any thoughts.

My question before was really whether it was a better approach to "pad" the lower-dimensional array, or to truncate the higher-dimensional array. I understand that the value of any padding would be arbitrary; what's still not clear is whether truncating, or some form of dimension reduction, would be a better approach. The main reason I ask is that, in experimenting, the two give really quite different results. Thinking about it now, I actually feel inclined to truncate dimensions on the higher-dimensional array, since, if I again use 2D and 3D spaces as an example, it seems to make more sense to reduce a 3D point to its 2D projection than to give an arbitrary z position to a 2D point. Yes? No?

(I'd imagine that, for the geometrically-inclined, this is a bit like understanding why the first black key above C is sometimes a C# and sometimes a Db...)

thanks in advance for any further thoughts,

J.

- Jun 29 2007 | 4:46 pm

It really depends on how your dimensions are organized. That is, is this a case of x,y or more a case of dimension 1 = x, dimension 2 = y? Are they on the same scale? (and what are you using distance for?)

There are many different ways of calculating distance. The problem with Euclidean distance is that it doesn't take scale into account. If x has a min and max range of 0, 10 and y has a min and max range of 1, 2, moving all the way from the maximum to the minimum will be a significantly greater distance for x than if you were to do the same in the y dimension.

Mahalanobis distance does take this into account, but will take more math chops, as you'll need to calculate a covariance matrix. Here are some links with info:

http://en.wikipedia.org/wiki/Mahalanobis_distance
http://en.wikipedia.org/wiki/Covariance

Also, if you don't want to code it in Max (which I think would be a very good idea) you could use mxj Java code to do it. Here's a link to a page with some Java statistical objects that will calculate covariance matrices as well as Mahalanobis distances.

http://www.mhsatman.com/downloads.htm

I would not recommend padding your values because it's only going to skew your data. If you have full data for only two dimensions, then I'd use those two dimensions.

From my limited experience with Music Information Retrieval: in my project I took an array with 178 variables per entry and used Principal Component Analysis to figure out which were the most important elements in terms of usefulness as classifiers, and then calculated Mahalanobis distance using the 8 most important dimensions. The catch with 8 dimensions is that there's no good way to visualize it all at once, but it definitely worked well. This was in MATLAB, though.

If you'd like some more background information on Music Information Retrieval, the course notes for the class I took, taught by Juan Bello, are here:
http://homepages.nyu.edu/~jb2843/Teaching.html

Actually, if any coders would be interested in writing a Matlab to Max object, that'd be pretty cool...

Peter McCulloch
www.petermcculloch.com

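To make the scale problem concrete, here is a minimal Java sketch using the ranges from the post above (x in 0..10, y in 1..2). It is not Mahalanobis distance: it only divides each dimension's difference by that dimension's range and ignores any correlation between dimensions, so treat it as a simpler stand-in. The class and method names (`ScaledDistance`, `rangeScaled`) are made up for illustration.

```java
final class ScaledDistance {
    /** Plain Euclidean distance between two equal-length vectors. */
    static double euclidean(double[] a, double[] b) {
        if (a.length != b.length)
            throw new IllegalArgumentException("length mismatch");
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Euclidean distance after dividing each difference by that dimension's range. */
    static double rangeScaled(double[] a, double[] b, double[] min, double[] max) {
        if (a.length != b.length || a.length != min.length || a.length != max.length)
            throw new IllegalArgumentException("length mismatch");
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double range = max[i] - min[i];
            if (range == 0.0) continue; // a constant dimension cannot separate anything
            double d = (a[i] - b[i]) / range;
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] min = {0.0, 1.0};
        double[] max = {10.0, 2.0};
        double[] start = {0.0, 1.0};
        double[] fullX = {10.0, 1.0}; // full sweep of x only
        double[] fullY = {0.0, 2.0};  // full sweep of y only

        System.out.println(euclidean(start, fullX));             // 10.0
        System.out.println(euclidean(start, fullY));             // 1.0
        System.out.println(rangeScaled(start, fullX, min, max)); // 1.0
        System.out.println(rangeScaled(start, fullY, min, max)); // 1.0
    }
}
```

With the raw distance, a full sweep of x counts ten times as much as a full sweep of y; after scaling, both count the same.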
- Jun 29 2007 | 5:55 pm

Quote: peter.mcculloch@gmail.com wrote on Fri, 29 June 2007 17:46
----------------------------------------------------
> It really depends on how your dimensions are organized. That is, is
> this a case of x,y or more a case of dimension 1 = x, dimension 2 = y?
> Are they on the same scale? (and what are you using distance for?)
>
> There's many different ways of calculating distance. The problem with
> Euclidean distance is that it doesn't take scale into account. If x
> has a min and max range of 0, 10 and y has a min and max range of 1,2
> moving all the way from the maximum to the minimum will be a
> significantly greater distance for x than if you were to do the same in
> the y dimension.
>
> Mahalanobis distance does take this into account, but will take more
> math chops, as you'll need to calculate a covariance matrix. Here's
> some links with info:
>
> http://en.wikipedia.org/wiki/Mahalanobis_distance
> http://en.wikipedia.org/wiki/Covariance
>
> Also, if you don't want to code it in Max (which I think would be a
> very good idea) you could use mxj java code to do it. Here's a link to
> a page with some Java statistical objects that will calculate
> covariance matrices as well as Mahalanobis distances.
>
> http://www.mhsatman.com/downloads.htm
>
> I would not recommend padding your values because it's only going to
> skew your data. If you have full data for only two dimensions, then
> I'd use those two dimensions.
>
> From my limited experience with Music Information Retrieval, in my
> project I took an array with 178 variables per entry and used Principal
> Component Analysis to figure out which were the most important elements
> in terms of usefulness as classifiers, and then calculated Mahalanobis
> distance using the 8 most important dimensions. The catch with 8
> dimensions being that there's no good way to visualize it all at once,
> but it definitely worked well. This was in MATLAB, though.
>
> If you'd like some more background information on Music Information
> Retrieval, the course notes for the class I took taught by Juan Bello
> are here:
> http://homepages.nyu.edu/~jb2843/Teaching.html
>
> Actually, if any coders would be interested in writing a Matlab to Max
> object, that'd be pretty cool...
>
>
> Peter McCulloch
> www.petermcculloch.com
>
----------------------------------------------------

Thanks, Peter. This is great info!

I've been working for a while now on a recombinance-based composition system, sort of quasi-Cope, but geared more toward realtime, interactive composition. I'm finding that my model has a strong tendency to replicate the source works in the database, rather than generate variations, so I'm re-thinking some of the basic stuff for selecting and combining the linear materials - motives, themes, and so on. I want a system that can do an okay job on its own (actually, if you know Cope's SPEAC stuff, I want it to settle into a sort of eternal "E", Extension, if left to its own devices), but which really benefits from being "steered" through the musical form by the user. So, what I imagined doing in my latest design was to create a somewhat smooth space containing all the linear material (I've parsed everything in the database into melodic "chunks" called VoiceSegments) in which I can move by step or leap "away from" the original setting, and maintain a somewhat predictable degree of continuity (or discontinuity) with the original. The idea is that if VoiceSegment 100 is the original for a given setting, then I could use 99 or 101 and get a closely-related alternative VoiceSegment, whereas 50 would only show a distant connection to the original, if any at all (and 150 would also be distant, though in a different way). So, basically I'm trying to sort my VoiceSegments according to similarity.
I have a sinking feeling that a SOM is going to be the best way to do this, but I'm really fuzzy on how to build a SOM in Java (basically *all* of this is in 2 mxj objects, with a good number of classes loaded by each), and as I understand it, while SOMs are good at revealing similarities, they tend to have rather abrupt boundaries in the way they group input, and thus won't necessarily offer the smoothest transitions through *all* the material. If I'm wrong on this, and SOMs sound like the best approach to you, let me know! ;-)

Anyway, I've narrowed my attributes down to 9, the first two of which are a pitch list and an ED list (delta times). I'm trying to use Euclidean distance to find the "proximity" of two pitch lists, or ED lists. In combination with the other 7 attributes, I'm hoping this will give me enough info to do a reasonable sort of all my melodic material. That's a really broad-stroke description, but you probably get the idea...

thanks,

J.

- Jun 29 2007 | 7:55 pm

Hi J.,

There's been some work that might be helpful for you. It's MATLAB code, but might give some ideas about analysis.
http://www.jyu.fi/musica/miditoolbox/

Since you are working in Java, and you are working with musical data, I can't recommend purchasing JMSL highly enough, if you haven't already. These phrases could be very easily stored in musicShapes and played very easily. I'm working on this type of stuff in JMSL using a MySQL database, so let me know if you purchase it and I can send you some analysis code for musicShapes.

Similarity cuts across a lot of dimensions; things can be similar in terms of rhythm, pitch, contour, density, register, dynamic, articulation, etc. By having different types of similarity, you should get significantly more interesting output from the system. For instance, find a phrase that is similar in terms of rhythm, pitch, and contour, but not register. PCA will help you pick the most unique parameters.

I would consider looking at statistical properties in addition to your sequential approach; these will be particularly effective in finding patterns that have similar content but dissimilar ordering (e.g. an arpeggio up vs an arpeggio down). Chroma vectors could be very effective, as they're octave-equivalent and easily transposable/invertible.

For instance, a chromatic scale of quarter notes from C to E followed by a half note B and then a dotted half-note on C would yield a chroma vector of
4 1 1 1 1 0 0 0 0 0 0 2

Some statistical properties that might be interesting to look at:

mean, variance, (standard deviation around mean, standard deviation around median, kurtosis, skew) for:
pitch
onset times
release times
duration
velocity

Also, properties such as (number of unique contour values / number of notes) can be interesting. A repeated arpeggio 60 71 63 60 71 63 will have a ratio of 0.5 ( count(1 2 3) --> 3 / 6 ) whereas 60 64 63 67 66 65 will have a ratio of 1 (count(1 3 2 6 5 4) --> 6 / 6). More repetitions will drive the ratio even lower.

The other great advantage of statistical time-invariant properties is that they make your search stage significantly faster, since you're just comparing single pre-derived numbers.

Peter McCulloch
www.petermcculloch.com

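The chroma vector example above can be sketched in a few lines of Java. This is only an illustration under some assumptions: pitches are MIDI note numbers, durations are counted in quarter notes, and the class name `ChromaSketch` is made up here rather than taken from JMSL or any other library.

```java
import java.util.Arrays;

final class ChromaSketch {
    /** Sums durations per pitch class: index 0 = C, 1 = C#, ... 11 = B. */
    static double[] chroma(int[] midiPitches, double[] durations) {
        if (midiPitches.length != durations.length)
            throw new IllegalArgumentException("pitch and duration lists differ in length");
        double[] bins = new double[12];
        for (int i = 0; i < midiPitches.length; i++) {
            // floorMod folds every octave onto the same 12 bins
            bins[Math.floorMod(midiPitches[i], 12)] += durations[i];
        }
        return bins;
    }

    public static void main(String[] args) {
        // C C# D D# E as quarter notes, then B as a half note, then C as a dotted half
        int[] pitches = {60, 61, 62, 63, 64, 71, 72};
        double[] quarters = {1, 1, 1, 1, 1, 2, 3};
        System.out.println(Arrays.toString(chroma(pitches, quarters)));
        // [4.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0]
    }
}
```

Because the durations are doubles, less "square" rhythms (triplets, tied notes) drop in without any change to the method.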
- Jun 29 2007 | 8:24 pm

The CNMAT Matlab Object:
http://www.cnmat.berkeley.edu/MAX/downloads/files/OSX-CFM/matlabcommunicate_1.1.2.sitb

--
barry threw
Media Art and Technology
http://www.barrythrew.com
me(at)barrythrew(dot)com
857-544-3967

And I know not if, save in this, such gift be allowed to man,
That out of three sounds he frame, not a fourth sound, but a star.
-Robert Browning

- Jun 29 2007 | 8:34 pm

You know, I looked at JMSL a while ago, but I was only looking for notation stuff at the time, so I didn't make much use of it. As it stands, I've built all of this stuff very much along Cope's ideas, laid out in "Computer Models of Musical Creativity". So it's not too practical for me to totally switch gears into JMSL's objects. My VoiceSegment object is a Java class, with quite a bit of detailed data, and references to many elements of the musical structure of the analysed material, so I can't see myself dropping the model any time soon. However, I will definitely look into some of the properties you mention, as they will probably help me generate greater variety, with appropriate restrictions. Cope's approach is tied in with voice leading on its lowest levels, so my data structure also works on this basic foundation. And in this regard, my current code actually works very well.
I can add a number of different works to the database, and the system will navigate convincing transitions from one source work to another. The problem is that all of the *vertical* material is always from the same source work, which means I'm basically only getting a sort of medley out of it... which is obviously not what I'm after! ;-)

I've been working with similarity measured by distance between pitch vectors, rhythm vectors (ED lists), total duration (in ticks), periodicity, kinesis, interval sum and mean interval, tessitura, and cardinality. It seems to me that, between those properties, I should be able to get a pretty decent measure of similarity. The one thing that I feel is missing is some sense of *where* the rhythmic activity is focused... I'm not sure how to express that, but perhaps your suggestions will point me in the right direction.

You've given me a lot to look into, so I'll chew on this for a couple of days, then get back to you.

Thanks again,

J.

- Jun 29 2007 | 8:40 pm

I should mention, just to be clear, that the limitations in my code not drawing on enough variety in the source works are due to problems in my own design, not anything in Cope. It's part of the hybrid/realtime aspect of my model, which is admittedly still in its early stages. I approached certain things very differently than Cope, and this is where my system is getting into some trouble - a realtime system being glued into an essentially non-realtime system... I'll work it out, though. Eventually.

J.

- Jun 29 2007 | 9:01 pm

Typically when you embed a lower dimensional space in a higher one the padded values are 0. For 2D, this would be a plane going through the origin of a 3D space in the XY plane. If you were to truncate, you would be orthogonally projecting 3D points into the XY plane.

wes

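The two readings above (zero-padding as embedding in the XY plane, truncation as orthogonal projection onto it) give different numbers, which can be seen in a short Java sketch. The class and method names are illustrative only, and the points are arbitrary values chosen so the results come out exact.

```java
final class PadOrTruncate {
    /** Pads the shorter vector with 0.0 (embeds it in the larger space). */
    static double paddedDistance(double[] a, double[] b) {
        int n = Math.max(a.length, b.length);
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            double x = i < a.length ? a[i] : 0.0;
            double y = i < b.length ? b[i] : 0.0;
            sum += (x - y) * (x - y);
        }
        return Math.sqrt(sum);
    }

    /** Drops the extra trailing dimensions (projects onto the shared axes). */
    static double truncatedDistance(double[] a, double[] b) {
        int n = Math.min(a.length, b.length);
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] p2 = {0.0, 0.0};
        double[] p3 = {3.0, 4.0, 12.0};
        System.out.println(paddedDistance(p2, p3));    // 13.0
        System.out.println(truncatedDistance(p2, p3)); // 5.0
    }
}
```

Padding keeps the z value of the longer vector in the result, while truncation discards it, so a point directly "above" the shorter one comes out at distance zero under truncation.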
- Jun 30 2007 | 5:05 am
- Jun 30 2007 | 11:02 am

Okay, I've had some time to look over this message in greater detail...

> Chroma vectors could be very
> effective, as they're octave-equivalent and easily
> transposable/invertible.
>
> For instance, a chromatic scale of quarter notes from C to E followed
> by a half note B and then a dotted half-note on C would yield a chroma
> vector of
> 4 1 1 1 1 0 0 0 0 0 0 2

Yes, this could be handy. I'll implement a chroma vector method in my VoiceSegment class - should be pretty easy, and could provide some valuable info! The one thing I'm not sure about is how to deal with less 'square' durations - I'm assuming there's nothing wrong with using doubles for the actual values, and thus representing duration more precisely?

> Some statistical properties that might be interesting to look at:
>
> mean, variance, (standard deviation around mean, standard deviation
> around median, kurtosis, skew) for:
> pitch
> onset times
> release times
> duration
> velocity
>

I've found some open source Java code for getting statistical info from double[]s, so I'll implement some of the above properties right away. I've actually previously done some work with these in Max, using the free versions of the Litter objects (is it lp.stacey...??), which I found quite effective, though I think I'll get more use from them now, as the overall design of my app provides a better context for taking advantage of the info.

> Also, properties such as (number of unique contour values / number of
> notes) can be interesting. A repeated arpeggio 60 71 63 60 71 63 will
> have a ratio of 0.5 ( count(1 2 3) --> 3 / 6 ) whereas 60 64 63 67 66
> 65 will have a ratio of 1 (count(1 3 2 6 5 4) --> 6 / 6). More
> repetitions will drive the ratio even lower.
>

I'm guessing contour values are equivalent to melodic intervals...? Yes, this could be an interesting value - I'll play around with that. Does this value already have a name?

> The other great advantage of statistical time-invariant properties is
> that they make your search stage significantly faster, since you're
> just comparing single pre-derived numbers.
>

Yes, I've done quite a bit of analysis on my data (parsed MIDI files) already, in order to build my current music database. However, the values I was after were really directly related to Cope's SPEAC system, so I'll look further into your suggestions.

One thing I've found very useful, however, in using Euclidean distance as a method for sorting musical fragments is that it's easy to "bias" the values during playback. Basically, given an array of values indicating the distances of various properties, I create an array of biasing factors, and multiply the two, then re-sort the list of fragments. This makes it easy to bias the sorting toward (or away from) a given property. I've used this quite effectively in my current model, but I'm using it in a rather limited way, at the moment. Of course, it's good to limit the number of fragments in some way (a threshold, or classes of fragments), so as to avoid huge loops through the biasing function (and huge sorts afterward).

One thing I'm being very careful to avoid, however, is a sort of "parameter explosion", whereby realtime control (or even non-realtime control) becomes awkward as a result of accumulating too many parameters. So that's my only concern with throwing too many analysis attributes into the system. I'd rather let the recombinant aspect of my program handle as many of the details as possible, while I can concern myself more with tweaking a few high-level features, just to guide the logic that's already in place in a slightly different direction.

Thanks for all the suggestions!

J.

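The biasing idea described above might look something like the following Java sketch. The post does not say how the biased per-property distances are collapsed into a single sort key, so summing them is an assumption made here, and `BiasedSort`, `biasedScore` and `rank` are hypothetical names rather than anything from the actual system.

```java
import java.util.Arrays;
import java.util.Comparator;

final class BiasedSort {
    /** Assumed combination rule: sum of (property distance * bias factor). */
    static double biasedScore(double[] propertyDistances, double[] bias) {
        if (propertyDistances.length != bias.length)
            throw new IllegalArgumentException("one bias factor per property is required");
        double sum = 0.0;
        for (int i = 0; i < bias.length; i++) {
            sum += propertyDistances[i] * bias[i];
        }
        return sum;
    }

    /** Returns fragment indices ordered from closest to farthest under the bias. */
    static Integer[] rank(double[][] distances, double[] bias) {
        Integer[] order = new Integer[distances.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order,
                Comparator.comparingDouble((Integer i) -> biasedScore(distances[i], bias)));
        return order;
    }

    public static void main(String[] args) {
        // rows = fragments, columns = {pitch distance, rhythm distance} to a reference
        double[][] d = {{1.0, 5.0}, {4.0, 1.0}, {2.0, 2.0}};
        System.out.println(Arrays.toString(rank(d, new double[] {1.0, 1.0}))); // [2, 1, 0]
        System.out.println(Arrays.toString(rank(d, new double[] {1.0, 0.0}))); // [0, 2, 1]
        System.out.println(Arrays.toString(rank(d, new double[] {0.0, 1.0}))); // [1, 2, 0]
    }
}
```

Setting a bias factor to 0 removes that property from the ordering entirely, and a factor above 1 makes it dominate, which is one way to steer the sort toward or away from a property without adding new parameters per fragment.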
- Jun 30 2007 | 4:15 pm

> Yes, this could be handy. I'll implement a chroma vector method in my
> VoiceSegment class - should be pretty easy, and could provide some
> valuable info! The one thing I'm not sure about is how to deal with
> less 'square' durations - I'm assuming there's nothing wrong with
> using doubles for the actual values, and thus representing duration
> more precisely?
>

Absolutely. I forgot to mention that you'll want to normalize the values so that they sum to 1.

> I've found some open source java code for getting statistical info
> from double[]s, so I'll implement some of the above properties right
> away. I've actually previously done some work with these in Max, using
> the free versions of the Litter objects (is it lp.stacey...??), which
> I found quite effective, though I think I'll get more use from them
> now, as the overall design of my app provides a better context for
> taking advantage of the info.
>

If you wouldn't mind sending links for these on/off-list, I'd be appreciative, as they may be helpful for what I'm working on.

>> Also, properties such as (number of unique contour values / number of
>> notes) can be interesting. A repeated arpeggio 60 71 63 60 71 63 will
>> have a ratio of 0.5 ( count(1 2 3) --> 3 / 6 ) whereas 60 64 63 67 66
>> 65 will have a ratio of 1 (count(1 3 2 6 5 4) --> 6 / 6). More
>> repetitions will drive the ratio even lower.
>>
>
> I'm guessing contour values are equivalent to melodic intervals...?
> Yes, this could be an interesting value - I'll play around with that.
> Does this value already have a name?

Now that I think of it, there's no need to find the contour to find that value; you could just count the number of distinct values in the set / number of values. It's just a measure of pitch uniqueness. Contour values have the nice property of being non-specific to pitch or interval content in capturing a rough shape. For instance, they're great for finding the equivalence between "60 63 62 64" and "53 63 61 71", as they'd both have the same contour of "1 3 2 4". These two phrases are similar when you play them, but chroma and pitch-sets will be less effective in finding that relationship.

PCA will throw away the parameters that are redundant to other parameters and leave you with the N most important ones. From there on out, you can analyze for just those parameters (both in storage and retrieval), so it's more efficient than it looks. Sometimes the ones you wouldn't expect end up being good discriminants because other more likely candidates were redundant. (though I definitely understand not wanting to mess with all of this either) You can always start a low priority thread in Java to do the less important analysis, so that you'll have it if you want it. There's an object out there called threadPool that works pretty decently for managing tasks like this.

The biasing is a good idea. I haven't done it with Mahalanobis, but I think it's possible, and would have the advantage of already being normalized before biasing. (so biasing factor X by an extra 0.2 will have the same effect as doing the same thing to factor Y)

To further optimize, you could use K-means clustering to pre-classify your segments once you've found a good set of classifiers. By divvying them into groups in advance, you'll save a lot of comparisons. (though it sounds like you're already doing something similar)

Sounds like a cool project, and I hope some of this is helpful.

Peter McCulloch

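Both the contour values and the pitch-uniqueness ratio from this exchange are easy to compute. The Java sketch below treats contour as the 1-based rank of each pitch among the distinct pitches of the phrase, which matches the examples given in the thread; whether that is the standard definition is left open here, and the names `ContourSketch`, `contour` and `uniqueness` are made up for illustration.

```java
import java.util.Arrays;

final class ContourSketch {
    /** 1-based rank of each pitch among the distinct pitches of the phrase. */
    static int[] contour(int[] pitches) {
        int[] distinct = Arrays.stream(pitches).distinct().sorted().toArray();
        int[] ranks = new int[pitches.length];
        for (int i = 0; i < pitches.length; i++) {
            // every pitch is present in 'distinct', so the search always succeeds
            ranks[i] = Arrays.binarySearch(distinct, pitches[i]) + 1;
        }
        return ranks;
    }

    /** Number of distinct pitches divided by number of notes. */
    static double uniqueness(int[] pitches) {
        if (pitches.length == 0)
            throw new IllegalArgumentException("empty phrase");
        return (double) Arrays.stream(pitches).distinct().count() / pitches.length;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(contour(new int[] {60, 63, 62, 64}))); // [1, 3, 2, 4]
        System.out.println(Arrays.toString(contour(new int[] {53, 63, 61, 71}))); // [1, 3, 2, 4]
        System.out.println(uniqueness(new int[] {60, 71, 63, 60, 71, 63}));       // 0.5
        System.out.println(uniqueness(new int[] {60, 64, 63, 67, 66, 65}));       // 1.0
    }
}
```

Since both example phrases reduce to the same rank sequence, comparing contours with a plain array equality check (or a distance on the rank vectors) picks up the shape match that chroma and pitch-set comparisons miss.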
- Jun 30 2007 | 10:00 pm

Quote: peter.mcculloch@gmail.com wrote on Sat, 30 June 2007 17:15
----------------------------------------------------
> > Yes, this could be handy. I'll implement a chroma vector method in my
> > VoiceSegment class - should be pretty easy, and could provide some
> > valuable info! The one thing I'm not sure about is how to deal with
> > less 'square' durations - I'm assuming there's nothing wrong with
> > using doubles for the actual values, and thus representing duration
> > more precisely?
> >
>
> Absolutely. I forgot to mention that you'll want to normalize the
> values so that they sum to 1.
>
> > I've found some open source java code for getting statistical info
> > from double[]s, so I'll implement some of the above properties right
> > away. I've actually previously done some work with these in Max, using
> > the free versions of the Litter objects (is it lp.stacey...??), which
> > I found quite effective, though I think I'll get more use from them
> > now, as the overall design of my app provides a better context for
> > taking advantage of the info.
> >
>
> If you wouldn't mind sending links for these on/off-list, I'd be
> appreciative, as they may be helpful for what I'm working on.

Well, I haven't tested this class yet, but it seems to cover most of the essentials (though it's actually not complete, as you'll notice at the bottom of the code!). I hadn't noticed before where it came from... the miracle of Google:
http://mail-archives.apache.org/mod_mbox/jakarta-commons-dev/200306.mbox/%3C20030617225539.14570.qmail@icarus.apache.org%3E

I don't actually know about contour values. Can you explain them quickly? From your example, it looks to be the relative "rank" of the notes, in ascending order, indicated as 1-based indices? I know I should probably know this already. gulp.

Thanks for the info - very helpful.

J.

- Jun 30 2007 | 11:04 pm

> I don't actually know about contour values. Can you explain them
> quickly? From your example, it looks to be the relative "rank" of the
> notes, in ascending order, indicated as 1-based indices? I know I
> should probably know this already. gulp.
>

That's what I know of it, but I'm sure there has been more advanced work done with it. Running contour might be interesting (maybe 2 notes before, 2 notes after, though that's quickly on the order of over 120 different permutations), or you can do contour over an entire phrase. I'm guessing that contour over an entire phrase could be particularly interesting in comparison with changes in sign of interval.

This article seems to have a general bibliography on the topic:
http://personal.systemsbiology.net/ilya/Publications/JNMRcontour.pdf

I just noticed Larry Polansky is on the list; IIRC he was one of the original authors of HMSL (which later became JMSL). Here's his page:
http://eamusic.dartmouth.edu/~larry/mutationsFAQ.html

Peter McCulloch
www.petermcculloch.com

- Jul 03 2007 | 5:48 pm

Peter McCulloch wrote:
> Actually, if any coders would be interested in writing a Matlab to Max
> object, that'd be pretty cool...

If you look at Matlab code and at Java, they don't look that different. On the recent project with Bill Sethares, we are just redoing it in Java to get it into real time. Automatic translation would be nice, but I doubt it's worth investigating, as you have to tweak it anyway... Now most of the stuff is running in mxj~ already, I'll keep you posted...

Stefan

--
Stefan Tiedje------------x-------
--_____-----------|--------------
--(_|_ ----|-----|-----()-------
-- _|_)----|-----()--------------
----------()--------www.ccmix.com

- Oct 24 2011 | 9:53 am

This is a Mahalanobis calculator I have been using - paste the methods into a Java class and reference it with mxj. You will need to resample vectors to the same dimensions, but it gives a distance for two sample groups that can be of different sizes. It uses the Java Matrix Package for some steps (http://math.nist.gov/javanumerics/jama/), so you need to download JAMA too. Let me know if there are problems - I cloned an Excel sheet from a site of data mining how-tos. I need to try this in Matlab.

    // Mahalanobis distance between the means of two groups (rows = samples,
    // columns = dimensions), using their pooled covariance matrix.
    public double mahalanobisDistance(double[][] g1, double[][] g2) {
        if (g1[0].length != g2[0].length)
            throw new UnsupportedOperationException("dimension mismatch");

        double[] g1means = columnMeansOf(g1);
        double[] g2means = columnMeansOf(g2);

        double[][] g1Ctr = columnCenter(g1, g1means);
        double[][] g2Ctr = columnCenter(g2, g2means);

        double[][] g1Cov = covarianceOf(g1Ctr);
        double[][] g2Cov = covarianceOf(g2Ctr);

        // weight each covariance by the group's share of the total sample count
        double objCount = (double) g1.length + (double) g2.length;
        double[][] g1W = matrixMultiply(g1Cov, ((double) g1.length / objCount));
        double[][] g2W = matrixMultiply(g2Cov, ((double) g2.length / objCount));

        double[][] pooledCov = matrixAdd(g1W, g2W);
        double[][] inversePCov = inverseOf(pooledCov);

        // difference of the group means as a column vector
        double[][] meanDiff = new double[g1means.length][1];
        for (int i = 0; i < g1means.length; i++)
            meanDiff[i][0] = g1means[i] - g2means[i];

        double[][] dist2 = matrixMultiply(matrixMultiply(transpose(meanDiff), inversePCov), meanDiff);
        return Math.sqrt(dist2[0][0]);
    }

    public double[] centerOn(double[] arr, double value) {
        double[] cArr = new double[arr.length];
        for (int i = 0; i < cArr.length; i++) {
            cArr[i] = arr[i] - value;
        }
        return cArr;
    }

    // expects column-centered data; divides by n (not n - 1)
    public double[][] covarianceOf(double[][] centeredGroup) {
        return matrixMultiply(matrixMultiply(transpose(centeredGroup), centeredGroup),
                (1D / (double) centeredGroup.length));
    }

    public double[][] columnCenter(double[][] group, double[] means) {
        double[][] groupT = transpose(group);
        for (int i = 0; i < groupT.length; i++) {
            groupT[i] = centerOn(groupT[i], means[i]);
        }
        return transpose(groupT);
    }

    public double[] columnMeansOf(double[][] group) {
        double[] colMeans = new double[group[0].length];
        for (int i = 0; i < colMeans.length; i++) {
            double[] col = new double[group.length];
            for (int j = 0; j < col.length; j++) {
                col[j] = group[j][i];
            }
            colMeans[i] = findMean(col);
        }
        return colMeans;
    }

    public double findMean(double[] arr) {
        double lengthdouble = (double) arr.length;
        double arrSum = 0D;
        for (int i = 0; i < arr.length; i++) {
            arrSum += arr[i];
        }
        return arrSum / lengthdouble;
    }

    public double[][] identity(int size) {
        double[][] id = new double[size][size];
        for (int i = 0; i < size; i++)
            id[i][i] = 1D;
        return id;
    }

    /**
     * using Jama.Matrix#solve
     * @param m1
     * @return
     */
    public double[][] inverseOf(double[][] m1) {
        Jama.Matrix jM1 = new Jama.Matrix(m1);
        Jama.Matrix jID = new Jama.Matrix(identity(m1[0].length));
        Jama.Matrix jI1 = jM1.solve(jID);
        return jI1.getArray();
    }

    public double[][] matrixAdd(double[][] m1, double[][] m2) {
        if (m1.length != m2.length || m1[0].length != m2[0].length)
            throw new UnsupportedOperationException("dimen. mismatch");

        double[][] r = new double[m1.length][m1[0].length];
        for (int i = 0; i < r.length; i++) {
            for (int j = 0; j < r[i].length; j++) {
                r[i][j] = m1[i][j] + m2[i][j];
            }
        }
        return r;
    }

    /**
     *
     * @param m1
     * @param s
     * @return non-in-place matrix multiply of arr1 matrix and s
     */
    public double[][] matrixMultiply(double[][] m1, double s) {
        return matrixMultiply(m1, s, false);
    }

    public double[][] matrixMultiply(double[][] m1, double s, boolean inPlace) {
        double[][] r = null;
        if (inPlace)
            r = m1;
        else
            r = new double[m1.length][m1[0].length];
        for (int i = 0; i < r.length; i++) {
            for (int j = 0; j < r[i].length; j++) {
                r[i][j] = m1[i][j] * s;
            }
        }
        return r;
    }

    public double[][] matrixMultiply(double[][] m1, double[][] m2) {
        if (m1[0].length != m2.length)
            throw new UnsupportedOperationException("Cannot multiply a " + m1.length + "x" + m1[0].length
                    + " matrix with a " + m2.length + "x" + m2[0].length + " matrix");
        double[][] r = new double[m1.length][m2[0].length];
        for (int i = 0; i < m1.length; i++) {
            for (int j = 0; j < m2[0].length; j++) {
                for (int k = 0; k < m1[0].length; k++) {
                    r[i][j] += m1[i][k] * m2[k][j];
                }
            }
        }
        return r;
    }

    public double[][] transpose(double[][] m1) {
        double[][] r = new double[m1[0].length][m1.length];
        for (int i = 0; i < r.length; i++) {
            for (int j = 0; j < r[i].length; j++) {
                r[i][j] = m1[j][i];
            }
        }
        return r;
    }

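The post above says vectors need to be resampled to the same dimensions first but does not show how. One possible way is linear interpolation, sketched below in Java; this is an assumption about what "resample" could mean here, not the poster's method, and the `Resample` class is hypothetical. Whether interpolating between values makes musical sense for something like a pitch list is a separate question.

```java
import java.util.Arrays;

final class Resample {
    /** Stretches or shrinks src to n values by linear interpolation. */
    static double[] resample(double[] src, int n) {
        if (src.length == 0 || n <= 0)
            throw new IllegalArgumentException("need a non-empty source and n > 0");
        double[] out = new double[n];
        if (src.length == 1 || n == 1) {
            Arrays.fill(out, src[0]); // nothing to interpolate between
            return out;
        }
        // map output index 0..n-1 onto source positions 0..src.length-1
        double step = (double) (src.length - 1) / (n - 1);
        for (int i = 0; i < n; i++) {
            double pos = i * step;
            int lo = Math.min((int) pos, src.length - 1);
            int hi = Math.min(lo + 1, src.length - 1);
            double frac = pos - lo;
            out[i] = src[lo] * (1.0 - frac) + src[hi] * frac;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(resample(new double[] {0, 10}, 3)));         // [0.0, 5.0, 10.0]
        System.out.println(Arrays.toString(resample(new double[] {0, 2, 4, 6, 8}, 3))); // [0.0, 4.0, 8.0]
    }
}
```

Unlike padding or truncating, this keeps the first and last values of the original and spreads the rest across the new length, so two lists of different lengths can be brought to a common size before any distance is taken.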
- Oct 24 2011 | 9:54 am