Tuesday, May 3, 2011

Correlate one set of vectors to another in numpy?

Hi,

Let's say I have a set of vectors (readings from sensor 1, readings from sensor 2, readings from sensor 3 -- indexed first by timestamp and then by sensor id) that I'd like to correlate to a separate set of vectors (temperature, humidity, etc -- also all indexed first by timestamp and secondly by type).

What is the cleanest way in numpy to do this? It seems like it should be a rather simple function...

In other words, I'd like to see:

> a.shape 
(365,20)

> b.shape
(365, 5)

> correlations = magic_correlation_function(a,b)

> correlations.shape
(20, 5)

Cheers, /YGA

P.S. I've been asked to add an example.

Here's what I would like to see:

$ In [27]:  x
$ Out[27]: 
array([[ 0,  0,  0],
       [-1,  0, -1],
       [-2,  0, -2],
       [-3,  0, -3],
       [-4,  0.1, -4]])

$ In [28]: y
$ Out[28]: 
array([[0, 0],
       [1, 0],
       [2, 0],
       [3, 0],
       [4, 0.1]])

$ In [28]: magic_correlation_function(x, y)
$ Out[28]: 
array([[-1.        ,  0.70710678,  1.        ],
       [-0.70710678,  1.        ,  0.70710678]])

Ps2: whoops, mis-transcribed my example. Sorry all. Fixed now.

From stackoverflow
  • As David said, you should define the correlation you're using. I don't know of any definition of correlation that gives sensible numbers when correlating empty and non-empty signals.

  • The simplest thing that I could find was using the scipy.stats package:

    In [8]: x
    Out[8]: 
    array([[ 0. ,  0. ,  0. ],
           [-1. ,  0. , -1. ],
           [-2. ,  0. , -2. ],
           [-3. ,  0. , -3. ],
           [-4. ,  0.1, -4. ]])
    In [9]: y
    Out[9]: 
    array([[0. , 0. ],
           [1. , 0. ],
           [2. , 0. ],
           [3. , 0. ],
           [4. , 0.1]])
    
    In [10]: import scipy.stats
    
    In [27]: (scipy.stats.cov(y,x)
              /(numpy.sqrt(scipy.stats.var(y,axis=0)[:,numpy.newaxis]))
              /(numpy.sqrt(scipy.stats.var(x,axis=0))))
    Out[27]: 
    array([[-1.        ,  0.70710678, -1.        ],
           [-0.70710678,  1.        , -0.70710678]])
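
The expression above relies on scipy.stats.cov and scipy.stats.var, which, as far as I can tell, are no longer available in current SciPy releases. Here is a minimal NumPy-only sketch of the same computation. The name cross_correlation is made up for illustration, and ddof=1 is an assumption chosen to give the sample (N-1) normalisation.

```python
import numpy as np

def cross_correlation(y, x):
    """Pearson correlation of every column of y with every column of x.

    Hypothetical helper (not a NumPy or SciPy function). Both inputs are
    indexed first by observation, then by variable.
    Returns an array of shape (y.shape[1], x.shape[1]).
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n = y.shape[0]
    # Centre each column on its mean.
    yc = y - y.mean(axis=0)
    xc = x - x.mean(axis=0)
    # Sample cross-covariance between the columns of y and the columns of x.
    cov = yc.T.dot(xc) / (n - 1)
    # Per-column standard deviations, shaped so they broadcast over rows and columns.
    sy = y.std(axis=0, ddof=1)[:, np.newaxis]
    sx = x.std(axis=0, ddof=1)[np.newaxis, :]
    return cov / sy / sx

x = np.array([[0, 0, 0], [-1, 0, -1], [-2, 0, -2], [-3, 0, -3], [-4, 0.1, -4]])
y = np.array([[0, 0], [1, 0], [2, 0], [3, 0], [4, 0.1]])
result = cross_correlation(y, x)
print(result.shape)
```

With the shapes from the question, cross_correlation(a, b) would return a (20, 5) array. A constant column has zero variance, so its correlations come out as nan.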
    

    These aren't quite the numbers in your example: the third column of x is identical to the first, so the third column of the result has to match the first.

    A more complicated, but purely numpy solution is:

    In [40]: numpy.corrcoef(x.T,y.T)[numpy.arange(x.shape[1])[numpy.newaxis,:]
                                     ,x.shape[1]+numpy.arange(y.shape[1])[:,numpy.newaxis]]
    Out[40]: 
    array([[-1.        ,  0.70710678, -1.        ],
           [-0.70710678,  1.        , -0.70710678]])
    

    This will be slower because it also computes the correlation of every column of x with every other column of x (and likewise for y), which you don't want. Also, the advanced indexing used to pull out the subset of the array you need can make your head hurt.

    If you're going to use numpy intensively, get familiar with the rules on broadcasting and indexing. They will help you push as much as possible down to the C level.
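
To make the indexing step less mysterious, here is a small sketch of how two index arrays broadcast against each other to pick a rectangular block out of a larger matrix. The matrix c and the sizes nx and ny are illustrative stand-ins, not values taken from the question.

```python
import numpy as np

nx, ny = 3, 2   # stand-ins for x.shape[1] and y.shape[1]
# Dummy 5x5 matrix standing in for the full correlation matrix.
c = np.arange((nx + ny) ** 2).reshape(nx + ny, nx + ny)

rows = np.arange(nx)[np.newaxis, :]        # shape (1, 3): row indices 0..2
cols = nx + np.arange(ny)[:, np.newaxis]   # shape (2, 1): column indices 3..4

# The two index arrays broadcast to shape (2, 3);
# element [i, j] of the result is c[rows[0, j], cols[i, 0]].
block = c[rows, cols]

# The same block via plain slicing, transposed to the same orientation.
same = c[:nx, nx:].T
print(block.shape, np.array_equal(block, same))
```

Plain slicing returns a view and is usually easier to read. Index arrays are worth the trouble mainly when the rows or columns you want are not contiguous.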

    YGA : I've updated the question with the "right" inputs -- prob. makes sense to update the response just so as not to confuse people :-)
    AFoglia : Done. I've also added links to some helpful documentation.
  • Will this do what you want?

    correlations = dot(transpose(a), b)
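
A plain dot product of the raw columns is not yet a correlation coefficient. As a sketch, assuming Pearson correlation is what is wanted, each column can be standardised first, so that the dot product divided by the number of observations gives the coefficients. The random data and the helper name standardize are illustrative only.

```python
import numpy as np

rng = np.random.RandomState(0)   # fixed seed, illustrative data only
a = rng.rand(365, 20)
b = rng.rand(365, 5)

def standardize(m):
    # Subtract each column's mean and divide by its population standard deviation.
    return (m - m.mean(axis=0)) / m.std(axis=0)

# Unnormalised cross products, as in the one-liner above.
raw = np.dot(np.transpose(a), b)

# Standardised columns: the dot product divided by N is the Pearson correlation.
correlations = np.dot(standardize(a).T, standardize(b)) / a.shape[0]

# Cross-check against the off-diagonal block of numpy.corrcoef.
expected = np.corrcoef(a.T, b.T)[:20, 20:]
print(correlations.shape, np.allclose(correlations, expected))
```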
    
