Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extrapolation -- awk based

Tags:

linux

shell

awk

I need help in the following: I have a data file (columns separated by "\t" tabular) like this data.dat

    # y1    y2      y3      y4
    17.1685 21.6875 20.2393 26.3158 

These are x values of 4 points for a linear fit. The four y values are constant: 0, 200, 400, 600.

I can create a linear fit of the point pairs (x,y): (x1,y1)=(17.1685,0), (x2,y2)=(21.6875,200), (x3,y3)=(20.2393,400), (x4,y4)=(26.3158,600).

Now I would like to make a linear fit on three of these point paris, (x1,y1), (x2,y2), (x3,y3) and (x2,y2), (x3,y3), (x4,y4) and (x1,y1), (x3,y3), (x4,y4) and (x1,y1), (x2,y2), (x4,y4).

If I have these three of points with a linear fit I would like to know the value of the x value of the extrapolated point being out of these three fitted points.

I have so far this awk code:

#!/usr/bin/awk -f

BEGIN{
  z[1] = 0;
  z[2] = 200;
  z[3] = 400;
  z[4] = 600;
}

{
  split($0,str,"\t");
  n = 0.0;

  for(i=1; i<=NF; i++)
  {
    centr[i] = str[i];
    n += 1.0;
  # printf("%d\t%f\t%.1f\t",i,centr[i],z[i]);
  }
  # print "";

  if (n > 2)
  {
    lsq(n,z,centr);
  }
}

function lsq(n,x,y)
{
  sx  = 0.0
  sy  = 0.0
  sxx = 0.0
  syy = 0.0
  sxy = 0.0
  eps = 0.0

  for (i=1;i<=n;i++)
  {
    sx  += x[i]
    sy  += y[i]
    sxx += x[i]*x[i]
    sxy += x[i]*y[i]
    syy += y[i]*y[i]
  }

  if ( (n==0) || ((n*sxx-sx*sx)==0) )
  {
    next;
  }
#   print "number of data points = " n;
  a = (sxx*sy-sxy*sx)/(n*sxx-sx*sx)
  b = (n*sxy-sx*sy)/(n*sxx-sx*sx)

  for(i=1;i<=n;i++)
  {
    ycalc[i] = a+b*x[i]
    dy[i]    = y[i]-ycalc[i]
    eps     += dy[i]*dy[i]
  }

  print "# Intercept =\t"a"
  print "# Slope     =\t"b"

  for (i=1;i<=n;i++)
  {
    printf("%8g %8g %8g \n",x[i],y[i],ycalc[i])
  }

} # function lsq()

So,

    If we extrapolate to the place of 4th 
    0   17.1685   <--(x1,y1)
    200 21.6875   <--(x2,y2)
    400 20.2393   <--(x3,y3)
    600 22.7692 <<< (x4 = 600,y1 = 22.7692)

    If we extrapolate to the place of 3th
    0   17.1685   <--(x1,y1)
    200 21.6875   <--(x2,y2)
    400 23.6867 <<< (x3 = 400,y3 = 23.6867)
    600 26.3158   <--(x4,y4)

    0   17.1685
    200 19.35266 <<<
    400 20.2393
    600 26.3158

    0   18.1192 <<<
    200 21.6875
    400 20.2393
    600 26.3158

My current output is the following:

$> ./prog.awk data.dat
# Intercept =   17.4537
# Slope     =   0.0129968   
       0  17.1685  17.4537 
     200  21.6875  20.0531 
     400  20.2393  22.6525 
     600  26.3158  25.2518 
like image 227
user1116360 Avatar asked Jan 18 '12 13:01

user1116360


1 Answers

Assuming the core calculation in the lsq function is OK (it looks about right, but I haven't scrutinized it), then that gives you the slope and intercept for the least sum of squares line of best fit for the input data set (parameters x, y, n). I'm not sure I understand the tail end of the function.

For your 'take three points and calculate the fourth' problem, the simplest way is to generate the 4 subsets (logically, by deleting one point from the set of four on each of four calls), and redo the calculation.

You need to call another function that takes the line data (slope, intercept) from lsq and interpolates (extrapolates) the value at another y value. That's a straight-forward calculation (x = m * y + c), but you need to determine which y value is missing from the set of 3 you pass in.

You could 'optimize' (meaning 'complicate') this scheme by dropping one value at a time from the 'sums of squares' and 'sums' and 'sum of products' values, recalculating the slope, intercept, and then calculating the missing point again.

(I'll also observe that normally it would be the x-coordinates with the fixed values 0, 200, 400, 600 and the y-coordinates would be the values read. However, that's just a matter of orientation, so it is not crucial.)


Here's at least plausibly working code. Since awk automatically splits on white space, there's no need for you to split on tabs specifically; the read loop takes this into account.

The code needs serious refactoring; there is a ton of repetition in it - however, I also have a job that I'm supposed to do.

#!/usr/bin/awk -f
BEGIN{
  z[1] = 0;
  z[2] = 200;
  z[3] = 400;
  z[4] = 600;
}

{
  for (i = 1; i <= NF; i++)
  {
    centr[i] = $i
  }

  if (NF > 2)
  {
    lsq(NF, z, centr);
  }
}

function lsq(n, x, y)
{
  if (n == 0) return

  sx  = 0.0
  sy  = 0.0
  sxx = 0.0
  syy = 0.0
  sxy = 0.0

  for (i = 1; i <= n; i++)
  {
    print "x[" i "] = " x[i] ", y[" i "] = " y[i]
    sx  += x[i]
    sy  += y[i]
    sxx += x[i]*x[i]
    sxy += x[i]*y[i]
    syy += y[i]*y[i]
  }

  if ((n*sxx - sx*sx) == 0) return

#   print "number of data points = " n;
  a = (sxx*sy-sxy*sx)/(n*sxx-sx*sx)
  b = (n*sxy-sx*sy)/(n*sxx-sx*sx)

  for (i = 1; i <= n; i++)
  {
    ycalc[i] = a+b*x[i]
  }

  print "# Intercept = " a
  print "# Slope     = " b
  print "Line: x = " a " + " b " * y"

  for (i = 1; i <= n; i++)
  {
    printf("x = %8g, yo = %8g, yc = %8g\n", x[i], y[i], ycalc[i])
  }

  print ""
  print "Different subsets\n"

  for (drop = 1; drop <= n; drop++)
  {
    print "Subset " drop
    sx = sy = sxx = sxy = syy = 0
    j = 1
    for (i = 1; i <= n; i++)
    {
      if (i == drop) continue
      print "x[" j "] = " x[i] ", y[" j "] = " y[i]
      sx  += x[i]
      sy  += y[i]
      sxx += x[i]*x[i]
      sxy += x[i]*y[i]
      syy += y[i]*y[i]
      j++
    }
    if (((n-1)*sxx - sx*sx) == 0) continue
    a = (sxx*sy-sxy*sx)/((n-1)*sxx-sx*sx)
    b = ((n-1)*sxy-sx*sy)/((n-1)*sxx-sx*sx)
    print "Line: x = " a " + " b " * y"

    xt = x[drop]
    yt = a + b * xt;
    print "Interpolate: x = " xt ", y = " yt
  }
}

Since awk doesn't provide an easy way to pass back multiple values from a function, nor does it provide structures other than arrays (sometimes associative), it is not perhaps the best language for this task. On the other hand, it can be made to do the job. You might be able to bundle the Least Squares calculation in a function that returns an array containing the slope and intercept, and then use that. Your turn to explore options.

Given the script lsq.awk and the input file lsq.data shown, I get the output shown:

$ cat lsq.data
17.1685 21.6875 20.2393 26.3158
$ awk -f lsq.awk lsq.data
x[1] = 0, y[1] = 17.1685
x[2] = 200, y[2] = 21.6875
x[3] = 400, y[3] = 20.2393
x[4] = 600, y[4] = 26.3158
# Intercept = 17.4537
# Slope     = 0.0129968
Line: x = 17.4537 + 0.0129968 * y
x =        0, yo =  17.1685, yc =  17.4537
x =      200, yo =  21.6875, yc =  20.0531
x =      400, yo =  20.2393, yc =  22.6525
x =      600, yo =  26.3158, yc =  25.2518

Different subsets

Subset 1
x[1] = 200, y[1] = 21.6875
x[2] = 400, y[2] = 20.2393
x[3] = 600, y[3] = 26.3158
Line: x = 18.1192 + 0.0115708 * y
Interpolate: x = 0, y = 18.1192
Subset 2
x[1] = 0, y[1] = 17.1685
x[2] = 400, y[2] = 20.2393
x[3] = 600, y[3] = 26.3158
Line: x = 16.5198 + 0.0141643 * y
Interpolate: x = 200, y = 19.3526
Subset 3
x[1] = 0, y[1] = 17.1685
x[2] = 200, y[2] = 21.6875
x[3] = 600, y[3] = 26.3158
Line: x = 17.7985 + 0.0147205 * y
Interpolate: x = 400, y = 23.6867
Subset 4
x[1] = 0, y[1] = 17.1685
x[2] = 200, y[2] = 21.6875
x[3] = 400, y[3] = 20.2393
Line: x = 18.163 + 0.007677 * y
Interpolate: x = 600, y = 22.7692
$

Edit: In the previous version of the answer, the subsets were multiplying by n instead of (n-1). The values in the revised output seem to agree with what you expect. The residual issues are presentational, not computational.

like image 82
Jonathan Leffler Avatar answered Sep 23 '22 07:09

Jonathan Leffler