How do I get box plot key numbers from an array in PHP?

Tags: php, boxplot

Say I have an array with values like:

$values = array(48,30,97,61,34,40,51,33,1);

And I want the key values needed to draw a box plot, like this:

$box_plot_values = array(
    'lower_outlier'  => 1,
    'min'            => 8,
    'q1'             => 32,
    'median'         => 40,
    'q3'             => 56,
    'max'            => 80,
    'higher_outlier' => 97,
);

How would I do this in PHP?

asked Oct 06 '13 14:10 by Lilleman


2 Answers

function box_plot_values($array)
{
    $return = array(
        'lower_outlier'  => 0,
        'min'            => 0,
        'q1'             => 0,
        'median'         => 0,
        'q3'             => 0,
        'max'            => 0,
        'higher_outlier' => 0,
    );

    $array_count = count($array);
    sort($array, SORT_NUMERIC);

    $return['min']            = $array[0];
    $return['lower_outlier']  = $return['min'];
    $return['max']            = $array[$array_count - 1];
    $return['higher_outlier'] = $return['max'];
    $middle_index             = floor($array_count / 2);
    $return['median']         = $array[$middle_index]; // Assume an odd # of items
    $lower_values             = array();
    $higher_values            = array();

    // If we have an even number of values, we need some special rules
    if ($array_count % 2 == 0)
    {
        // Handle the even case by averaging the middle 2 items
        $return['median'] = round(($return['median'] + $array[$middle_index - 1]) / 2);

        foreach ($array as $idx => $value)
        {
            if ($idx < ($middle_index - 1)) $lower_values[]  = $value; // We need to remove both of the values we used for the median from the lower values
            elseif ($idx > $middle_index)   $higher_values[] = $value;
        }
    }
    else
    {
        foreach ($array as $idx => $value)
        {
            if ($idx < $middle_index)     $lower_values[]  = $value;
            elseif ($idx > $middle_index) $higher_values[] = $value;
        }
    }

    $lower_values_count = count($lower_values);
    $lower_middle_index = floor($lower_values_count / 2);
    $return['q1']       = $lower_values[$lower_middle_index];
    if ($lower_values_count % 2 == 0)
        $return['q1'] = round(($return['q1'] + $lower_values[$lower_middle_index - 1]) / 2);

    $higher_values_count = count($higher_values);
    $higher_middle_index = floor($higher_values_count / 2);
    $return['q3']        = $higher_values[$higher_middle_index];
    if ($higher_values_count % 2 == 0)
        $return['q3'] = round(($return['q3'] + $higher_values[$higher_middle_index - 1]) / 2);

    // Check if min and max should be capped
    $iqr = $return['q3'] - $return['q1']; // Calculate the interquartile range (IQR)
    if ($return['q1'] - $return['min'] > $iqr) $return['min'] = $return['q1'] - $iqr;
    if ($return['max'] - $return['q3'] > $iqr) $return['max'] = $return['q3'] + $iqr;

    return $return;
}
answered Oct 01 '22 21:10 by Lilleman
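To make the quartile step above easier to follow, here is a minimal sketch that traces the question's sample data through the same "median of each half" rule. The helper name median_of_sorted and the use of array_slice are assumptions made for illustration and are not part of the answer. The sketch also leaves the results unrounded, whereas the answer applies round().

```php
<?php
// Hypothetical helper (not from the answer): median of an already sorted list.
function median_of_sorted(array $sorted): float
{
    $n   = count($sorted);
    $mid = intdiv($n, 2);
    return $n % 2 === 0
        ? ($sorted[$mid - 1] + $sorted[$mid]) / 2   // even count: average the two middle items
        : (float) $sorted[$mid];                    // odd count: the single middle item
}

$values = [48, 30, 97, 61, 34, 40, 51, 33, 1];
sort($values, SORT_NUMERIC);                        // 1, 30, 33, 34, 40, 48, 51, 61, 97

// Size of each half once the item(s) used for the median are excluded,
// mirroring the answer: one item dropped for odd counts, two for even counts.
// Assumes at least three values so that both halves are non-empty.
$n    = count($values);
$half = intdiv($n - 1, 2);

$lower = array_slice($values, 0, $half);            // 1, 30, 33, 34
$upper = array_slice($values, $n - $half);          // 48, 51, 61, 97

$median = median_of_sorted($values);
$q1     = median_of_sorted($lower);
$q3     = median_of_sorted($upper);

printf("q1=%s median=%s q3=%s\n", $q1, $median, $q3);
```

For the sample this gives q1 = 31.5, median = 40 and q3 = 56. Rounding q1 as the answer does yields 32, which matches the values in the question.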


Lilleman's code is brilliant. I really like the way it handles the median and q1/q3. Had I answered first, I would have coped with odd and even counts in a harder, unnecessary way: four if branches for the four possible values of mod(count(values), 4). His approach is neat and tidy, and I really admire his work.

I would like to make some improvements to max, min, higher_outlier and lower_outlier. Because q1 - 1.5*IQR is only the lower bound, the 'min' should be the smallest value that is not below this bound. The same applies to 'max' with q3 + 1.5*IQR. There may also be more than one outlier. So I have made some changes based on Lilleman's work. Thanks.

function box_plot_values($array)
{
    $return = array(
        'lower_outlier'  => 0,
        'min'            => 0,
        'q1'             => 0,
        'median'         => 0,
        'q3'             => 0,
        'max'            => 0,
        'higher_outlier' => 0,
    );

    $array_count = count($array);
    sort($array, SORT_NUMERIC);

    $return['min']            = $array[0];
    $return['lower_outlier']  = array();
    $return['max']            = $array[$array_count - 1];
    $return['higher_outlier'] = array();
    $middle_index             = floor($array_count / 2);
    $return['median']         = $array[$middle_index]; // Assume an odd # of items
    $lower_values             = array();
    $higher_values            = array();

    // If we have an even number of values, we need some special rules
    if ($array_count % 2 == 0)
    {
        // Handle the even case by averaging the middle 2 items
        $return['median'] = round(($return['median'] + $array[$middle_index - 1]) / 2);

        foreach ($array as $idx => $value)
        {
            if ($idx < ($middle_index - 1)) $lower_values[]  = $value; // We need to remove both of the values we used for the median from the lower values
            elseif ($idx > $middle_index)   $higher_values[] = $value;
        }
    }
    else
    {
        foreach ($array as $idx => $value)
        {
            if ($idx < $middle_index)     $lower_values[]  = $value;
            elseif ($idx > $middle_index) $higher_values[] = $value;
        }
    }

    $lower_values_count = count($lower_values);
    $lower_middle_index = floor($lower_values_count / 2);
    $return['q1']       = $lower_values[$lower_middle_index];
    if ($lower_values_count % 2 == 0)
        $return['q1'] = round(($return['q1'] + $lower_values[$lower_middle_index - 1]) / 2);

    $higher_values_count = count($higher_values);
    $higher_middle_index = floor($higher_values_count / 2);
    $return['q3']        = $higher_values[$higher_middle_index];
    if ($higher_values_count % 2 == 0)
        $return['q3'] = round(($return['q3'] + $higher_values[$higher_middle_index - 1]) / 2);

    // Check if min and max should be capped
    $iqr = $return['q3'] - $return['q1']; // Calculate the interquartile range (IQR)

    // q1 - 1.5*IQR is only the lower bound. Compare every value in the
    // lower half against it: values below the bound are outliers, and the
    // smallest value at or above the bound is the 'min' for the box plot.
    $return['min'] = $return['q1'] - 1.5 * $iqr;
    foreach ($lower_values as $idx => $value)
    {
        if ($value < $return['min']) // The value is below the bound
        {
            // Keeping the index may seem unnecessary, but anyone interested
            // in which values are outliers can use it (e.g. with asort)
            $return['lower_outlier'][$idx] = $value;
        }
        else
        {
            $return['min'] = $value; // First value at or above the bound
            break; // Stop here so 'min' stays the smallest value within the bound
        }
    }

    // q3 + 1.5*IQR is the upper bound; handle it the same way, from the top down.
    $return['max'] = $return['q3'] + 1.5 * $iqr;
    foreach (array_reverse($higher_values) as $idx => $value)
    {
        if ($value > $return['max'])
        {
            $return['higher_outlier'][$idx] = $value;
        }
        else
        {
            $return['max'] = $value;
            break;
        }
    }

    return $return;
}

I hope this helps anyone interested in this issue. Please leave a comment if there is a better way to find out which values are the outliers. Thanks!

answered Oct 01 '22 20:10 by ShaoE
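On the open question of how to tell which values are outliers, one possible approach is to filter the original array against the two 1.5*IQR bounds. The sketch below is an illustration under assumptions, not code from either answer: the function name split_at_fences and its return keys are hypothetical, q1 and q3 are assumed to have been computed already, and the input is assumed to be non-empty. Because array_filter preserves keys, each outlier keeps its index in the unsorted input.

```php
<?php
// Hypothetical helper (not from either answer): split values at the 1.5*IQR bounds.
function split_at_fences(array $values, float $q1, float $q3): array
{
    $iqr         = $q3 - $q1;
    $lower_fence = $q1 - 1.5 * $iqr;
    $upper_fence = $q3 + 1.5 * $iqr;

    // array_filter keeps the original keys, so each outlier
    // is reported together with its position in $values.
    $outliers = array_filter(
        $values,
        function ($v) use ($lower_fence, $upper_fence) {
            return $v < $lower_fence || $v > $upper_fence;
        }
    );

    // Everything that is not an outlier lies within the bounds.
    $inliers = array_diff_key($values, $outliers);

    return array(
        'min'      => min($inliers),   // smallest value within the bounds
        'max'      => max($inliers),   // largest value within the bounds
        'outliers' => $outliers,
    );
}

// Sample data from the question, with the rounded quartiles q1 = 32 and q3 = 56.
$values = array(48, 30, 97, 61, 34, 40, 51, 33, 1);
$result = split_at_fences($values, 32, 56);
print_r($result);
```

With these inputs the bounds are -4 and 92, so only 97 (index 2 in the unsorted input) is reported as an outlier, and the whisker ends are 1 and 61.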