I have two very large tables to merge and so I have been trying to optomize the update for speed. I noticed that doing the update partially in PHP speeded it up significantly so I assumed this means I'm not be doing the MySQL properly.
I have simplified the problem to try and narrow it down ...
GRID_TABLE POSTCODE_TABLE idNo, lat, lng, nearestPostcode postcode, lat, lng ________________________________ _____________________ 1 57.1 -2.3 - AB12 3BA 56.3 -2.5 2 56.8 -1.9 - AB12 1YA 56.2 -2.3 . . . . . . (200 entries) (35,000 entries)
I want to update the GRID_TABLE with the nearestPostcode from the POSTCODE_TABLE using latitude (lat) and longitude (lng) to find the nearest postcode to each grid point...
update grid_table set nearestPostcode = (
    select postcode from postcode_table 
    where lat > grid_table.lat -0.0037 and lat < grid_table.lat +0.0037 
        and lng > grid_table.lng -0.0068 and lng < grid_table.lng +0.0068
    order by POW(lat - grid_table.lat,2) + POW((lng - grid_table.lng) *0.546,2) 
    limit 1 
    )
The idea is that the 'where' clause speeds up the search by using indexes to narrow the set down to a few candidates and then the 'order by' clause finds the nearest one within this set.
This MySQL update takes 30 secs, but if I instead update each GRID_TABLE row individually in PHP it's over in the blink of an eye.
$queryStg = "select * from grid_table ;";
$sqlQuery1 = mysqli_query($mysqliLink, $queryStg);
while( $sqlRow = mysqli_fetch_assoc( $sqlQuery1 ) ) {
    $idNo = $sqlRow['idNo'];
    $lat = $sqlRow['lat'];
    $lng = $sqlRow['lng'];
    $queryStg = "
        update grid_table
            set nearestPostcode = (
                SELECT postcode
                FROM postcode_table
                where
                    lat > " . ($lat - 0.0037) . " and
                    lat < " . ($lat + 0.0037) . " and
                    lng > " . ($lng - 0.0068) . " and
                    lng < " . ($lng + 0.0068) . "
                ORDER BY
                    POW(lat - $lat, 2) +
                    POW((lng - $lng) * 0.546, 2)
                    ASC
                limit 1
                )
            where idNo= $idNo;
        ";
    $sqlQuery2 = mysqli_query($mysqliLink, $queryStg);
}
Surely the MySQL version should be faster than the PHP version?
Here is the MySQL for the tables...
CREATE TABLE `grid_table` (
    `idNo` INT(11) NOT NULL AUTO_INCREMENT,
    `lat` FLOAT(6,4) NOT NULL COMMENT 'latitude',
    `lng` FLOAT(6,4) NOT NULL COMMENT 'longitude',
    `nearestPostcode` CHAR(8) NOT NULL,
    PRIMARY KEY (`idNo`),
    INDEX `lat_lng` (`lat`, `lng`)
)
ENGINE=MyISAM
ROW_FORMAT=DEFAULT
AUTO_INCREMENT=30047
CREATE TABLE `postcode_table` (
    `postcode` CHAR(8) NOT NULL,
    `lat` FLOAT(6,4) NOT NULL COMMENT 'latitude',
    `lng` FLOAT(6,4) NOT NULL COMMENT 'longitude',
    PRIMARY KEY (`postcode`),
    INDEX `lat` (`lat`),
    INDEX `lng` (`lng`),
    INDEX `lat_lng` (`lat`, `lng`)
)
ENGINE=MyISAM
ROW_FORMAT=DEFAULT
MySQL import file is here... https://docs.google.com/leaf?id=0B93lksnTC7_cM2Y2ZDk1Y2YtMGQ3Yy00OTIxLTk0ZDAtZmE2NmQ3YTc1ZWRm&hl=en
(if you run the UPDATE, 10 nearestPostcodes will be added).
UPDATE AFTER ANSWERS...
I ran this...
explain extended
 SELECT postcode FROM postcode_table 
 where lat > 57.0 and lat < 57.0074
 and lng > -2.013 and lng < -2
 ORDER BY POW(lat - 57.0, 2) + POW((lng - -2) * 0.546, 2) ASC 
It returned...
id,select_type,table,type,possible_keys,key,key_len,ref,rows,filtered,Extra 1,SIMPLE,postcode_table,range,lat,lng,lat_lng,lat_lng,8,NULL,65,100.00,Using where; Using filesort
Removing the 'order by' caluse -> no difference in speed.
Simplifying the 'where' clause by removing 'lng', ie
where lat between grid_table.lat - 0.0037 and grid_table.lat + 0.0037-> faster: 3 secs rather than 30 secs.
Using spatial column and index (see below) -> much slower (190 sec). Not sure if I implemented this correctly though.
ALTER TABLE `grid_table` ADD COLUMN `coords` POINT NOT NULL; update grid_table set coords = POINT(lat, lng); ALTER TABLE `grid_table` ADD SPATIAL INDEX `coords` (`coords`); ALTER TABLE `postcode_table` ADD COLUMN `coords` POINT NOT NULL; update postcode_table set coords = POINT(lat, lng); ALTER TABLE `postcode_table` ADD SPATIAL INDEX `coords` (`coords`); analyze table grid_table; optimize table grid_table; analyze table postcode_table; optimize table postcode_table;
update grid_table set nearestPostcode = (
    select postcode from postcode_table 
    WHERE MBRContains(GeomFromText(concat(
         'POLYGON((', 
          grid_table.lat - 0.0037, ' ', grid_table.lng - 0.0068, ', ',
          grid_table.lat - 0.0037, ' ', grid_table.lng + 0.0068, ', ',
          grid_table.lat + 0.0037, ' ', grid_table.lng - 0.0068, ', ',
          grid_table.lat - 0.0037, ' ', grid_table.lng - 0.0068, 
          '))')), postcode_table.coords)
     order by POW(lat - grid_table.lat,2) + POW((lng - grid_table.lng) *0.546,2)
     limit 1 
     )
                In your MySQL version your subquery works with all 30000 grid_table records, whether in your PHP version -- it's only one. As you add where on outer table PK.
I suggest you here to change update query. For example, try to make it without subquery, multiple-update as here http://dev.mysql.com/doc/refman/5.0/en/update.html.
I believe it should help.
Something like:
update grid_table, postcode_table
set grid_table.nearestPostcode = postcode_table.postcode
where postcode_table.lat > grid_table.lat - 0.0037
and postcode_table.lat < grid_table.lat + 0.0037 
and postcode_table.lng > grid_table.lng - 0.0068
and lng < grid_table.lng + 0.0068
group by grid_table.idNo
having (POW(lat - grid_table.lat,2) + POW((lng - grid_table.lng) *0.546,2)) = min(POW(lat - grid_table.lat,2) + POW((lng - grid_table.lng) *0.546,2))
May be this version could help, but I`m not sure. I assume, the main root problem in your 1st version is subquery over all records.
To have explain update, you can "convert" it to similar select:
explain
select
    *,
    (
        select postcode from postcode_table
        where lat > grid_table.lat -0.0037 and lat < grid_table.lat +0.0037
            and lng > grid_table.lng -0.0068 and lng < grid_table.lng +0.0068
        order by POW(lat - grid_table.lat,2) + POW((lng - grid_table.lng) *0.546,2)
        limit 1
    ) nearestPostcode   
from grid_table
And you will see:
id  select_type table   type    possible_keys   key key_len ref rows    Extra
1   PRIMARY grid_table  ALL                 224 
2   DEPENDENT SUBQUERY  postcode_table  ALL lat,lng,lat_lng             35605   Using where; Using temporary; Using filesort
But in case of idNo we have:
explain
select
    *,
    (
        select postcode from postcode_table
        where lat > grid_table.lat -0.0037 and lat < grid_table.lat +0.0037
            and lng > grid_table.lng -0.0068 and lng < grid_table.lng +0.0068
        order by POW(lat - grid_table.lat,2) + POW((lng - grid_table.lng) *0.546,2)
        limit 1
    ) nearestPostcode   
from grid_table
where idNo = 1487;
id  select_type table   type    possible_keys   key key_len ref rows    Extra
1   PRIMARY grid_table  const   PRIMARY PRIMARY 4   const   1   
2   DEPENDENT SUBQUERY  postcode_table  range   lat,lng,lat_lng lat 4       18  Using where; Using filesort
So we have 35605 rows vs ~18*224 (~4000).
To find correct query try to find good select 1st.
Update
Subquery isn't a root here :( So I think we should try some precalculated + indexed column may be. Target is to avoid order by SOMEFUNC()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With