Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Brain-Dead MySQL Select Optimization (Using Temporary, Using Filesort)

I'm currently working on a project involving Patents pulled the USPTO website, as part of this project I'm using the database created by people from the University of Illinois
(Paper: http://abel.lis.illinois.edu/UPDC/USPTOPatentsDatabaseConstruction.pdf)
(Slightly outdated Schema of the Tables I'm using, only missing non-index/key values: http://i.imgur.com/44LHS3L.png)

Now as the title says I'm trying to optimize a query:

SELECT 
        PN,
        AN,
        grants.GrantID,
        grants.FileDate,
        grants.IssueDate,
        grants.Kind,
        grants.ApplicationID,
        assignee_g.OrgName,
        GROUP_CONCAT(DISTINCT CONCAT_WS(', ', assignee_g.City, assignee_g.State, assignee_g.Country) separator ';') as Assignee,
        GROUP_CONCAT(DISTINCT CONCAT_WS(', ', inventor_g.FirstName, inventor_g.LastName) separator ';') as Inventor,
        GROUP_CONCAT(DISTINCT CONCAT_WS(', ', inventor_g.City, inventor_g.State, inventor_g.Country) separator ';') as Inventor_address,
        GROUP_CONCAT(DISTINCT CONCAT_WS(', ', usclass_g.Class, usclass_g.Subclass) separator ';') as USClass,
        intclass_g.Section,
        intclass_g.Class,
        intclass_g.Subclass,
        intclass_g.MainGroup,
        intclass_g.SubGroup
FROM
    (
    SELECT grants.GrantID as CitingID, CitedID as PN, grants2.ApplicationID AS AN
    FROM
        gracit_g, grants, grants as grants2
    Where
        grants.GrantID IN (*A 
                                         couple 
                                           Thousand
                                              keys*)
            and grants.GrantID = gracit_g.GrantID and grants2.GrantID = CitedID 
    LIMIT 500000) tbl1,
             grants, assignee_g, inventor_g, usclass_g, intclass_g
WHERE
    grants.GrantID = tbl1.CitingID
        and grants.GrantID = assignee_g.GrantID
        and grants.GrantID = inventor_g.GrantID
        and grants.GrantID = usclass_g.GrantID
        and grants.GrantID = intclass_g.GrantID
GROUP BY PN, GrantID
LIMIT 50000000

Pretty much for each Patent Being cited by one that came after it I want to record the information of the patent citing it. The problem I seem to be running into is that my "GROUP BY PN, GrantID" is causing "Using Temporary, Using Filesort" which is severely slowing down my endeavours.

This is what my explain gave to me (sorry if it's not perfectly formatted, I couldn't find how to make a table)

1
PRIMARY
derived2
ALL
8716
possible key: null
key: null
key_len: null
ref: null
Using temporary; Using filesort


1
PRIMARY
grants
eq_ref
PRIMARY
PRIMARY
62
tbl1.CitingID
1


1
PRIMARY
assignee_g
ref
PRIMARY,FK_PublicationID_PUBLICATION_ASSIGNEE_P
PRIMARY
62
tbl1.CitingID
1


1
PRIMARY
intclass_g
ref
PRIMARY,fk_publicationid_PUBLICATION_INTERNATIONALCLASS_P
PRIMARY
62
tbl1.CitingID
1


1
PRIMARY
inventor_g
ref
PRIMARY,fk_PublicationID_Inventor_p
PRIMARY
62
tbl1.CitingID
1


1
PRIMARY
usclass_g
ref
PRIMARY,fk_publicationid_PUBLICATION_USCLASS_P
PRIMARY
62
tbl1.CitingID
2


2
DERIVED
grants
range
PRIMARY
PRIMARY
62
ref: null
2179
Using where; Using index


2
DERIVED
gracit_g
ref
PRIMARY,FK_PublicationID_PUBLICATION_PCITATION_P,CitedID
PRIMARY
62
uspto_patents.grants.GrantID
4
Using where


2
DERIVED
grants2
eq_ref
PRIMARY
PRIMARY
62
uspto_patents.gracit_g.CitedID
1


The SHOW CREATE for gracit_g is:

CREATE TABLE `gracit_g` (
`GrantID` varchar(20) NOT NULL,
`Position` int(11) NOT NULL,
`CitedID` varchar(20) DEFAULT NULL,
`Kind` varchar(10) DEFAULT NULL COMMENT 'identify whether citedDoc is a document or foreign patent',
`Name` varchar(100) DEFAULT NULL,
`Date` date DEFAULT NULL,
`Country` varchar(100) DEFAULT NULL,
`Category` varchar(100) DEFAULT NULL,
PRIMARY KEY (`GrantID`,`Position`),
KEY `FK_PublicationID_PUBLICATION_PCITATION_P` (`GrantID`),
KEY `CitedID` (`CitedID`),
CONSTRAINT `FK_GrantID_GRANT_PCITATION_G0` FOREIGN KEY (`GrantID`) REFERENCES `grants`   (`GrantID`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8

The SHOW CREATE for grants is:

 CREATE TABLE `grants` (
 `GrantID` varchar(20) NOT NULL,
 `Title` varchar(500) DEFAULT NULL,
 `IssueDate` date DEFAULT NULL,
 `Kind` varchar(2) DEFAULT NULL,
 `USSeriesCode` varchar(2) DEFAULT NULL,
 `Abstract` text,
 `ClaimsNum` int(11) DEFAULT NULL,
 `DrawingsNum` int(11) DEFAULT NULL,
 `FiguresNum` int(11) DEFAULT NULL,
 `ApplicationID` varchar(20) NOT NULL,
 `Claims` text,
 `FileDate` date DEFAULT NULL,
 `AppType` varchar(45) DEFAULT NULL,
 `AppNoOrig` varchar(10) DEFAULT NULL,
 `SourceName` varchar(100) DEFAULT NULL,
 PRIMARY KEY (`GrantID`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8

Thank you very much for your time, unfortunately I must retire to my bed as it is far too late (or early at this point) for me to continue working at this for now)



One suggestion was to change it into 1 query instead of having a subquery:

  SELECT 
        gracit_g.citedID,
        info_grant.GrantID,
        info_grant.FileDate,
        info_grant.IssueDate,
        info_grant.Kind,
        info_grant.ApplicationID,
        assignee_g.OrgName,
        GROUP_CONCAT(DISTINCT CONCAT_WS(', ', assignee_g.City, assignee_g.State, assignee_g.Country) separator ';') as Assignee,
        GROUP_CONCAT(DISTINCT CONCAT_WS(', ', inventor_g.FirstName, inventor_g.LastName) separator ';') as Inventor,
        GROUP_CONCAT(DISTINCT CONCAT_WS(', ', inventor_g.City, inventor_g.State, inventor_g.Country) separator ';') as Inventor_address,
        GROUP_CONCAT(DISTINCT CONCAT_WS(', ', usclass_g.Class, usclass_g.Subclass) separator ';') as USClass,
        intclass_g.Section,
        intclass_g.Class,
        intclass_g.Subclass,
        intclass_g.MainGroup,
        intclass_g.SubGroup
FROM
    gracit_g, grants as info_grant, assignee_g, inventor_g, usclass_g, intclass_g
WHERE
        gracit_g.GrantID IN (*KEYS*)
        and info_grant.GrantID = gracit_g.GrantID
        and info_grant.GrantID = assignee_g.GrantID
        and info_grant.GrantID = inventor_g.GrantID
        and info_grant.GrantID = usclass_g.GrantID
        and info_grant.GrantID = intclass_g.GrantID
GROUP BY gracit_g.citedID, info_grant.GrantID
LIMIT 50000000

This has cut it down from 21s duration/10s fetch to 13s duration/8s fetch which I would still want to improve as I have many Keys to pass through.

like image 957
Jonathan Boisvert Avatar asked Nov 02 '22 20:11

Jonathan Boisvert


1 Answers

Your query is in the form:

SELECT some_fields
FROM (
    SELECT other_fields
    FROM table1, table2
    WHERE join_condition_table1_table2 AND some_other_condition
) AS subquery, table3
WHERE join_condition_subquery_table3
GROUP BY another_field

You need to rewrite it as follows:

SELECT some_fields
FROM table1, table2, table3
WHERE
    join_condition_table1_table2
    AND join_condition_subquery_table3 -- actually rewrite this ans a join of either table1 and table3, or table2 and table3
    AND some_other_condition
GROUP BY another_field

As pointed out by @Ollie Jones, it is dangerous to select fields (in the SELECT clause) which are neither part of the GROUP BY condition nor inside an agregating function. If these fields do not uniquely depend on the fields in the GROUP BY condition, the values of these fields are undefined.

[edit]

A few more suggestions:

  • add an index on gracit_g(citedID, GrantID) in this order (ALTER TABLE gracit_g ADD INDEX(citedID, GrantID);) and change your GROUP BY clause to GROUP BY gracit_g.citedID, gracit_g.GrantID. The optimizer may fancy using this index to compute the GROUP BY clause.

  • if your VARCHAR primary keys are actually numbers, change their type to a suitable integer type. If not, add a numeric surrogate key and use it as a primary key. Integer comparisions are much quicker, and you do a lot of comparisions in all your joins.

  • precompute the concatenated values like CONCAT_WS(', ', assignee_g.City, assignee_g.State, assignee_g.Country) separator ';') in an extra column, or an extra table (the latter will then require an extra join per table)

  • increase the tmp_table_sizeand max_heap_table_size server options. If the temporary table grows larger than either of these two values (in bytes) then the temporary table cannot be held in memory and will be written to disk. You may benefit from unusually large values here, since you are dealing with an unusually large result set.

I don't know if there is anything else to be done. You might need to consider returning a smaller result set (less columns, or more filters, or a smaller LIMIT).

like image 90
RandomSeed Avatar answered Nov 11 '22 21:11

RandomSeed