I'm currently working on a project involving Patents pulled the USPTO website, as part of this project I'm using the database created by people from the University of Illinois
(Paper: http://abel.lis.illinois.edu/UPDC/USPTOPatentsDatabaseConstruction.pdf)
(Slightly outdated Schema of the Tables I'm using, only missing non-index/key values: http://i.imgur.com/44LHS3L.png)
Now as the title says I'm trying to optimize a query:
SELECT
PN,
AN,
grants.GrantID,
grants.FileDate,
grants.IssueDate,
grants.Kind,
grants.ApplicationID,
assignee_g.OrgName,
GROUP_CONCAT(DISTINCT CONCAT_WS(', ', assignee_g.City, assignee_g.State, assignee_g.Country) separator ';') as Assignee,
GROUP_CONCAT(DISTINCT CONCAT_WS(', ', inventor_g.FirstName, inventor_g.LastName) separator ';') as Inventor,
GROUP_CONCAT(DISTINCT CONCAT_WS(', ', inventor_g.City, inventor_g.State, inventor_g.Country) separator ';') as Inventor_address,
GROUP_CONCAT(DISTINCT CONCAT_WS(', ', usclass_g.Class, usclass_g.Subclass) separator ';') as USClass,
intclass_g.Section,
intclass_g.Class,
intclass_g.Subclass,
intclass_g.MainGroup,
intclass_g.SubGroup
FROM
(
SELECT grants.GrantID as CitingID, CitedID as PN, grants2.ApplicationID AS AN
FROM
gracit_g, grants, grants as grants2
Where
grants.GrantID IN (*A
couple
Thousand
keys*)
and grants.GrantID = gracit_g.GrantID and grants2.GrantID = CitedID
LIMIT 500000) tbl1,
grants, assignee_g, inventor_g, usclass_g, intclass_g
WHERE
grants.GrantID = tbl1.CitingID
and grants.GrantID = assignee_g.GrantID
and grants.GrantID = inventor_g.GrantID
and grants.GrantID = usclass_g.GrantID
and grants.GrantID = intclass_g.GrantID
GROUP BY PN, GrantID
LIMIT 50000000
Pretty much for each Patent Being cited by one that came after it I want to record the information of the patent citing it. The problem I seem to be running into is that my "GROUP BY PN, GrantID" is causing "Using Temporary, Using Filesort" which is severely slowing down my endeavours.
This is what my explain gave to me (sorry if it's not perfectly formatted, I couldn't find how to make a table)
1
PRIMARY
derived2
ALL
8716
possible key: null
key: null
key_len: null
ref: null
Using temporary; Using filesort
1
PRIMARY
grants
eq_ref
PRIMARY
PRIMARY
62
tbl1.CitingID
1
1
PRIMARY
assignee_g
ref
PRIMARY,FK_PublicationID_PUBLICATION_ASSIGNEE_P
PRIMARY
62
tbl1.CitingID
1
1
PRIMARY
intclass_g
ref
PRIMARY,fk_publicationid_PUBLICATION_INTERNATIONALCLASS_P
PRIMARY
62
tbl1.CitingID
1
1
PRIMARY
inventor_g
ref
PRIMARY,fk_PublicationID_Inventor_p
PRIMARY
62
tbl1.CitingID
1
1
PRIMARY
usclass_g
ref
PRIMARY,fk_publicationid_PUBLICATION_USCLASS_P
PRIMARY
62
tbl1.CitingID
2
2
DERIVED
grants
range
PRIMARY
PRIMARY
62
ref: null
2179
Using where; Using index
2
DERIVED
gracit_g
ref
PRIMARY,FK_PublicationID_PUBLICATION_PCITATION_P,CitedID
PRIMARY
62
uspto_patents.grants.GrantID
4
Using where
2
DERIVED
grants2
eq_ref
PRIMARY
PRIMARY
62
uspto_patents.gracit_g.CitedID
1
The SHOW CREATE for gracit_g is:
CREATE TABLE `gracit_g` (
`GrantID` varchar(20) NOT NULL,
`Position` int(11) NOT NULL,
`CitedID` varchar(20) DEFAULT NULL,
`Kind` varchar(10) DEFAULT NULL COMMENT 'identify whether citedDoc is a document or foreign patent',
`Name` varchar(100) DEFAULT NULL,
`Date` date DEFAULT NULL,
`Country` varchar(100) DEFAULT NULL,
`Category` varchar(100) DEFAULT NULL,
PRIMARY KEY (`GrantID`,`Position`),
KEY `FK_PublicationID_PUBLICATION_PCITATION_P` (`GrantID`),
KEY `CitedID` (`CitedID`),
CONSTRAINT `FK_GrantID_GRANT_PCITATION_G0` FOREIGN KEY (`GrantID`) REFERENCES `grants` (`GrantID`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8
The SHOW CREATE for grants is:
CREATE TABLE `grants` (
`GrantID` varchar(20) NOT NULL,
`Title` varchar(500) DEFAULT NULL,
`IssueDate` date DEFAULT NULL,
`Kind` varchar(2) DEFAULT NULL,
`USSeriesCode` varchar(2) DEFAULT NULL,
`Abstract` text,
`ClaimsNum` int(11) DEFAULT NULL,
`DrawingsNum` int(11) DEFAULT NULL,
`FiguresNum` int(11) DEFAULT NULL,
`ApplicationID` varchar(20) NOT NULL,
`Claims` text,
`FileDate` date DEFAULT NULL,
`AppType` varchar(45) DEFAULT NULL,
`AppNoOrig` varchar(10) DEFAULT NULL,
`SourceName` varchar(100) DEFAULT NULL,
PRIMARY KEY (`GrantID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Thank you very much for your time, unfortunately I must retire to my bed as it is far too late (or early at this point) for me to continue working at this for now)
One suggestion was to change it into 1 query instead of having a subquery:
SELECT
gracit_g.citedID,
info_grant.GrantID,
info_grant.FileDate,
info_grant.IssueDate,
info_grant.Kind,
info_grant.ApplicationID,
assignee_g.OrgName,
GROUP_CONCAT(DISTINCT CONCAT_WS(', ', assignee_g.City, assignee_g.State, assignee_g.Country) separator ';') as Assignee,
GROUP_CONCAT(DISTINCT CONCAT_WS(', ', inventor_g.FirstName, inventor_g.LastName) separator ';') as Inventor,
GROUP_CONCAT(DISTINCT CONCAT_WS(', ', inventor_g.City, inventor_g.State, inventor_g.Country) separator ';') as Inventor_address,
GROUP_CONCAT(DISTINCT CONCAT_WS(', ', usclass_g.Class, usclass_g.Subclass) separator ';') as USClass,
intclass_g.Section,
intclass_g.Class,
intclass_g.Subclass,
intclass_g.MainGroup,
intclass_g.SubGroup
FROM
gracit_g, grants as info_grant, assignee_g, inventor_g, usclass_g, intclass_g
WHERE
gracit_g.GrantID IN (*KEYS*)
and info_grant.GrantID = gracit_g.GrantID
and info_grant.GrantID = assignee_g.GrantID
and info_grant.GrantID = inventor_g.GrantID
and info_grant.GrantID = usclass_g.GrantID
and info_grant.GrantID = intclass_g.GrantID
GROUP BY gracit_g.citedID, info_grant.GrantID
LIMIT 50000000
This has cut it down from 21s duration/10s fetch to 13s duration/8s fetch which I would still want to improve as I have many Keys to pass through.
Your query is in the form:
SELECT some_fields
FROM (
SELECT other_fields
FROM table1, table2
WHERE join_condition_table1_table2 AND some_other_condition
) AS subquery, table3
WHERE join_condition_subquery_table3
GROUP BY another_field
You need to rewrite it as follows:
SELECT some_fields
FROM table1, table2, table3
WHERE
join_condition_table1_table2
AND join_condition_subquery_table3 -- actually rewrite this ans a join of either table1 and table3, or table2 and table3
AND some_other_condition
GROUP BY another_field
As pointed out by @Ollie Jones, it is dangerous to select fields (in the SELECT
clause) which are neither part of the GROUP BY
condition nor inside an agregating function. If these fields do not uniquely depend on the fields in the GROUP BY
condition, the values of these fields are undefined.
[edit]
A few more suggestions:
add an index on gracit_g(citedID, GrantID)
in this order (ALTER TABLE gracit_g ADD INDEX(citedID, GrantID);
) and change your GROUP BY
clause to GROUP BY gracit_g.citedID, gracit_g.GrantID
. The optimizer may fancy using this index to compute the GROUP BY
clause.
if your VARCHAR
primary keys are actually numbers, change their type to a suitable integer type. If not, add a numeric surrogate key and use it as a primary key. Integer comparisions are much quicker, and you do a lot of comparisions in all your joins.
precompute the concatenated values like CONCAT_WS(', ', assignee_g.City, assignee_g.State, assignee_g.Country) separator ';')
in an extra column, or an extra table (the latter will then require an extra join per table)
increase the tmp_table_size
and max_heap_table_size
server options. If the temporary table grows larger than either of these two values (in bytes) then the temporary table cannot be held in memory and will be written to disk. You may benefit from unusually large values here, since you are dealing with an unusually large result set.
I don't know if there is anything else to be done. You might need to consider returning a smaller result set (less columns, or more filters, or a smaller LIMIT
).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With