SQL address data is messy, how to clean it up in a query?

I have address data stored in a SQL Server 2000 database, and I need to pull out all the addresses for a given customer code. The problem is that many addresses are misspelled, some are missing parts, and so on, so I need to clean the data up somehow. I need to weed out the bad spellings and missing parts and come up with the "average" record. For example, if New York is spelled properly in 4 out of 5 records, that should be the value returned.

I can't modify the data, validate it on input, or anything like that. I can only modify a copy of the data, or manipulate it through a query.

I got a partial answer in "Addresses stored in SQL server have many small variations(errors)", but I need to allow for multiple valid addresses per code.

Sample Data

Code    Name                       Address1                      Address2           City            State          Zip     TimesUsed
10003   AMERICAN NUTRITON INC     2183 BALL STREET                                 OLDEN           Utah           87401     177
10003   AMEICAN NUTRITION INC     2183 BALL STREET              PO BOX 1504        OLDEN           Utah           87402     76
10003   AMERICAN NUTRITION INC    2183 BALL STREET                                 OLDEN           Utah           87402     24
10003   AMERICAN NUTRITION INC    2183 BALL STREET              PO BOX 1504        OLDEN           Utah           87402     17
10003   Samantha Brooks           506 S. Main Street                               Ellensburg      Washington     98296     1
10003   BEMIS COMPANY             1401 W. FOURTH PLAIN BLVD.                       VANCOUVER       Washington     98660     1
10003   CEI                       597 VANDYRE BOULEVARD                            WRIGHTSTOWN     Wisconsin      54180     1
10003   Pacific Pet               28th Avenue                                      OLDEN           Utah           84401     1
10003   PETSMART, INC.            16091 NORTH 25TH STREET                          PHOENA          Arizona        85027     1
10003   THE PET FIRM              16418 NORTH 37TH STREET                          PHOENA          Arizona        85503     1

Desired Output

Code    Name                      Address1                      Address2           City            State          Zip     
10003   AMERICAN NUTRITION INC    2183 BALL AVENUE                                 Olden           Utah           84401
10003   Samantha Brooks           506 S. Main Street                               Ellensburg      Washington     98296
10003   BEMIS COMPANY             1401 W. FOURTH PLAIN BLVD.                       VANCOUVER       Washington     98660
10003   CEI                       975 VANDYKE ROAD                                 WRIGHTSTOWN     Wisconsin      54180
10003   Pacific Pet               29th Street                                      OGDEN           Utah           84401
10003   PETSMART, INC.            16091 NORTH 25TH AVENUE                          PHOENA          Arizona        85027
10003   THE PET FIRM              16418 NORTH 37TH STREET                          PHOENA          Arizona        85503
MAW74656 asked Feb 09 '11 22:02

2 Answers

The best solution is to use a CASS-certified address standardization program or service that will format and validate the address. Besides the USPS, which has tools for this, many third-party programs and services provide this functionality. Address parsing is far more complicated than you might imagine, so trying to whip up a few queries to do it will be fraught with peril.

Google's Geocoding service is another place to look. Apparently, though, Google requires you to display the results in order to use the service. That leaves dedicated address parsers, such as the USPS tools or a third-party program.

Thomas answered Nov 15 '22 04:11


If you group by soundex(Name), you get a result like the one below. You will have to test this on your own data to see whether it helps in your situation. I cannot test this on SQL Server 2000, so I am not sure whether soundex is available there.

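-- Sample rows from the question, loaded into a table variable
declare @T table (Code char(5), Name varchar(50), Address1 varchar(50))

-- INSERT ... SELECT ... UNION ALL is used because multi-row VALUES
-- lists require SQL Server 2008 or later
insert into @T
select '10003', 'AMERICAN NUTRITON INC',  '2183 BALL STREET' union all
select '10003', 'AMEICAN NUTRITION INC',  '2183 BALL STREET' union all
select '10003', 'AMERICAN NUTRITION INC', '2183 BALL STREET' union all
select '10003', 'AMERICAN NUTRITION INC', '2183 BALL STREET' union all
select '10003', 'Samantha Brooks',        '506 S. Main Street' union all
select '10003', 'BEMIS COMPANY',          '1401 W. FOURTH PLAIN BLVD.' union all
select '10003', 'CEI',                    '597 VANDYRE BOULEVARD' union all
select '10003', 'Pacific Pet',            '28th Avenue' union all
select '10003', 'PETSMART, INC.',         '16091 NORTH 25TH STREET' union all
select '10003', 'THE PET FIRM',           '16418 NORTH 37TH STREET'

-- Names that sound alike share a SOUNDEX code, so each group collapses
-- to one row; min() simply picks the alphabetically first value
select
  min(Code) as Code,
  min(Name) as Name,
  min(Address1) as Address1
from @T
group by soundex(Name)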
________________________________________________________
Code  Name                    Address1
10003 AMEICAN NUTRITION INC   2183 BALL STREET
10003 AMERICAN NUTRITION INC  2183 BALL STREET
10003 BEMIS COMPANY           1401 W. FOURTH PLAIN BLVD.
10003 CEI                     597 VANDYRE BOULEVARD
10003 Pacific Pet             28th Avenue
10003 PETSMART, INC.          16091 NORTH 25TH STREET
10003 Samantha Brooks         506 S. Main Street
10003 THE PET FIRM            16418 NORTH 37TH STREET
Mikael Eriksson answered Nov 15 '22 05:11
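
A possible refinement of the soundex grouping above, offered only as a sketch: instead of taking min(Name), keep the spelling with the highest total TimesUsed inside each soundex group, which is closer to the "4 out of 5 records" rule in the question. The TimesUsed column comes from the question's sample data; the aliases n, m, Sx and Uses are made up for illustration. The query avoids ROW_NUMBER() and CTEs, which SQL Server 2000 does not have, but it has not been run on that version.

```sql
-- Sketch only: a few sample rows from the question, including TimesUsed
declare @T table (Code char(5), Name varchar(50), TimesUsed int)

insert into @T
select '10003', 'AMERICAN NUTRITON INC',  177 union all
select '10003', 'AMEICAN NUTRITION INC',  76  union all
select '10003', 'AMERICAN NUTRITION INC', 24  union all
select '10003', 'AMERICAN NUTRITION INC', 17  union all
select '10003', 'Pacific Pet',            1

-- n: total usage of every distinct spelling, tagged with its SOUNDEX code
select n.Code, n.Name, n.Uses
from (
  select Code, soundex(Name) as Sx, Name, sum(TimesUsed) as Uses
  from @T
  group by Code, soundex(Name), Name
) as n
-- keep only the spelling(s) whose total equals the best total in the group
where n.Uses = (
  select max(m.Uses)
  from (
    select Code, soundex(Name) as Sx, sum(TimesUsed) as Uses
    from @T
    group by Code, soundex(Name), Name
  ) as m
  where m.Code = n.Code
    and m.Sx = n.Sx
)
```

Frequency is only a proxy for correctness. In the sample data the most-used spelling is 'AMERICAN NUTRITON INC' (177 uses), so the misspelling would win over the correct one. Spellings that tie on usage are all returned, so a group can still produce more than one row.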