What are the differences between structured data and unstructured data? How does that difference affect the respective data mining approaches?
The terms I am familiar with are structured and unstructured data (same as what's in your Q except for the suffix).
I work with both types of data in machine learning and I am not aware of any formal definition; however, I suspect that nearly everyone whose work requires a distinction between these two types of data has no trouble distinguishing them.
Examples of structured data: the date/time at which an email was sent, whether it has an attachment, and the email sender. Unstructured data: the body of the email.
Is there a stable rule or set of rules to distinguish these two types of data? I think so. First, if you can build a parser for the data element, then it's structured.
Another rule of thumb is to look at the data type of the database field required to store the data. If it is a text type (for MySQL: TINYTEXT, TEXT, MEDIUMTEXT, or LONGTEXT; or, less likely, VARCHAR(255)), then that data is probably unstructured.
The principal significance of this distinction for data mining is probably this: structured data, once extracted from the document and parsed, can be used as variables in a statistical/machine learning model. Unstructured data, however, requires further parsing--i.e., before you can use it in modeling you first have to decompose it into a set of structured data elements--e.g., number of words, etc.
For instance, suppose you want to build a knowledge management (KM) system for a server group within a company that makes online MMORPGs. You might begin with the massive collection of email messages exchanged between the members of this group.
So you create a data model for this source--e.g., composed of fields like 'sender', 'recipient', 'date/time sent', whether the recipient and sender were both employees of the server group, whether the message was copied to others, etc. The rows of the database are the individual emails.
Then you write a script composed of a set of parsers to extract each field from each email message. For many fields, this is simple, e.g., for the 'cc:' field, you write a parser to scan that portion of the email message and check whether it is empty--if it is, then that field in your database for that row might be filled with 'False' (to indicate that no persons are copied), otherwise, 'True'. Likewise for date/time, which is probably in some form like: 16 Mar 2011 18:45:39.0319 (UTC). Extracting and parsing this data is likewise straightforward; in fact, your scripting language almost certainly has a module to do it.
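To make the header-parsing step concrete, here is a minimal sketch in Python using the standard library's email package. The sample message, the field names, and the function name parse_structured_fields are all made up for illustration, and the Date header here is in plain RFC 2822 form; a timestamp with fractional seconds like the one above would likely need some extra cleanup first.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical sample message, invented for this illustration.
RAW_EMAIL = """\
From: alice@example.com
To: bob@example.com
Cc: carol@example.com
Date: Wed, 16 Mar 2011 18:45:39 +0000
Subject: Server restart

The shard will restart at midnight.
"""

def parse_structured_fields(raw):
    """Extract a few structured fields from one raw email message."""
    msg = message_from_string(raw)
    cc = msg.get("Cc", "")  # empty string when there is no Cc header
    return {
        "sender": msg["From"],
        "recipient": msg["To"],
        "is_copied": bool(cc.strip()),  # True only if someone was copied
        "sent_at": parsedate_to_datetime(msg["Date"]),
    }

row = parse_structured_fields(RAW_EMAIL)
print(row)
```

Each returned dict corresponds to one row of the database described above.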
But when you get to the body of the email, while it's not difficult to extract from the rest of the email message, parsing it is not straightforward. Your data model might have fields for "NumberOfWords", "Keywords", etc. and it's simple to build a parser to populate those fields. The most useful information is more difficult though--i.e., was the email message helpful to the recipient? What was the subject? Is it authoritative?
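The easy part of the body, fields like "NumberOfWords" and "Keywords", might look something like the sketch below. The tokenizing regex, the tiny stopword list, and the notion of "keyword" as "most frequent non-stopword" are assumptions I picked for illustration; a real system would likely use something better.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "to", "of", "and", "is", "at", "will", "in"}

def body_features(body, top_n=3):
    """Decompose free text into a couple of structured fields."""
    words = re.findall(r"[a-z']+", body.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return {
        "NumberOfWords": len(words),
        # "Keywords" here just means the most frequent non-stopwords.
        "Keywords": [w for w, _ in counts.most_common(top_n)],
    }

features = body_features(
    "The shard will restart at midnight. Restart the login server after the shard."
)
print(features)
```

Note that nothing here touches the hard questions (was the message helpful? is it authoritative?); it only turns the text into countable fields.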
Data mining of unstructured data usually falls under the category of "text mining". There are two different opinions on this. One opinion says that you need specialized tools to perform Natural Language Processing (NLP), since that is the only way you can derive semantic meaning. The other approach transforms the unstructured data into word matrices and then uses standard statistical techniques to perform data mining ("bag of words"). In this case everything becomes data and the order of words is not important.
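As a rough illustration of the "bag of words" idea, here is a small sketch in plain Python. The function name bag_of_words and the whitespace tokenization are simplifications of my own; libraries built for this do considerably more.

```python
from collections import Counter

def bag_of_words(documents):
    """Build a document-by-word count matrix from a list of strings."""
    tokenized = [doc.lower().split() for doc in documents]
    # One column per distinct word, in alphabetical order.
    vocabulary = sorted({w for doc in tokenized for w in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        matrix.append([counts[w] for w in vocabulary])
    return vocabulary, matrix

# Same words, different order: the two rows come out identical.
vocab, matrix = bag_of_words(["server is down", "down is server"])
print(vocab)
print(matrix)
```

The two documents get the same row of counts, which is exactly the sense in which word order is thrown away.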
-Ralph Winters