
Acquiring basic skills working with visualizing/analyzing large data sets [closed]

I'm looking for a way to get comfortable working with large data sets. I'm a university student, so everything I do is of "nice" size and complexity. I'm working on a research project with a professor this semester, and I've had to visualize relationships within a somewhat large (in my experience) data set: a 15 MB CSV file.

I did most of my data wrangling in Python and visualized the results with gnuplot.

Are there any accessible books or websites on the subject out there? Bonus points for using Python, more bonus points for a more "basic" visualization system than relying on gnuplot. Cairo or something, I suppose.

Looking for something that takes me from data mining, to processing, to visualization.

EDIT: I'm really looking for something that will teach me the "big ideas". I can write the code myself, but I want to learn the techniques people use to deal with large data sets. My 15 MB is small enough that I can put everything I would ever need into memory and just start crunching. What do people do to visualize 5 GB data sets?

Daniel Harms asked May 04 '11 23:05




2 Answers

I'd say the most basic skill is a good grounding in math and statistics. This can help you assess and pick from the variety of techniques for filtering data, and reducing its volume and dimensionality while keeping its integrity. The last thing you'd want to do is make something pretty that shows patterns or relationships which aren't really there.
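As one illustration of reducing volume without distorting the data, here is a minimal sketch of reservoir sampling, which draws a uniform random subset from a file too large to load at once. The function name `reservoir_sample`, the column names, and the in-memory stand-in for the CSV file are all assumptions made for the example, not anything from the question.

```python
import csv
import io
import random


def reservoir_sample(rows, k, seed=None):
    """Return up to k rows chosen uniformly at random from an iterable.

    Only k rows are held in memory at any time, so the input can be
    a stream of unknown (and very large) length.
    """
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


# Hypothetical stand-in for a large CSV file on disk.
fake_csv = io.StringIO(
    "x,y\n" + "\n".join(f"{i},{i * i}" for i in range(10000))
)
reader = csv.DictReader(fake_csv)
subset = reservoir_sample(reader, k=100, seed=42)
print(len(subset))
```

Because every row has the same chance of ending up in the subset, summary statistics and plots of the sample should resemble those of the full file, which is the "keeping its integrity" part. Whether a uniform sample is appropriate depends on the data; rare but important rows may need a different strategy.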

Specialized math

To tackle some types of problems you'll need to learn some math to understand how particular algorithms work and what effect they'll have on your data. There are various algorithms for clustering data, dimensionality reduction, natural language processing, etc. You may never use many of these, depending on the type of data you wish to analyze, but there are abundant resources on the Internet (and Stack Exchange sites) should you need help.

For an introductory overview of data mining techniques, Witten's Data Mining is good. I have the 1st edition, and it explains concepts in plain language with a bit of math thrown in. I recommend it because it provides a good overview and it's not too expensive. As you read more into the field, you'll notice many of the books are quite expensive. The only drawback is the number of pages dedicated to using WEKA, a Java data mining package, which might not be too helpful as you're using Python (but it is open source, so you may be able to glean some ideas from the source code). I also found Introduction to Machine Learning to provide a good overview; it is also reasonably priced and has a bit more math.

Tools

For creating visualizations of your own invention, on a single machine, I think the basics should get you started: Python, Numpy, Scipy, Matplotlib, and a good graphics library you have experience with, like PIL or Pycairo. With these you can crunch numbers, plot them on graphs, and pretty things up via custom drawing routines.
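For data that does not fit in memory, one common idea is to aggregate while streaming and plot the aggregate instead of the raw points. The sketch below, in plain Python, counts points into a fixed grid so that only the grid is ever held in memory. The names `bin_points` and `synthetic_points` and the grid size are hypothetical choices for the example.

```python
def bin_points(points, x_range, y_range, nx, ny):
    """Count (x, y) points into an ny-by-nx grid, one point at a time."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    grid = [[0] * nx for _ in range(ny)]
    for x, y in points:
        # Skip points that fall outside the plotted area.
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue
        # Map each coordinate to a cell index; clamp the upper edge
        # so that x == x_max lands in the last cell.
        col = min(int((x - x_min) / (x_max - x_min) * nx), nx - 1)
        row = min(int((y - y_min) / (y_max - y_min) * ny), ny - 1)
        grid[row][col] += 1
    return grid


def synthetic_points(n):
    """Hypothetical stand-in for rows streamed from a large file."""
    for i in range(n):
        yield (i % 100) / 100.0, (i % 7) / 7.0


grid = bin_points(synthetic_points(100000), (0.0, 1.0), (0.0, 1.0), nx=10, ny=7)
print(sum(map(sum, grid)))
```

The resulting grid is small no matter how many points were read, and it can be drawn as a heat map, for example with Matplotlib's `imshow`. For arrays that do fit in memory, NumPy's `histogram2d` performs the same kind of binning.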

When you want to create moving, interactive visualizations, tools like the Java-based Processing library make this easy. There are even ways of writing Processing sketches in Python via Jython, in case you don't want to write Java.

There are many more tools out there, should you need them, like OpenCV (computer vision, machine learning), Orange (data mining, analysis, viz), and NLTK (natural language, text analysis).

Presentation principles and techniques

Books by folks in the field like Edward Tufte and references like Information Graphics can help you get a good overview of the ways of creating visualizations and presenting them effectively.

Resources to find Viz examples

Websites like Flowing Data, Infosthetics, Visual Complexity and Information is Beautiful show recent, interesting visualizations from across the web. You can also look through the many compiled lists of visualization sites out there on the Internet. Start with these as a seed and start navigating around; I'm sure you'll find a lot of useful sites and inspiring examples.

(This was originally going to be a comment, but grew too long)

samplebias answered Nov 05 '22 07:11


Check out Information is Beautiful. It is not a technical book, but it might give you a couple of ideas for visualising data.

Also have a look at the first three chapters of Principles of Data Mining, which go through some concepts of visualizing data in a data mining context. I found some parts of it useful during university.

Hope this helps

Marcom answered Nov 05 '22 08:11