
.NET and Hadoop - What should I know / learn and what is available? [closed]

Information

My question is about Big Data in .NET. Big Data technologies are used to store and query huge amounts of data (Facebook, Google, Twitter, ...). Examples are MapReduce, Hadoop, Dryad, etc.

Microsoft dropped their Dryad (DryadLinq) alternative in favor of Hadoop (Dryad and the article), so I'd like to prepare myself for it and everything that has to do with it.

What I already know

What is available now?

Hadoop Connector

SQL Server 2012 RC (don't use in production :))

Microsoft Information on Big Data

What should I know more about releases and development?

Register on the TechPreview

Questions

Question 1: What should I know about Hadoop that isn't unique to the .NET platform (how to query, specific patterns, architecture, ...) and that will be useful in a .NET environment?

Question 2: Is there more information on Hadoop on the .NET platform than what I already know?

NicoJuicy asked Nov 24 '11 12:11



1 Answer

It's a vague question, so here's a vague answer :)

Hadoop on its own is a tool to run map-reduce jobs in a cluster. It's highly optimized for performance, and a good deal of this optimization comes from distributing the data in a way that makes it easy to consume without incurring I/O penalties.

For this you should read about HDFS and the internals that explain how this is done. In a nutshell, the input data is split into blocks that are spread across the nodes, so that each process can run locally on the node holding its data and read it sequentially (this is a property/limitation of HDFS).
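To make the data-locality idea a bit more concrete, here's a toy sketch in plain Python. It is not the HDFS API: the block size, the node names and the round-robin placement policy are all assumptions made up for illustration, and replication is left out entirely. It only shows the principle of splitting the input into blocks and giving each node the blocks it already stores.

```python
from collections import defaultdict

# Tiny on purpose; real HDFS blocks are tens of megabytes or more.
BLOCK_SIZE = 16  # bytes (illustrative value, not an HDFS default)


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Cut the input into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks: list, nodes: list) -> dict:
    """Hypothetical placement policy: block index -> node, round-robin."""
    return {i: nodes[i % len(nodes)] for i in range(len(blocks))}


def schedule_local_tasks(placement: dict) -> dict:
    """Group block indices by the node that stores them, so every node
    only processes blocks it can read from its own disk."""
    tasks = defaultdict(list)
    for block_id, node in placement.items():
        tasks[node].append(block_id)
    return dict(tasks)


data = b"a long input file that is far too big for one machine"
nodes = ["node-a", "node-b", "node-c"]  # made-up node names

blocks = split_into_blocks(data)
placement = place_blocks(blocks, nodes)
tasks = schedule_local_tasks(placement)

for node, block_ids in tasks.items():
    print(node, "processes blocks", block_ids)
```

The real system also replicates blocks and falls back to remote reads when a local slot isn't free, so treat this as the idea only.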

This way you input your "BigData" and it gets split and processed in the most efficient way inside the cluster.
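If you've never seen map-reduce, the classic word count shows the shape of a job. The sketch below is plain Python that runs in one process, so it only imitates the three phases (map, shuffle, reduce); the function names are mine, not Hadoop's, and a real job would run the map and reduce steps on many nodes in parallel.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line: str) -> list:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs) -> dict:
    """Shuffle: collect all values that share a key into one list."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list) -> tuple:
    """Reduce: collapse the list of counts for one word into a total."""
    return key, sum(values)


def word_count(lines: list) -> dict:
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big cluster", "data moves to the cluster"])
print(counts)
```

The point of the split is that each map call only needs its own line and each reduce call only needs one key's values, which is what lets the framework spread the work over the cluster.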

Now that's all there is to Hadoop itself; there are tools that work on top of it that let you work with the data through higher-level abstractions (map-reduce is among the simplest procedures).

Those include:

  • Pig http://pig.apache.org/ which is a language to work with the map-reduce process and construct more complex operations
  • Hive http://hive.apache.org/ similar to the previous but more SQL-oriented
  • Cascading http://www.cascading.org/ yet another, more focused on data flow than queries
  • Cascalog https://github.com/nathanmarz/cascalog based on Cascading, written in Clojure
  • HBase http://hbase.apache.org/ a type of NoSQL database on top of HDFS
  • ElephantDB https://github.com/nathanmarz/elephantdb another NoSQL database for Hadoop

Specifics for .Net

For Hadoop on Azure (.Net), there's an introduction on MSDN here with more info here, related to building Hadoop applications through their platform. It's only a CTP for now, but of course this will change.

Here's another good blogpost about Hadoop and MapReduce with code

Additionally, there's a company that frequently publishes information about Hadoop: Cloudera. Check the Cloudera page linked above regularly; it covers all the concepts about Hadoop (it's pretty advanced, though).

I'm pretty sure this isn't what you were looking for, but I've no idea what you want, so at least I hope you can check out a few new projects that may help.

Also check Storm (https://github.com/nathanmarz/storm): it's not related to Hadoop, but it handles realtime scenarios, which Hadoop is not suited for.

Samus_ answered Oct 03 '22 00:10
