
Hadoop and Stata

Does anyone have any experience using Stata and Hadoop? Stata 13 now has a Java Plugin API, so I think it should be straightforward to get them to play nice.

I am particularly interested in being able to parse weblog data to get it into a form suitable for statistical analysis.
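To make the goal concrete, here is a minimal sketch of the kind of parsing meant, independent of Hadoop or Stata. It assumes the weblogs are in Apache Common Log Format, which is only an assumption for illustration; the function names `parse_line` and `logs_to_csv` are hypothetical.

```python
import csv
import io
import re

# Assumption: lines follow Apache Common Log Format, e.g.
# host ident user [time] "request" status size
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

FIELDS = ["host", "user", "time", "method", "path", "status", "size"]


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = CLF_PATTERN.match(line)
    if m is None:
        return None
    # The request field normally looks like: METHOD PATH PROTOCOL
    parts = m.group("request").split()
    size = m.group("size")
    return {
        "host": m.group("host"),
        "user": m.group("user"),
        "time": m.group("time"),
        "method": parts[0] if len(parts) > 0 else "",
        "path": parts[1] if len(parts) > 1 else "",
        "status": int(m.group("status")),
        # A "-" size means no bytes were sent; store it as 0.
        "size": 0 if size == "-" else int(size),
    }


def logs_to_csv(lines):
    """Convert an iterable of log lines to CSV text, skipping unparseable lines."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for line in lines:
        row = parse_line(line)
        if row is not None:
            writer.writerow(row)
    return out.getvalue()
```

The resulting CSV text could be written to a file and read into Stata with `import delimited`; at Hadoop scale the same per-line function would run inside a map step instead.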

This question came up on Statalist recently, but there was no response, so I thought I would try it here where the audience is more likely to have experience with this technology.

asked Oct 03 '13 17:10 by dimitriy




1 Answer

Dimitriy,

I think it would be easier to do something like this using the ELK Stack (http://www.elastic.co). Logstash (the middle layer) has several parsers and filters for cleaning and formatting log data and can push the resulting data into Elasticsearch, whose tokenizers and analyzers are built on the Apache Lucene engine. Elasticsearch exposes an HTTP API that you can query fairly easily with curl to get results. For example, use insheetjson and pass the HTTP GET request as the URL, and the data should be imported into Stata without much problem.
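If you would rather flatten the JSON outside Stata first, something along these lines might work. This is only a sketch: it assumes the standard Elasticsearch search response layout, where documents sit under `hits.hits[]._source`, and the helper names and the `status`/`path` fields in the usage example are made up for illustration. It does not make any HTTP request itself; you would pass in the response body you got back from curl.

```python
import csv
import io
import json


def hits_to_rows(response_text):
    """Extract the _source document of every hit in a search response body."""
    response = json.loads(response_text)
    hits = response.get("hits", {}).get("hits", [])
    return [hit.get("_source", {}) for hit in hits]


def rows_to_csv(rows):
    """Write flat dicts to CSV text; columns are the sorted union of all keys."""
    fieldnames = sorted({key for row in rows for key in row})
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # keys missing from a row are left empty
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical response body with made-up fields.
    body = (
        '{"hits": {"hits": ['
        '{"_source": {"status": 200, "path": "/a"}},'
        '{"_source": {"status": 404, "path": "/b"}}'
        ']}}'
    )
    print(rows_to_csv(hits_to_rows(body)))
```

Nested fields inside `_source` are not flattened here, so this only suits documents whose fields are already scalar.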

I've been trying to cobble together a program to use the Jackson JSON library to build out more robust JSON I/O capabilities from within Stata and would definitely not mind trying to work with others to get it done.

Hope this helps, Billy

answered Sep 30 '22 13:09 by BBuchanan
