 

Downloading the entire Bitcoin transaction chain with R

Tags: r, bitcoin

I'm pretty new here, so thank you in advance for the help. I'm trying to do some analysis of the entire Bitcoin transaction chain. In order to do that, I'm trying to create two tables:

1) A full list of all Bitcoin addresses and their balances, i.e.:

| ID | Address     | Balance  |
-------------------------------
| 1  | 7d4kExk...  | 32       |
| 2  | 9Eckjes...  | 0        |
| .  | ...         | ...      |

2) A record of the number of transactions that have ever occurred between any two addresses in the Bitcoin network:

| ID | Sender      | Receiver      | Transactions |
--------------------------------------------------
| 1  | 7d4kExk...  | klDk39D...    | 2            |
| 2  | 9Eckjes...  | 7d4kExk...    | 3            |
| .  | ...         | ...           | ..           |

To do this, I've written a (probably very inefficient) script in R that loops through every block and scrapes blockexplorer.com to compile the tables. I've tried running it a couple of times so far, but I'm running into two main issues:

1 - It's very slow... I can imagine it's going to take at least a week at the rate it's going.

2 - I haven't been able to run it for more than a day or two without it hanging. It seems to just freeze RStudio.

I'd really appreciate your help in two areas:

1 - Is there a better way to do this in R to make the code run significantly faster?

2 - Should I stop using R altogether for this and try a different approach?

Thanks in advance for the help! Please see below for the relevant chunks of code I'm using:

library(XML)  # provides readHTMLTable()

url_start <- "http://blockexplorer.com/b/"
url_end <- ""

readUrl <- function(url) {
  # Parse the first HTML table on the block page; return NA if the request
  # or the parse fails.
  table <- try(readHTMLTable(url)[[1]], silent = TRUE)
  if (inherits(table, "try-error")) {
    message(paste("URL does not seem to exist:", url))
    errors <<- errors + 1       # '<<-' updates the counter defined outside the function
    return(NA)
  } else {
    processed <<- processed + 1
    return(table)
  }
}

block_loop <- function(end, start = 0) {

...

  addr_row <- 1   # next row to fill in the address table
  links_row <- 1  # next row to fill in the links table

  for (i in start:end) {
    print(paste0("Reading block: ", i))
    url <- paste0(url_start, i, url_end)
    table <- readUrl(url)

    # readUrl() returns NA when a block page can't be read, so skip it
    if (!is.data.frame(table)) { next }

....
asked Aug 15 '13 by guayosr


1 Answer

There are very close to 250,000 blocks on the site you mentioned (at least, block 260,000 gives a 404). Curling from my connection (1 MB/s down) gives an average time of about half a second per page. Try it yourself from the command line (just copy and paste) to see what you get:

curl -s -w "%{time_total}\n" -o /dev/null http://blockexplorer.com/b/220000

I'll assume your requests are about as fast as mine. Half a second times 250,000 is 125,000 seconds, or about a day and a half. This is the absolute best you can get with any method, because you have to request each page.

Now, after doing an install.packages("XML"), I saw that running readHTMLTable("http://blockexplorer.com/b/220000") takes about five seconds on average. Five seconds times 250,000 is 1.25 million seconds, which is about two weeks. So your estimate was correct; this is really, really slow. For reference, I'm running a 2011 MacBook Pro with a 2.2 GHz Intel Core i7 and 8 GB of memory (1333 MHz).
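If you want to reproduce that measurement, here is a minimal sketch; it assumes the XML package is installed and that the block explorer URL still resolves:

library(XML)

# Time a single fetch-and-parse of one block page; "elapsed" is the number
# that matters for estimating the full run.
system.time(
  tbl <- readHTMLTable("http://blockexplorer.com/b/220000")[[1]]
)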

Next, table merges in R are quite slow. Assuming about 100 records per block (which seems about average), you'll end up with 25 million rows, and some of those rows hold a kilobyte of data. Even if you can fit that table in memory, concatenating the pieces will be a problem.
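If you do keep the merging step in R, one detail worth knowing is that growing a table with rbind() inside a loop copies the whole accumulated table on every pass. A small illustrative sketch (the function names here are mine, not from your script):

# Quadratic: each rbind() copies everything accumulated so far.
slow_build <- function(pieces) {
  out <- NULL
  for (p in pieces) out <- rbind(out, p)
  out
}

# Much faster: collect the pieces in a list and bind them once at the end.
fast_build <- function(pieces) {
  do.call(rbind, pieces)
}

The same idea applies to the address and links tables your script builds: store each block's rows as a list element and combine them once, rather than appending row by row.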

The solution to these problems that I'm most familiar with is to use Python instead of R, BeautifulSoup4 instead of readHTMLTable, and pandas to replace R's data frame. BeautifulSoup is fast (install lxml, a parser written in C) and easy to use, and pandas is very quick too. Its DataFrame class is modeled after R's data frame, so you can probably work with it just fine. If you need something to request URLs and return the HTML for BeautifulSoup to parse, I'd suggest Requests. It's lean and simple, and the documentation is good. All of these are pip-installable.

If you still run into problems, the only thing I can think of is to get maybe 1% of the data into memory at a time, statistically reduce it, and move on to the next 1%. If you're on a machine similar to mine, you may not have another option.
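A rough sketch of that idea, assuming your block_loop() returns the compiled table for the range it was given (the chunk size and file names here are just illustrative):

chunk_size <- 2500  # roughly 1% of ~250,000 blocks
for (chunk_start in seq(0, 250000, by = chunk_size)) {
  chunk <- block_loop(end = chunk_start + chunk_size - 1, start = chunk_start)
  # ...statistically reduce 'chunk' to just the aggregates you need...
  write.csv(chunk, sprintf("blocks_%06d.csv", chunk_start), row.names = FALSE)
  rm(chunk)
  gc()  # release the chunk's memory before starting the next one
}

That keeps only one chunk in memory at a time and leaves intermediate results on disk, which should also make it easier to resume after a crash.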

answered Oct 03 '22 by jclancy