 

Downloading the entire Bitcoin transaction chain with R

Tags: r, bitcoin

I'm pretty new here, so thank you in advance for the help. I'm trying to do some analysis of the entire Bitcoin transaction chain. In order to do that, I'm trying to create two tables:

1) A full list of all Bitcoin addresses and their balances, i.e.:

| ID | Address     | Balance  |
-------------------------------
| 1  | 7d4kExk...  | 32       |
| 2  | 9Eckjes...  | 0        |
| .  | ...         | ...      |

2) A record of the number of transactions that have ever occurred between any two addresses in the Bitcoin network:

| ID | Sender      | Receiver      | Transactions |
--------------------------------------------------
| 1  | 7d4kExk...  | klDk39D...    | 2            |
| 2  | 9Eckjes...  | 7d4kExk...    | 3            |
| .  | ...         | ...           | ..           |

To do this, I've written a (probably very inefficient) script in R that loops through every block and scrapes blockexplorer.com to compile the tables. I've tried running it a couple of times so far, but I'm running into two main issues:

1 - It's very slow... I can imagine it's going to take at least a week at the rate it's going.

2 - I haven't been able to run it for more than a day or two without it hanging. It seems to just freeze RStudio.

I'd really appreciate your help in two areas:

1 - Is there a better way to do this in R to make the code run significantly faster?

2 - Should I stop using R altogether for this and try a different approach?

Thanks in advance for the help! Please see below for the relevant chunks of code I'm using:

library(XML)  # provides readHTMLTable()

url_start <- "http://blockexplorer.com/b/"
url_end <- ""

readUrl <- function(url) {
  # Parse the first HTML table on the block page; return NA if the request
  # or the parse fails.
  table <- try(readHTMLTable(url)[[1]], silent = TRUE)
  if (inherits(table, "try-error")) {
    message(paste("URL does not seem to exist:", url))
    errors <<- errors + 1       # '<<-' updates the counter defined outside the function
    return(NA)
  } else {
    processed <<- processed + 1
    return(table)
  }
}

block_loop <- function(end, start = 0) {

...

  addr_row <- 1   # next row to fill in the address table
  links_row <- 1  # next row to fill in the links table

  for (i in start:end) {
    print(paste0("Reading block: ", i))
    url <- paste0(url_start, i, url_end)
    table <- readUrl(url)

    # readUrl() returns NA when a block page can't be read, so skip it
    if (!is.data.frame(table)) { next }

....
asked Aug 15 '13 by guayosr


1 Answer

There are very close to 250,000 blocks on the site you mentioned (at least, block 260,000 gives a 404). Curling from my connection (1 MB/s down) gives an average time of about half a second per page. Try it yourself from the command line (just copy and paste) to see what you get:

curl -s -w "%{time_total}\n" -o /dev/null http://blockexplorer.com/b/220000

I'll assume your requests are about as fast as mine. Half a second times 250,000 is 125,000 seconds, or about a day and a half. This is the absolute best you can get with any method, because you have to request each page.

Now, after doing an install.packages("XML"), I saw that running readHTMLTable("http://blockexplorer.com/b/220000") takes about five seconds on average. Five seconds times 250,000 is 1.25 million seconds, which is about two weeks. So your estimate was correct; this is really, really slow. For reference, I'm running a 2011 MacBook Pro with a 2.2 GHz Intel Core i7 and 8 GB of memory (1333 MHz).
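If you want to reproduce that measurement, here is a minimal sketch; it assumes the XML package is installed and that the block explorer URL still resolves:

library(XML)

# Time a single fetch-and-parse of one block page; "elapsed" is the number
# that matters for estimating the full run.
system.time(
  tbl <- readHTMLTable("http://blockexplorer.com/b/220000")[[1]]
)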

Next, table merges in R are quite slow. Assuming about 100 records per block (which seems about average), you'll end up with 25 million rows, and some of those rows hold a kilobyte of data. Even if you can fit that table in memory, concatenating the pieces will be a problem.
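If you do keep the merging step in R, one detail worth knowing is that growing a table with rbind() inside a loop copies the whole accumulated table on every pass. A small illustrative sketch (the function names here are mine, not from your script):

# Quadratic: each rbind() copies everything accumulated so far.
slow_build <- function(pieces) {
  out <- NULL
  for (p in pieces) out <- rbind(out, p)
  out
}

# Much faster: collect the pieces in a list and bind them once at the end.
fast_build <- function(pieces) {
  do.call(rbind, pieces)
}

The same idea applies to the address and links tables your script builds: store each block's rows as a list element and combine them once, rather than appending row by row.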

The solution to these problems that I'm most familiar with is to use Python instead of R, BeautifulSoup4 instead of readHTMLTable, and pandas to replace R's data frame. BeautifulSoup is fast (install lxml, a parser written in C) and easy to use, and pandas is very quick too. Its DataFrame class is modeled after R's data frame, so you can probably work with it just fine. If you need something to request URLs and return the HTML for BeautifulSoup to parse, I'd suggest Requests. It's lean and simple, and the documentation is good. All of these are pip-installable.

If you still run into problems, the only thing I can think of is to get maybe 1% of the data into memory at a time, statistically reduce it, and move on to the next 1%. If you're on a machine similar to mine, you may not have another option.
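A rough sketch of that idea, assuming your block_loop() returns the compiled table for the range it was given (the chunk size and file names here are just illustrative):

chunk_size <- 2500  # roughly 1% of ~250,000 blocks
for (chunk_start in seq(0, 250000, by = chunk_size)) {
  chunk <- block_loop(end = chunk_start + chunk_size - 1, start = chunk_start)
  # ...statistically reduce 'chunk' to just the aggregates you need...
  write.csv(chunk, sprintf("blocks_%06d.csv", chunk_start), row.names = FALSE)
  rm(chunk)
  gc()  # release the chunk's memory before starting the next one
}

That keeps only one chunk in memory at a time and leaves intermediate results on disk, which should also make it easier to resume after a crash.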

answered Oct 03 '22 by jclancy