I'm using Net::HTTP with Ruby to crawl a URL.
I don't want to crawl streaming audio such as: http://listen2.openstream.co/334
In fact, I only want to crawl HTML content, so no PDFs, video, plain text, etc.
Right now, I have both open_timeout and read_timeout set to 10, so even if I do crawl these streaming audio pages, they will time out.
require 'net/http'
require 'addressable/uri'

url = 'http://listen2.openstream.co/334'
uri = Addressable::URI.parse(url)   # parse first: uri is needed for the path below
path = uri.path
req = Net::HTTP::Get.new(path, {'Accept' => '*/*', 'Content-Type' => 'text/plain; charset=utf-8', 'Connection' => 'keep-alive', 'Accept-Encoding' => 'Identity'})
resp = Net::HTTP.start(uri.host, uri.inferred_port) do |http|
    http.open_timeout = 10
    http.read_timeout = 10
    # how can I read the headers here, before the body starts streaming, and then exit because the content type is audio?
    http.request(req)
end
However, is there a way to check the headers BEFORE reading the body of an HTTP response, to see whether it is audio? I want to do this without sending a separate HEAD request.
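net/http supports streaming responses, and you can use this to read the headers before the body. When you pass a block to request, the block is called after the headers have been received but before the body has been read.
Code example: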
require 'net/http'

url = URI('http://stackoverflow.com/questions/41306082/ruby-nethttp-read-the-header-before-the-body-without-head-request')
Net::HTTP.start(url.host, url.port) do |http|
  request = Net::HTTP::Get.new(url)
  http.request(request) do |response|
    # check headers here (e.g. response['Content-Type']); the body has not been read yet
    # then call read_body (in chunks) or body (all at once) to read the body
    if true  # placeholder: replace with your own header check
      response.read_body do |chunk|
        # process body chunks here
      end
    end
  end
end
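To make the header check concrete for the "HTML only" case in the question, here is a minimal sketch. The helper names html_response? and fetch_html are made up for illustration, and it uses the standard URI class rather than Addressable. One caveat, as far as I can tell from the net/http source: if the block returns normally without reading the body, net/http still reads the body afterwards, so for an endless audio stream you need to leave the block early. The sketch does that with return, which also lets Net::HTTP.start close the connection on the way out. Redirects are not handled.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: true only when the Content-Type header is text/html.
# Net::HTTPResponse#content_type drops parameters such as "; charset=utf-8"
# and returns nil when the header is missing.
def html_response?(response)
  response.content_type.to_s.downcase == 'text/html'
end

# Hypothetical helper: returns the body as a String for a successful HTML
# response, or nil for anything else (audio, PDF, errors, redirects).
def fetch_html(url, timeout: 10)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == 'https',
                  open_timeout: timeout,
                  read_timeout: timeout) do |http|
    http.request(Net::HTTP::Get.new(uri)) do |response|
      # Headers are available here; the body has not been read yet.
      # Returning from the method skips the body entirely.
      return nil unless response.is_a?(Net::HTTPSuccess) && html_response?(response)

      return response.read_body
    end
  end
end
```

Usage would look like html = fetch_html('http://listen2.openstream.co/334'), which should give nil as soon as the headers show a non-HTML content type, assuming the server sends an accurate Content-Type header.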