Use Capybara with Poltergeist (PhantomJS) to scrape the text content of the body of the HTML page at a given URL.

12/10/2015 - 9:53 AM

#!/usr/bin/env ruby
require 'capybara'
require 'capybara/poltergeist'

class Scraper
  include Capybara::DSL
  Capybara.register_driver :poltergeist do |app|
    Capybara::Poltergeist::Driver.new app,
      phantomjs_options: ['--load-images=no','--ignore-ssl-errors=yes'],
      js_errors: false,
      inspector: false,
      debug: false
  end
 
  attr_accessor :document, :response_headers, :title, :metas, :text

  def initialize(url)
    @session = Capybara::Session.new(:poltergeist)
    @session.driver.headers = { 'User-Agent' => "Mozilla/5.0 (Macintosh; Intel Mac OS X)" }
    get url
  end

  def get(url)
    @session.visit url
    @document = @session.document
    @response_headers = @session.response_headers

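    @title = @session.title
    # Attribute hashes for every <meta> tag; meta tags are never visible, so disable the visibility filter.
    @metas = @session.find_all('meta', visible: false).map { |meta| meta.native.attributes }
    # Document#text takes a text type (:all or :visible), not a selector, so scope to <body> explicitly.
    @text  = @session.find('body').text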

    @session.driver.quit
    self
  end

end

s = Scraper.new "http://www.humani.se"
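
Once the scrape has run, the collected values are available through the accessors (`s.title`, `s.text`, `s.metas`, `s.response_headers`). As a rough sketch of working with the result, the hypothetical helper below (`metas_by_name` is not part of the original snippet) indexes the meta tags by name. It assumes each entry in `s.metas` is a Hash with string keys such as `"name"` or `"property"` and `"content"`; sample data stands in for a live scrape so the example runs without PhantomJS.

```ruby
# Hypothetical helper, not part of the original snippet: index meta tags by name.
# Assumes each entry looks like { "name" => "description", "content" => "..." }.
def metas_by_name(metas)
  metas.each_with_object({}) do |attrs, index|
    # Open Graph tags use "property" instead of "name".
    key = attrs['name'] || attrs['property']
    # Skip tags with no usable key or content (e.g. <meta charset="utf-8">).
    next if key.nil? || attrs['content'].nil?
    index[key] = attrs['content']
  end
end

# Sample data standing in for s.metas.
sample = [
  { 'charset' => 'utf-8' },
  { 'name' => 'description', 'content' => 'An example page' },
  { 'property' => 'og:title', 'content' => 'Example' }
]

index = metas_by_name(sample)
puts index['description']
```

With a real scrape, `metas_by_name(s.metas)` would take the place of the sample array, provided the driver returns attributes in the assumed shape.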
