
Hpricot, Get All Text From Document

I have just started learning Ruby. Very cool language, liking it a lot. I am using the very handy Hpricot HTML parser. What I am looking to do is grab all the text from the page,

Solution 1:

You can do this using the XPath text() selector.

require 'hpricot'
require 'open-uri'

# URI.open needs Ruby 2.5+; on older Rubies, open-uri patched plain open(...) instead
doc  = URI.open("http://stackoverflow.com/") { |f| Hpricot(f) }
text = (doc/"//*/text()") # array of text values
puts text.join("\n")

However, this is a fairly expensive operation, since the query visits every element in the document. A better solution might be available.
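
The array from a text() query like the one above usually contains many whitespace-only nodes (the line breaks and indentation between tags), so joining it directly can print a lot of blank lines. Here is a minimal sketch of tidying it up first. Note that `clean_text_nodes` is a hypothetical helper, not part of Hpricot, and it assumes each node's `to_s` returns its text. Plain strings stand in for the nodes so the snippet runs on its own.

```ruby
# Hypothetical helper (not part of Hpricot): tidy up a list of text nodes.
# Accepts any objects that respond to #to_s, e.g. the result of
# doc/"//*/text()" (assuming each node's to_s yields its text).
def clean_text_nodes(nodes)
  nodes
    .map { |node| node.to_s.gsub(/\s+/, " ").strip } # collapse whitespace runs
    .reject(&:empty?)                                 # drop whitespace-only nodes
end

# Plain strings stand in for Hpricot text nodes here.
sample = ["\n  ", "Hpricot,  Get All\nText ", "\t", "From Document"]
puts clean_text_nodes(sample).join("\n")
```

With the snippet from this answer, the call would presumably be `puts clean_text_nodes(text).join("\n")`.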


Solution 2:

You might want to try inner_text.

Like this:

h = Hpricot("<html><body><a href='http://yoursite.com?utm=trackmeplease'>http://yoursite.com</a> is <strong>awesome</strong>")
puts h.inner_text
# => http://yoursite.com is awesome

Solution 3:

@weppos: This should be a bit better:

text = doc/"//p|div/text()" # array of text values
