
Thank you all for replying. I managed to write a Python script. It depends
on PyQt4. I am curious whether we have anything like PyQt4 in Haskell.

import sys
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *

# http://www.rkblog.rk.edu.pl/w/p/webkit-pyqt-rendering-web-pages/
# http://pastebin.com/xunfQ959
# http://bharatikunal.wordpress.com/2010/01/31/converting-html-to-pdf-with-pyt...
# http://www.riverbankcomputing.com/pipermail/pyqt/2009-January/021592.html

def convertFile():
    web.print_(printer)
    print "done"
    QApplication.exit()

if __name__ == "__main__":
    url = raw_input("enter url:")
    filename = raw_input("enter file name:")
    app = QApplication(sys.argv)
    web = QWebView()
    web.load(QUrl(url))
    #web.show()
    printer = QPrinter(QPrinter.HighResolution)
    printer.setPageSize(QPrinter.A4)
    printer.setOutputFormat(QPrinter.PdfFormat)
    printer.setOutputFileName(filename + ".pdf")
    QObject.connect(web, SIGNAL("loadFinished(bool)"), convertFile)
    sys.exit(app.exec_())
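As for the Haskell side of the question: one hedged alternative, if a PyQt4-style binding isn't at hand, is to shell out to the external wkhtmltopdf tool (which is built on the same QtWebKit engine) via System.Process. A minimal, untested sketch, assuming wkhtmltopdf is installed and on the PATH:

import System.Process ( rawSystem )
import System.Exit ( ExitCode )

-- Render a URL to a PDF by delegating to the external wkhtmltopdf tool.
-- This assumes wkhtmltopdf is installed and on the PATH; it is a sketch,
-- not a PyQt4 binding.
urlToPdf :: String -> FilePath -> IO ExitCode
urlToPdf url out = rawSystem "wkhtmltopdf" [ url, out ]

main :: IO ()
main = do
  putStrLn "enter url:"
  url <- getLine
  putStrLn "enter file name:"
  name <- getLine
  _ <- urlToPdf url ( name ++ ".pdf" )
  return ()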
On Fri, Sep 9, 2011 at 11:03 AM, Matti Oinas wrote:
The whole wikipedia database can also be downloaded if that is any help.
http://en.wikipedia.org/wiki/Wikipedia:Database_download
There is also text on that site saying "Please do not use a web crawler to download large numbers of articles. Aggressive crawling of the server can cause a dramatic slow-down of Wikipedia."
Matti
2011/9/9 Kyle Murphy
It's worth pointing out at this point (as alluded to by Conrad) that what you're attempting might be considered somewhat rude, and possibly slightly illegal (depending on the insanity of the legal system in question). Automated site scraping (which is essentially what you're doing) is generally frowned upon by most hosts unless it follows some very specific guidelines, usually at a minimum respecting the restrictions specified in the robots.txt file contained in the domain's root. Furthermore, depending on the type of data in question, and on whether a EULA was agreed to if the site requires an account, doing any kind of automated processing might be disallowed. Now, I think Wikipedia has a fairly lenient policy, or at least I hope it does considering it's community driven, but depending on how much of Wikipedia you're planning on crawling you might at the very least consider severely throttling the process to keep from sucking up too much bandwidth.
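A minimal sketch of that kind of throttling in Haskell, using the same Network.HTTP API that shows up later in this thread (the URL list and the one-second delay are only placeholders):

import Network.HTTP ( simpleHTTP, getRequest, getResponseBody )
import Control.Concurrent ( threadDelay )
import Control.Monad ( forM_ )

-- Fetch each URL in turn and sleep for one second between requests,
-- so the crawl stays polite. The URL list and the delay are
-- illustrative placeholders.
politeFetch :: [ String ] -> IO ()
politeFetch urls = forM_ urls $ \ url -> do
  body <- getResponseBody =<< simpleHTTP ( getRequest url )
  putStrLn ( url ++ ": fetched " ++ show ( length body ) ++ " bytes" )
  threadDelay 1000000    -- one second, in microseconds

main :: IO ()
main = politeFetch [ "http://en.wikipedia.org/wiki/Shapinsay" ]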
On the topic of how to actually perform that crawl, you should probably check out the format of the link provided in the download-PDF element. After looking at an article (note, I'm basing this off a quick glance at a single page), it looks like you should be able to modify the URL provided in the "Permanent link" element to generate the PDF link: change the title argument to arttitle, add a new title argument with the value "Special:Book", and add the new arguments "bookcmd=render_article" and "writer=rl". For example, if the permanent link to the article is:
http://en.wikipedia.org/w/index.php?title=Shapinsay&oldid=449266269
Then the PDF URL is:
http://en.wikipedia.org/w/index.php?arttitle=Shapinsay&oldid=449266269&title=Special:Book&bookcmd=render_article&writer=rl
This is all rather hacky as well, and none of it has been tested, so it may not actually work, although I see no reason why it shouldn't. It's also fragile: if Wikipedia changes just about anything it could all break, but that's the risk you run any time you resort to site scraping.
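A minimal sketch of that URL rewrite in Haskell, assuming the article title and oldid have already been pulled out of the permanent link (the parameter order is an assumption, mirroring the example above):

-- Build the PDF-rendering URL described above from an article title
-- and the oldid taken from its "Permanent link". Untested sketch;
-- the parameter order is assumed.
renderUrl :: String -> String -> String
renderUrl arttitle oldid =
  "http://en.wikipedia.org/w/index.php?arttitle=" ++ arttitle
    ++ "&oldid=" ++ oldid
    ++ "&title=Special:Book"
    ++ "&bookcmd=render_article"
    ++ "&writer=rl"

main :: IO ()
main = putStrLn ( renderUrl "Shapinsay" "449266269" )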
-R. Kyle Murphy -- Curiosity was framed, Ignorance killed the cat.
On Thu, Sep 8, 2011 at 23:40, Conrad Parker wrote:
On Sep 9, 2011 7:33 AM, "mukesh tiwari" wrote:
Thank you for the reply, Daniel. Considering my limited knowledge of web programming and javascript, first I need to simulate some sort of browser in my program which will run the javascript and generate the pdf. After that I can download the pdf. Is this what you mean? Is Network.Browser any help for this purpose? Is there a way to solve this problem? Sorry for the many questions, but this is my first web application and I am trying hard to finish it.
Have you tried finding out if simple URLs exist for this, that don't require Javascript? Does Wikipedia have a policy on this?
Conrad.
On Fri, Sep 9, 2011 at 4:17 AM, Daniel Patterson wrote:
It looks to me that the link is generated by javascript, so unless you can script an actual browser into the loop, it may not be a viable approach.
On Sep 8, 2011, at 3:57 PM, mukesh tiwari wrote:
I tried to use the PDF-generation facilities. I wrote a script which generates the rendering url. When I paste the rendering url into a browser it generates the download file, but when I try to get the tags, the result is empty. Could someone please tell me what is wrong with the code? Thank you, Mukesh Tiwari
import Network.HTTP
import Text.HTML.TagSoup
import Data.Maybe

parseHelp :: Tag String -> Maybe String
parseHelp ( TagOpen _ y ) =
  if ( filter ( \( a , b ) -> b == "Download a PDF version of this wiki page" ) y ) /= []
    then Just $ "http://en.wikipedia.org" ++ ( snd $ y !! 0 )
    else Nothing

parse :: [ Tag String ] -> Maybe String
parse [] = Nothing
parse ( x : xs )
  | isTagOpen x = case parseHelp x of
                    Just s  -> Just s
                    Nothing -> parse xs
  | otherwise   = parse xs

main = do
  x <- getLine
  tags_1 <- fmap parseTags $ getResponseBody =<< simpleHTTP ( getRequest x )    -- open url
  let lst = head . sections ( ~== "<div class=portal id=p-coll-print_export>" ) $ tags_1
      url = fromJust . parse $ lst    -- rendering url
  putStrLn url
  tags_2 <- fmap parseTags $ getResponseBody =<< simpleHTTP ( getRequest url )
  print tags_2
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
-- /*******************************************************************/
try { log.trace("Id=" + request.getUser().getId() + " accesses " + manager.getPage().getUrl().toString()) } catch(NullPointerException e) {}
/*******************************************************************/
This is real code, but please make the world a bit better place and don’t do it, ever.
* http://www.javacodegeeks.com/2011/01/10-tips-proper-application-logging.html...