ANN: vxml (validating xml lib) - proof of concept - need some guidance - bad type level performance (ghc)

Hi @ll. This shouldn't have been an announce yet.. it's more crying for help to get to know how / wether to continue :-) I had an idea: a xml generating library validating the result against a given dtd (to be used in web frameworks etc .. ) After one week of coding I got the result git clone git://mawercer.de/vxml (1) (see README) git hash as of writing : 4dc53 A minimal example looks like this (taken from test.hs): import Text.XML.Validated.TH $( dtdToTypes "dtds/xhtml1-20020801/DTD/xhtml1-strict_onefile.dtd" (XmlIds (Just "-//W3C//DTD XHTML 1.0 Strict//EN") (Just "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd") ) ) main = putStrLn $ xml $ ((html << (head << (title <<< "vxml hello world"))) << (body << (div <<< "PCDATA text" )) ) Which takes about 7 secs to compile on my (fast) 2.5 Ghz machine. Before Oleg telling me about how to use the TypeEq implementation found in HList it took several hours (switch the cabal configure flag to see experience the incredible slow TypeToNat implementation) However type checking the simple small document (2) already takes 35 seconds, duplicating the tags within the body raises compilation time to 74 seconds. (Have a look at test/XmlToQ.hs to automatically create the code generating such a given xml file (its used in run-testcases.hs) The dtd representation used is very easy to understand: (3a) shows allowed subelements of the head tag Seq = sequence Star = zero or more ... (3b) shows the list of allowed attributes for a more commonly used tag such as div the element list (3a) is 324 lines long and there are approximately 15 attributes. Use a line such as putStrLn $ xml $ debugEl div to get this view. All the work is done by class Consume st el r | st el -> r -- result is on of C,CS,R,F class Retry elType st el st' | st el -> st' st = state as given in (3a) el = child to be added (Elem Html_T) or PCDATA C = element consumed, end CS a = element consumed, continue with state a R a = retry with given state (can happen after removing a Star = ()* on a no match ) F a = no match, show failure a instance Retry elType C el C instance Retry elType (CS st) el st instance ( -- retry Consume st el r , Retry elType r el st' ) => Retry elType (R st) el st' Maybe you knowing much more about ghc internals have some more ideas how to optimize? I only came up with a) implement xml HTrue $ xmldoc -- type checking variant xml HFalse $ xmldoc -- unchecked but faster variant so you can run xml HTrue once a day only while having lunch.. ( doubt Maybe you have to take lunch two or more times on intermediate web projects .. ) b) change (A attrType) to attrType only and (Elem e) to e which does not work for the same reason as class TypeEq a b c | a b -> c instance TypeEq a a HTrue instance TypeEq a b HFalse doesn't. Of course a a will match a b as well, but the result is totally different.. So would it be possible to either tell ghc to ignore this happily taking the result of the better matching instance? Or to explicitely tell ghc when an instance matches the way it does in current implementations make it a no match if some constraints can or cannot be satisfied such as this: class TypeEq a b c | a b -> c instance TypeEq a a HTrue instance [ NoMatchOn a b -- if this can be satisfied ignore this instance ] => TypeEq a b HFalse class NoMatchOn a b instance NoMatchOn a a c) Having a reduction rule such as this: If the left side of (Or a b) returnns C (consumed) tell ghc to not evaluate the result of (Consume b element result) which must be done to see wether there is closer instance match (Is this correct ?) Of course not everything is implemented yet.. but except speed I consider it beeing quite usable. He, thanks for reading till the end. Any feedback and suggestions are appreciated. If you have any trouble contact me on irc. I've been using $ ghc --version The Glorious Glasgow Haskell Compilation System, version 6.8.2 Sincerly Marc Weber (1) If you are interested and either don't have git or are not familiar using it I can send you a tarball. (2) <!DOCTYPE html SYSTEM "/tmp/dtd.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US"> <head> <title>Example 6 - XHTML 1.0 Strict as application/xhtml+xml</title> <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8" /> <link rel="stylesheet" type="text/css" href="style.css" /> </head> <body> <h1>Example 6 - XHTML 1.0 Strict as application/xhtml+xml</h1> <p> This document is valid XHTML 1.0 Strict served as <code>application/xhtml+xml</code>. </p> <p> This document references CSS rules contained in an external stylesheet via <code>link</code>. </p> <p> Note how the CSS rules for the background are applied in Netscape 7.x, Mozilla, Opera 7 but that Internet Explorer can not display the page at all. </p> <p> <a href="http://validator.w3.org/check/referer"><img src="http://www.w3.org/Icons/valid-xhtml10" alt="Valid XHTML 1.0!" height="31" width="88" /></a> </p> </body> </html> (3a) (Seq (Star (Or (Elem Script_T) (Or (Elem Style_T) (Or (Elem Meta_T) (Or (Elem Link_T) (Elem Object_T)))))) (Or (Seq (Elem Title_T) (Seq (Star (Or (Elem Script_T) (Or (Elem Style_T) (Or (Elem Meta_T) (Or (Elem Link_T) (Elem Object_T)))))) (Query (Seq (Elem Base_T) (Star (Or (Elem Script_T) (Or (Elem Style_T) (Or (Elem Meta_T) (Or (Elem Link_T) (Elem Object_T)))))))))) (Seq (Elem Base_T) (Seq (Star (Or (Elem Script_T) (Or (Elem Style_T) (Or (Elem Meta_T) (Or (Elem Link_T) (Elem Object_T)))))) (Seq (Elem Title_T) (Star (Or (Elem Script_T) (Or (Elem Style_T) (Or (Elem Meta_T) (Or (Elem Link_T) (Elem Object_T))))))))))) (3b) (HCons (A Lang_A) (HCons (A Xml:lang_A) (HCons (A Dir_A) (HCons (A Id_A) (HCons (A Profile_A) HNil)))))

Oleg has pointed me into the right direction: He has suggested to use kind of class AttrOk elTrype attr for attributes (result 35s -> 30s) and do something similar with the state. (30s -> 4,5s). After doing this I was quite happy: The 4,5s do include reading the dtd and generating about 1500 DecQ declarations. There might still be some duplication.. in state transformation steps. However trying to typecheck the same file repeating the body 400 times didn't end.. why? Have a quick glance at the file data which shows quadratic behaviour. That's bad.. I would like to have some linear behaviour. Any ideas what is causing this? Is there a way to get linear scalability? However disabling all validation stuff (see cabal flag) no longer improves performance much. The data below proofs that right now validation increases compilation time by a factor 1.35- 1.45 (within the range of 5-30 replications of the body) If you'd like to play with the library and give some feedback I'd be happy. Read the README. The benchpress dependency is only needed for this benchmark test. You may want to remove it from the cabal file. So don't try to compile 4000 lines long xml files or be prepared to wait days. I should expand the benchmark to also offer results for xhtml lib. Sincerly Marc Weber ============= compilation times with validation ============================================== body replication count | compilation time [ms] 1 4146.477 2 4292.153 3 4508.56 4 4654.244000000001 5 4788.195 6 5041.674999999999 7 5347.1140000000005 8 6134.5960000000005 9 6019.624000000001 10 6459.544 11 7054.433999999999 12 7614.197 13 8489.003 14 8529.610999999999 15 9271.491 16 10058.419 17 12290.142 18 13736.074999999999 19 14863.893 20 15944.82 21 17856.611999999997 22 17977.841 23 17686.297 24 19279.314 25 20960.785 26 22750.754 27 24407.506 28 26342.242 29 28423.79 30 30932.777000000002 31 48478.841 32 45609.897 33 40574.255 34 41220.062 35 43952.545999999995 36 47437.922 37 50584.12100000001 38 53983.848000000005 39 57935.593 ============= ======================================================= eg starting gnuplot entering f(x)=(x-b)**2*c+a fit f(x) 'data' via a, b, c plot 'data', f(x) ============= compilation times without validation ================================ body replication count | compilation time [ms] 1 3939.887 << Import.hs has been recompiled. thats why the first took longer then the next 2 3138.127 3 3080.317 4 3179.19 5 3279.339 6 3604.609 7 3736.829 8 4094.6179999999995 9 4252.377 10 4767.588 11 5070.188 12 5537.007 13 5840.085 14 6478.276 15 6916.67 16 7568.444 17 8301.047 18 9211.368999999999 19 10025.886 20 10872.749 21 12086.263 22 12975.372 23 14366.282 24 15640.214 ============= =======================================================

There might still be some duplication.. in state transformation steps. I've removed duplicate state machine paths. Instead of approxmiately 1500 there are "only" 500 left.
You can see the impact on speed here http://mawercer.de/~marc/vxml.svg data + (red): approx 1500 data3 * (blue): approx 500 (current version) data2 x (green): without validation So most time is now spend somwhere else. Sincerly Marc Weber

Context: Basic xml validation of vxml does work now. So I'm looking for a convinient way to use it. (1) My first approach: putStrLn $ xml $ ((html_T << ( head_T << (title_T <<< "hw") << (link_T `rel_A` "stylesheet" `type_A` "text/css" `href_A` "style.css") )) << ( body_T << ((script_T `type_A` "text/javascript") <<< "document.writeln('hi');" ) << (div_T `onclick_A` "alert('clicked');" `style_A` "color:#F79" <<< "text within the div" ) ) ) comment: That's straight forward:
: add subelement
: add text However having to use many parenthesis to get nesting is awkward.
(2) My second idea: (#) = flip (.) putStrLn $ xml $ ( headC ( (titleC (<<< "hw")) # (linkC (rel_AF "stylesheet" # type_AF "text/css" # href_AF "style.css" ) ) ) # bodyC ( scriptC ( type_AF "text/javascript" # text "document.writeln('hi');" ) # divC ( onclick_AF "alert('clicked')" # style_AF "color:#F79" # text "text within the div" ) ) ) html_T comment: headC a b = head with context where a is a function adding subelements then adding itself to the elemnt passed by b Thus headC id parent would add headC to parent I don't feel much luckier this way (3) Third idea: xmlWithInnerIO <- execXmlT $ do xmlns "http://www.w3.org/1999/xhtml" >> lang "en-US" >> xml:lang "en-US" head $ title $ text "minimal" body $ do args <- lift $ getArgs h1 $ text "minimal" div $ text $ "args passed to this program: " ++ (show args) comment: WASH is using do notation which is really convinient. elements beeing at the same level can be concatenated by new lines, subelemnts can be added really nice as well. However: This can't work. (>>) :: m a -> m b -> m b but I need this (>>) :: m a -> m' b -> m'' b or (>>) :: m st a -> m st' b -> m st'' b along with functional dependencies that st' can be deduced from st and st'' from st'.. There are some happy cases eg when having a DTD such as (a | b)* because the state will "loop" and not change.. But this is no solution. (4) Another way would be defining << : (add subelement <|> : concatenate same level (+++) of xhtml lib html << head << title <<< "title" <|> meta .. <|> body << div <|> div However you already see the trouble.. ghc will read this as (html << (head << (title <<< "title"))) <|> meta .. <|> (body << div) -- body should be added to html, not to head! <|> div There would be a solution using different fixities html <<1 head <<2 title <<< "title" <|2> meta <|1> body <<2 div <|2> div <<3 div <<4 div so that <<4 binds stronger than <<3 etc.. But I think thats awkward as well. (5) But ghc is rich, I can think of another way: Quasi Quoting.. [$makeAFun| html do head do meta $1 link $2 body do div $3 div $4 |] (Dollar1 "bar") (Dollar2 "foo") (Dollar3 "foo3") (Dollar4 "foo4") the wrapper type sDollar{1,2,3,4} aren't necessary, but they will help eg if you remove the $2 line. They also enable you using a substitute more than once. (6) Another solution would be writing a preprocesor reusing alreday exsting code (HSP or WASH ?) or the haskell-src packages? Is there yet another solution which I've missed? I still think that (3) would be superiour.. Is there a way to define my own >>= and >> functions such as: {-# define custom do doX; (>>=) : mybind , >> : "my>>" #-} body $ doX args <- lift $ getArgs This would be terrific. Sincerly Marc Weber

I still think that (3) would be superiour.. Is there a way to define my own >>= and >> functions such as:
{-# define custom do doX; (>>=) : mybind , >> : "my>>" #-} body $ doX args <- lift $ getArgs This would be terrific.
Sincerly Marc Weber
dons has told me about 06:27 < dons> http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#re... 06:27 < lambdabot> Title: 8.3.�Syntactic extensions, example : module Main where import Prelude import Debug.Trace import System.IO main = do let (>>=) a b = trace (show "woah") $ (Prelude.>>=) a b getLine >>= print so actually this can be done? I'll try it. Marc Weber

Marc Weber wrote:
(3) Third idea: xmlWithInnerIO <- execXmlT $ do xmlns "http://www.w3.org/1999/xhtml" >> lang "en-US" >> xml:lang "en-US" head $ title $ text "minimal" body $ do args <- lift $ getArgs h1 $ text "minimal" div $ text $ "args passed to this program: " ++ (show args) I still think that (3) would be superiour.. Is there a way to define my own >>= and >> functions such as:
There is also the combinator approach of Text.Html, which gives you a syntax similar to (3) but without abusing "do": (rootElt ! [xmlns "http://www.w3.org/1999/xhtml", lang "en-US" >> xml:lang "en-US"]) $ concatXml [head $ title $ text "minimal" ,body $ concatXml [h1 $ text "minimal" ,div $ text $ "args passed to this program: " ++ (show args) ] ] You use concatXml (it's concatHtml in the library) followed by a list, instead of do, for nesting. (Also, it's stringToHtml instead of text in the library.) A few more brackets, but still pretty clean. Also, you'll have pass in your args from somewhere else, in the IO monad - which is probably a better design anyway. Regards, Yitz

Oops, needed to convert one more >> into a comma: (rootElt ! [xmlns "http://www.w3.org/1999/xhtml" ,lang "en-US" ,xml_lang "en-US" ]) $ concatXml etc. -Yitz

There is also the combinator approach of Text.Html, which gives you a syntax similar to (3) but without abusing "do":
(rootElt ! [xmlns "http://www.w3.org/1999/xhtml", lang "en-US" >> xml:lang "en-US"]) $ concatXml [head $ title $ text "minimal" ,body $ concatXml [h1 $ text "minimal" ,div $ text $ "args passed to this program: " ++ (show args) ] ]
Keep in mind that my library tries to do real DTD type checking. This means that body, h1, div all have different types. So they can't easily be put into a list. And yes: I care about each character I have to type less :-) About IO (). IO was just a poor man example to show that you can nest arbitrary monads. Have a look at the WASH library to see in which wonderful ways this can be used.. One working snippet from the testXHTML.hs sample file provided by the lib: #include "vxmldos.h" tDo $ runHtmlDoc $ vdo head $ title $ text "text" body $ vdo script $ X.type "text/javascript" >> text "document.writeln('hi');" h2 $ text "That's a headline, hello and how do you do?" -- br e eg a <br/> is not allowed here div $ vdo onclick "alert('clicked');" styleA "color:#F79" text "text within the div" div e return "That's nice, isn't it?" vxmldos.h defines a vdo cpp macro which expands to let .... ; in $ do where ... rebinds >>=, >>, lift to make this work (in with ghc head) The library can now cope with fancy dts such as (a*)* etc as well If you still find bugs let me know. The more interesting part starts now: figuring out how do build a nice HTML library on top of this all. Sincerly Marc Weber
participants (2)
-
Marc Weber
-
Yitzchak Gale