Adopting XML: Tomorrow's Web
A brief chautauqua on language
What is XML
Status of XML
Benefits of XML
Document Type Definitions and dialects of XML
Anatomy of an XML system
XML in your context
Review of XML Tools and Technologies
Baker
an English word
Boulangier
not an English word
We can recognise words as belonging to a language because we know them... (sometimes, we can recognise words as belonging to a language even when we don't know them, because they sound right).
Colourless green ideas sleep furiously.
an English sentence.
although it is nonsense, we can easily parse it and see that it is structurally correct
Development state with join material and.
not an English sentence
although there could be sensible parses in some grammars, we can easily see that in English grammar this is not structurally correct.
In any given language, we can easily recognise what is a well formed component of the language.
And what is not...
Fish
an Indo-European word
Peske
Choiremheir
Chautauqua
Gkprtwcv
P7ajo
Although language families have rules about what can be in a word and what can't, it's much harder to tell whether a word is valid or not, unless we know which language we're looking at.
This is not a pipe.
Ceci n'est pas une pipe.
Chagco vet nici yan toube.
GGGGGG #000007 cabala.
Within a family of languages, we can recognise what is a well-formed component of some language
or might be...
and what certainly isn't.
In Indo-European languages,
a word has at least one vowel
Usually...
no word has more than four consonants in succession
a sentence is a succession of words
a sentence starts with a capital letter and ends with a period
There is an (implicit) meta-grammar.
<address>
A valid HTML tag (in HTML 4.0 Transitional)
<cotton>
Not a valid HTML tag
HTML is a language (albeit a simple one).
we can know at once whether a tag is a valid HTML tag or not...
and what it means...
and how it should be used...
When we know what the language is we can parse ill-formed forms:
been there, done that
because we can predict what the missing bits are
and where they should be:
I have been there, and I have done that.
Key features
Differences from HTML
Differences from SGML
Reality check
A language for describing other languages
Markup languages
Which describe the structure of a document
Not the visual appearance (CSS1, CSS2, XSL)
Written in plain Unicode text (a superset of ASCII; XML parsers must accept both the UTF-8 and UTF-16 encodings)
The visual appearance of a document should be controlled by style sheets.
The appearance of this one is.
In XML as in HTML you don't have to use style sheets.
If you don't, you will get a plain, simple appearance.
If people are interested, you can open the style sheet for this presentation, slideshow.css, in a text editor.
A special style sheet language, XSL, has been written to support the new features of XML.
The current draft is complex: some serious people argue, too complex.
Michael Leventhal's argument
Håkon Lie's argument.
The current draft (21st April 1999) is very different from the preceding version (16th December 1998).
It isn't finished yet and the eventual standard may be very different again.
Microsoft IE5 implements a proprietary 'XSL' which is based on an older draft of XSL and is quite different from the current XSL draft specification...
Cocoon's XSL interpreter implements a form of 'XSL' which is based on an older draft of XSL and appears to be quite different from the current XSL draft specification...
Beware of tutorials (including this one, if you are reading it on the Web): they may well be out of date!
The standard is now broken down into two parts: transformations (XSL-T) and formatting objects (XSL-FO).
The argument is now mainly about XSL-FO
You can continue to use existing CSS1 and CSS2 stylesheets.
Probably.
Depending on what individual client vendors decide to support...
Netscape Gecko and MSIE5 both (roughly) support CSS1.
This presentation is not about style sheets.
A Metalanguage: In a word, extensible.
Also, strictly parsed.
Allows you to define new markup.
Describing structure, not appearance.
Makes it easier for programs to extract information from your documents.
<?xml version="1.0"?>
<!DOCTYPE meeting PUBLIC "-//WEFT//DTD MEETING 0.1//EN" "meeting.dtd">
<meeting id="June Board Meeting">
<venue>
28 Forth Street, Edinburgh
</venue>
<invitees>
<attendee attendance="required" meeting-role="convenor">
<name>
Simon Brooke
</name>
<position>
Technical Director
</position>
</attendee>
<attendee attendance="required">
<name>Angela Stormont</name>
<position>
Communications Director
</position>
</attendee>
</invitees>
</meeting>
What does this do?
For the user directly, very little.
For the user's program, it allows it to isolate items of structured information and handle them in intelligent ways to help the user.
But only if the user's program understands the special markup you have defined.
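As a rough illustration of what "the user's program" might do with this markup, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. The function name summarise_meeting and the fallback role "participant" are assumptions made for this example; they are not defined by the meeting DTD.

```python
import xml.etree.ElementTree as ET

# A compact copy of the meeting document above. The DOCTYPE line is left
# out because ElementTree does not fetch or validate against external DTDs.
MEETING_XML = """<?xml version="1.0"?>
<meeting id="June Board Meeting">
  <venue>28 Forth Street, Edinburgh</venue>
  <invitees>
    <attendee attendance="required" meeting-role="convenor">
      <name>Simon Brooke</name>
      <position>Technical Director</position>
    </attendee>
    <attendee attendance="required">
      <name>Angela Stormont</name>
      <position>Communications Director</position>
    </attendee>
  </invitees>
</meeting>"""


def summarise_meeting(xml_text):
    """Return (meeting id, venue, [(name, role), ...]) for a meeting document."""
    root = ET.fromstring(xml_text)
    venue = root.findtext("venue", default="").strip()
    attendees = [
        (
            attendee.findtext("name", default="").strip(),
            # "participant" is an assumed fallback, not part of the DTD.
            attendee.get("meeting-role", "participant"),
        )
        for attendee in root.iter("attendee")
    ]
    return root.get("id"), venue, attendees


meeting_id, venue, attendees = summarise_meeting(MEETING_XML)
print(meeting_id, "at", venue)
for name, role in attendees:
    print(" ", name, "-", role)
```

Because the structure is explicit, the program never has to guess which piece of text is the venue and which is a person's name.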
Documents which are not well-formed will not be rendered by an XML browser.
At all.
Tags and attributes are case-sensitive;
End tags cannot be omitted - every <p> must have a </p>.
Tags must be correctly nested:
<b><i>This won't work</b></i>
Empty tags (those which don't enclose any content) must be marked with a trailing slash like this: <xx/>
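The rules above can be checked mechanically. The following is a small sketch, assuming Python's standard xml.etree.ElementTree parser stands in for "an XML browser"; the helper name is_well_formed and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


samples = [
    "<p><b><i>This won't work</b></i></p>",  # tags incorrectly nested
    "<p><b><i>This works</i></b></p>",       # tags correctly nested
    "<p>No end tag",                         # end tag omitted
    "<p>Line break<br/></p>",                # empty tag with trailing slash
    "<P>Case matters</p>",                   # start and end tags differ in case
]

for sample in samples:
    print(is_well_formed(sample), sample)
```

Only the second and fourth samples are accepted; the parser rejects the others outright rather than trying to guess what the author meant.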
Most Web designers are sloppy.
More than ninety percent of all commercially authored Web pages do not conform to any standard and are not valid HTML.
Few if any of the commercially available WYSIWYG tools generate valid HTML.
Web authors switching to XML will need to adopt much more rigorous technical discipline.
Like HTML, much simpler!
Like HTML, optimised for delivery over restricted-bandwidth links.
Unlike HTML, a true subset of SGML.
All valid XML documents are valid SGML documents.
SGML tools (conforming to ISO 8879) will work with XML.
Organisations with an existing commitment to SGML will find the transition to XML much simpler.
Where is this process at?
Will this really happen?
Will I have to change what I do?
A draft standard has been published.
Software tools are emerging.
Microsoft IE5 supports XML.
Netscape's (beta) Gecko browser supports XML.
Prediction: commercially important in two years, dominant in four.
All the major players are involved.
Many emerging standards depend on XML.
XML has many advantages over HTML.
Prediction: it will really happen.
But Microsoft in particular may be trying to break the standardisation process
Some observers see Microsoft's BizTalk as an attempt to split the XML community.
'What XML proponents fear most is that the major software makers will use their financial clout to hijack the consensus-building process, leading to proprietary and incompatible versions of XML schemas that favor a particular vendor's software and architecture.'
HTML documents will continue to work with XML browsers.
May need to be more correct than at present.
An implementation of HTML in XML (XHTML) has been completed
Correct HTML documents (i.e. ones which validate) are very easy to convert to XML.
Tools for automatic conversion are available.
Increasingly, search engines will depend on XML-based metadata.
Increasingly, mainstream browsers will exploit XML-based metadata.
Conclusion: no, but...
There will still be a market for the sort of graphics-led 'brochureware' which now dominates the Web
Prediction: The market for systems which exploit the merits of XML will be far larger.
Other significant players
The standards body for the Web.
An open organisation: anyone can join.
Not for profit.
All major players are members.
Should your organisation be?
Driving the standardisation process for XML.
Issued the XML 1.0 Recommendation in February 1998.
Using XML as the basis for emerging standards in privacy, multimedia, etc.
SMIL (Synchronised Multimedia Integration Language): a set of XML extensions to handle embedded multimedia.
PICS (Platform for Internet Content Selection): a means of labelling the content of documents based on criteria of taste - mainly motivated by people who want to protect children from sexually explicit material.
RDF (Resource Description Framework): a standard for structuring embedded metadata - making it easier for programs to understand documents.
This slide is included simply to give some feel of the scale of the XML project...
The Australia New Zealand Land Information Council (ANZLIC) - Metadata
XML Metadata Interchange Format (XMI) - Object Management Group (OMG)
FIXML - A Markup Language for the FIX Application Message Layer
OpenMLS - Real Estate DTD Design
Newspaper Association of America (NAA) - Classified Ads Format
WEBDAV (IETF 'Extensions for Distributed Authoring and Versioning on the World Wide Web')
Publicly committed to XML.
Tim Bray of Netscape and Textuality is joint editor of XML standard document.
Developing XML-based indexing protocol, MCF.
Navigator 4 does not handle XML.
"...includes XML support..."
Source code is available
Publicly committed to XML.
Jean Paoli of Microsoft is joint editor of XML standard document
Developing XML-based indexing protocol, PICS
Have an XML parser written in Java
Have an XML authoring tool, XML Notepad
Internet Explorer 4 partially handles XML
Future Microsoft Office Applications 'will use XML as default file format'.
Supporting BizTalk framework for XML e-commerce
Have a collection of XML tools available for free download, including an XML Browser (written in Java) - but I can't make it work!
IBM are also putting a great deal of effort into XML tools and DTDs for electronic commerce. Many of the tools are available for free download.
Supporters of OASIS's XML.org standards definition community.
Inviting suggestions for an XML strategy!
More seriously, have also released a collection of XML tools in Java.
Supporters of OASIS's XML.org standards definition community.
Example: Chemical Markup Language
SMIL: Glitz and eye-candy
Meta information frameworks: benefits for searching and indexing
The need: a means for chemists to exchange information about molecules and chemical compounds in an application-neutral format.
An application of XML
Specified by a special-purpose DTD
Allows chemists to interchange information
Peter Murray-Rust, University of Nottingham, England.
Prototype software.
Developed for displaying and editing CML, but claimed to be general-purpose XML browser.
Architecturally interesting.
Very opaque and certainly nowhere near user-ready!
I'm including this because in my experience the browser is so hard to use you may not be able to make it do anything the course will see as interesting.
Scalable, interactive diagrams linked into text.
Links from objects in the diagrams into the text.
Complex special notation.
The XML source code for this demo
Ignore the window called 'SGMLTree'.
When the 'TableOfContents' window opens, it's too small and you have to resize it.
I think there should be some way of rendering the document as a readable document, but I can't work out how!
Open the folder marked 'Assignments'
Click on any of the little circles by the assignments.
A window opens with a picture of the molecule.
A window opens with predicted spectroscopy of the molecule.
You'll probably have to resize these.
Rotate the molecule.
Click on the highlighted atom and note how a highlight appears on the graph.
Click on the little circles against other assignments and note how the graph and molecule change.
Finally, look at the XML source for this demo.
No, I don't really understand it either - I'm not a chemist!
Notice how simple the XML source is.
Notice how small the XML source is.
XML is used to display complex technical information to a specialist audience in a form that audience will understand.
Essentially a framework for embedding multimedia objects, so that they can intercommunicate
Has some built-in multimedia capabilities
Optimised for low bandwidth
Designed to make it easy to add new handlers for new multimedia formats
First commercial implementation: Real Networks G2
Say 'Smile'!
NB: You should ensure you have the G2 beta of RealPlayer to view this. As of November 1998, bugs in the server prevented the presentation from completely working over the Internet, but what does work on a 28k modem is still worth showing. To demonstrate the complete presentation, download the source and media in advance.
What is a DTD?
Do I have to use a DTD?
What DTDs are available?
Who will write DTDs?
Essentially, a dictionary for the language you are using.
Every Web author has heard of one
Every good Web author has seen one
Very few Web authors have written one
As with HTML, you don't have to specify a DTD.
Even if you define new markup...
... but client programs won't know how to interpret your new markup unless you also define a DTD.
As with HTML, you should specify a DTD...
... and we all do, don't we, children?
All the XML extensions discussed in this presentation are defined as DTDs.
Thousands of SGML DTDs are available which can relatively easily be converted.
Relatively few XML DTDs are yet available, but the number is growing.
Some repositories:
Very specialised documents, technically demanding to write.
For most purposes, suitable DTDs will quickly become available.
Most Web authors will never write a DTD.
Large organisations with special documentation requirements may write DTDs.
Communities of organisations which wish to exchange data will probably write DTDs.
Corporations which sell word-processors will probably write DTDs.
Corporations which sell WYSIWYG Web authoring tools will certainly write DTDs.
In future, there will be much less distinction between a word processor and a Web authoring tool.
Communities of interest with special technical needs will certainly write DTDs.
A client reads a DTD to decide how to interpret elements in a document...
It reads a stylesheet to decide how to display those elements...
But how does it find functionality?
JavaScript/ECMAScript?
It's reasonable to expect XML supporting clients to support ECMAScript, so that it should be possible to pass scripts which add functionality within the document. Efficiency costs? Limits on what you can do?
Java?
It's reasonable to expect XML supporting clients to support Java, so it should be possible to pass Java components or applets to add functionality.
Using the Document Object Model (DOM)?
Either of the above will interact with the client through the DOM; in principle it should be possible to write DOM-aware components in any other language, provided you can be sure of finding an appropriate environment at the client end.
XSL-T?
This is really another way of applying one of the above: a transform of the document could create or request an appropriate software component which could then be run. XSL-T also supports software 'hooks', but in practice this again depends on the appropriate environment at the client end.
Gecko, IE5 cannot present SMIL...
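To give a feel for what "interacting with the client through the DOM" looks like, here is a minimal sketch using Python's standard xml.dom.minidom, which implements the core W3C DOM interfaces. The add_room function, the room element and the value "Boardroom" are invented for this example and are not part of any DTD discussed here.

```python
from xml.dom.minidom import parseString


def add_room(xml_text, room_name):
    """Parse a document and insert a <room> element after <venue> via the DOM."""
    doc = parseString(xml_text)
    venue = doc.getElementsByTagName("venue")[0]

    # Build the new element and its text content through DOM factory methods.
    room = doc.createElement("room")
    room.appendChild(doc.createTextNode(room_name))

    # insertBefore with a reference node of None appends at the end.
    venue.parentNode.insertBefore(room, venue.nextSibling)
    return doc


doc = add_room(
    '<meeting id="June Board Meeting">'
    "<venue>28 Forth Street, Edinburgh</venue>"
    "</meeting>",
    "Boardroom",
)
print(doc.documentElement.toxml())
```

The same method names (getElementsByTagName, createElement, appendChild, insertBefore) are what a DOM-aware script or component would call in any language with a DOM binding.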
Example: a meeting arranger system
Creating an example document (quite easy)
Creating the DTD (hard, but we'll use a trick)
Viewing it: creating a style-sheet (harder)
Using it: applications
We all go to meetings...
We all know what a hassle it is arranging them...
Wouldn't it be nice if the machines could do it for us?
Here's how!
Start by typing what you want into your favourite text editor.
Invent sensible looking markup as you go along.
Don't be too casual about this
Here's one I did earlier.
This is a good opportunity for a whiteboard and some interaction! If possible, get the participants to do an example for themselves.
A DTD is a precise, technical document. How are we going to make one?
Pass our example page to the DTDGenerator
Tidy up the results with your text editor
Here's one I did earlier.
Again, if possible, get the participants to actually do this.
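The trick of deriving a draft DTD from an example document can be sketched in a few lines. The following Python is an illustration of the general idea only, not the algorithm the DTDGenerator tool actually uses: it assumes every element either contains only child elements or only text, declares every attribute as optional CDATA, and the function name sketch_dtd and the sample document are made up. The output still needs tidying by hand.

```python
import xml.etree.ElementTree as ET

EXAMPLE = """<meeting id="June Board Meeting">
  <venue>28 Forth Street, Edinburgh</venue>
  <invitees>
    <attendee attendance="required" meeting-role="convenor">
      <name>Simon Brooke</name>
      <position>Technical Director</position>
    </attendee>
  </invitees>
</meeting>"""


def sketch_dtd(xml_text):
    """Guess ELEMENT and ATTLIST declarations from one example document."""
    root = ET.fromstring(xml_text)
    children = {}    # element name -> distinct child names, in first-seen order
    attributes = {}  # element name -> distinct attribute names

    for elem in root.iter():
        kids = children.setdefault(elem.tag, [])
        for child in elem:
            if child.tag not in kids:
                kids.append(child.tag)
        attrs = attributes.setdefault(elem.tag, [])
        for name in elem.attrib:
            if name not in attrs:
                attrs.append(name)

    lines = []
    for tag, kids in children.items():
        # Crude content model: any mix of the observed children, or plain text.
        model = "(" + " | ".join(kids) + ")*" if kids else "(#PCDATA)"
        lines.append("<!ELEMENT %s %s>" % (tag, model))
        for name in attributes[tag]:
            lines.append("<!ATTLIST %s %s CDATA #IMPLIED>" % (tag, name))
    return "\n".join(lines)


print(sketch_dtd(EXAMPLE))
```

A single example can only show what was present, never what is required or forbidden, which is why the generated declarations are deliberately permissive.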
Two approaches to stylesheets:
(Of course, you can just do without altogether)
We'll do one just for the agenda.
Now we need to write applications which will:
allow us to generate these documents
not very hard, there are Java components around which semi-automate creating a form-driven special-purpose editor from a DTD...
allow our diary programs to automatically handle these documents
much harder, but XML parser libraries are available for most modern programming languages which you can build on.
We're not going to do that today.
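For readers who want a starting point, the generating side can be sketched very simply. This is a minimal illustration in Python using the standard xml.etree.ElementTree module rather than the Java components mentioned above; the build_meeting function and its argument layout are assumptions for the example, and a real tool would also emit the DOCTYPE declaration and check the result against the DTD.

```python
import xml.etree.ElementTree as ET


def build_meeting(meeting_id, venue, attendees):
    """Build a meeting document from plain data.

    attendees is a list of (name, position, attendance) tuples.
    """
    meeting = ET.Element("meeting", {"id": meeting_id})
    ET.SubElement(meeting, "venue").text = venue
    invitees = ET.SubElement(meeting, "invitees")
    for name, position, attendance in attendees:
        attendee = ET.SubElement(invitees, "attendee", {"attendance": attendance})
        ET.SubElement(attendee, "name").text = name
        ET.SubElement(attendee, "position").text = position
    # encoding="unicode" returns a str instead of bytes.
    return ET.tostring(meeting, encoding="unicode")


xml_text = build_meeting(
    "June Board Meeting",
    "28 Forth Street, Edinburgh",
    [
        ("Simon Brooke", "Technical Director", "required"),
        ("Angela Stormont", "Communications Director", "required"),
    ],
)
print(xml_text)
```

Building the tree through a library, rather than pasting strings together, means the output is always well-formed and special characters are escaped automatically.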
Applications which will benefit greatly from XML
Applications which will benefit little from XML
Early adoption: arguments for
Organisations which should aim to be early adopters
Wait and see: arguments for
Organisations which should aim to wait and see
Hybrid strategy: arguments for
Organisations which should adopt a hybrid strategy
Technical documentation applications, or applications involving special notation (e.g., mathematics, music).
Applications incorporating client-end software agents.
Accounting systems exchanging orders, invoices, payments...
Engineering systems exchanging specifications, dimensions...
Diary systems exchanging bookings, events, meetings, holidays...
Applications requiring highly detailed illustrations.
Multimedia applications.
At present, only where the audience is controlled
Simple publishing of text, with or without simple graphics.
But even here the advantages of XML will gradually take over.
Better representation of technical information.
Improved software-to-software communication of meaning.
Improved multimedia capabilities.
Improved indexing capabilities.
Development of new skills which will very rapidly become important.
Technical information
Specialist search
Gaining Experience
Technically competent business communities.
Publishers in multiple formats
Organisations which distribute large quantities of technical information to a targeted audience should adopt early.
Improved indexing and searching allows better navigation of the documents.
Special markup can be used to help document users and maintainers understand document structure.
Technical notation can be easily incorporated where required.
Vector graphics allow 'zoomable' detail.
Engineering companies distributing technical manuals.
Companies exchanging technical specifications.
Science and research establishments publishing technical information.
Organisations which publish volumes of reference information which users typically search should adopt early.
Improved indexing and searching allows better navigation of the documents.
Special markup allows client-side user agents to understand document structure and select 'interesting' information.
News providers, especially upstream news providers such as PA and Reuters.
Market information providers.
Online libraries.
Search engines.
Organisations which view the Web as core to their business should adopt early.
XML is radically different from HTML:
Much richer, more powerful;
Technically much more demanding.
New skillsets and tools will be needed to publish effectively in XML.
The learning curve is steep.
Web authors and Web production companies.
Especially, software houses which make tools for Web authoring;
Still a considerable market opportunity in which new starters might succeed.
Exchange tenders, orders, specifications, invoices automatically
Leaders doing this now -- e.g. Dell
Major commitment by the whole trading community
All need to be competent
Several competing proposed standards
Publish to both print and Web from the same document.
With appropriate format for each medium.
Successful on-line retailers: organisations with a big investment in existing Web technology, which is paying off for them, do not need to change now, but should track the technology.
Organisations with simple 'brochureware' Web sites will not need to change for some time.
Organisations for whom the Web is not a core part of their business do not need to change now.
Remember: better tools will emerge; this is still a bleeding edge.
Parsers
Editors
Browsers
Database integration and Middleware
Server-side tools
Everyone and his dog seems to have written an XML parser in Java:
Chris Hubick (this one can be played with on line)
There are also a few parsers available in C, Python, etc...
Parsers are essential technology if you want to build user-level tools for XML, but, by themselves, don't do anything useful for the average user.
Everyone and his dog seems to have written an incomprehensible XML editor:
University of Edinburgh XED,
Not incomprehensible, but little more than a simple text editor with syntax checking.
In Java so platform-independent.
Microsoft XML Notepad
Tree-view style editor.
Windows only.
'Tree-View' style editor similar to Jumbo's tree-view.
Windows only.
Combines 'Tree-view' with highlighted source text, similar to early versions of HotMetal.
Windows and Solaris only.
A great deal of thought has clearly gone into the user interface of this product.
Several user interface styles are available.
All are completely incomprehensible to me.
In Java, so platform independent.
Based on existing SGML editor.
More than usually incomprehensible tree view.
Windows only.
Contains a typically opaque 'XML editor', based on incomprehensible boxes rather than incomprehensible trees.
Available on Macintosh.
It is remarkable that anyone imagines that real work can be done with tools of this quality.
Fortunately, Emacs SGML mode handles XML (mostly very well).
Text editor with syntax highlighting and well-formedness verification.
context-sensitive menu of valid tags generated by parsing the DTD declaration
No WYSIWYG view.
Runs on all <joke>proper computers</joke>.
(Yes, of course this presentation was written in Emacs).
There are good, established SGML tools, but these may be too expensive or too technically demanding (or both) for most Web production houses.
Most of these tools are targeted towards the management of large, complex documents (which makes sense for XML) but will look strange to people used to HTML editing tools.
Now being branded as an XML content tool, Interleaf is a long established suite of SGML tools including WYSIWYG editors and content management tools. Cost: unstated but high.
Now owned by Adobe, FrameMaker+SGML is an advanced Desktop Publishing package which understands SGML and has user-oriented facilities for managing stylesheets and DTDs. Cost: about $2000 per seat.
Originally coming from the technical documentation market, Arbortext is another serious publishing toolset now being retargeted at the XML market. The toolkit includes a wide range of input filters, and has a module which links into Microsoft Word, as well as its own end-user oriented editor.
Cost: about $2,000 per seat, DTD compiler is more.
A number of other SGML editing tools can also be used for XML.
Serif Software have a set of extensions for QuarkXPress to do 'WYSIWYG' editing of XML. Ordinary users could use this (with a little training) and it is available now.
Excosoft Documentor is... odd. It's a quirky outliner which handles SGML and XML, and works (but you have to get used to how). Very nearly quite good: ordinary users could use this (with a little training) and it is available now.
XMetaL may be the holy grail: a near WYSIWYG editor which parses your DTD and lays out your document according to your CSS1 stylesheet, but also allows you to see tree views or tag views if that's what you prefer. [MS-Windows only]
Intended for real users to really use.
Released now
Intended for real users to really use.
Still beta, but increasingly functional.
Source code available.
(Apparently) intended for real users to really use.
Looks like a browser.
In Java (1.1), so platform independent.
Pure XML browser - won't render ill-formed HTML.
Beta, still quite buggy.
Demonstrates interesting new functionality not possible with HTML.
Edits as well as browses.
Extremely counter-intuitive and hard to use
Looks like a browser!
Lays out simple documents straightforwardly.
All menu text, help text, et cetera, in Japanese.
Can't lay out any of the documents from the Jumbo or SMIL demos
SGML Browsers may also work.
As XML will be used for complex software-to-software information interchange, persistent, searchable storage of XML objects is important
What orders have we had from Acme Widgets in the past month?
Two principal approaches:
Object Oriented Databases
Create a database schema directly from the DTD
Example: Object Design's Excelon
RDBMS
Essentially produce a very shallow schema which handles XML syntax, and pour any XML document into it
But see also Software AG's Tamino, which claims to be a native XML database.
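The relational approach can be sketched as follows: two generic tables, one for elements and one for attributes, into which any XML document can be poured. This is an illustrative sketch using Python's standard sqlite3 module, not the schema of any product named here; the table names, the orders document and the customer names in it are invented for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# A deliberately shallow schema: it models XML syntax, not the business data.
SCHEMA = """
CREATE TABLE node (
    id     INTEGER PRIMARY KEY,
    parent INTEGER REFERENCES node(id),
    tag    TEXT NOT NULL,
    text   TEXT
);
CREATE TABLE attribute (
    node  INTEGER NOT NULL REFERENCES node(id),
    name  TEXT NOT NULL,
    value TEXT
);
"""


def store(conn, elem, parent=None):
    """Recursively insert an element, its attributes and its children."""
    cur = conn.execute(
        "INSERT INTO node (parent, tag, text) VALUES (?, ?, ?)",
        (parent, elem.tag, (elem.text or "").strip()),
    )
    node_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO attribute (node, name, value) VALUES (?, ?, ?)",
        [(node_id, name, value) for name, value in elem.attrib.items()],
    )
    for child in elem:
        store(conn, child, node_id)
    return node_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
store(conn, ET.fromstring("""<orders>
  <order customer="Acme Widgets"><item>Sprockets</item></order>
  <order customer="Bloggs Ltd"><item>Cogs</item></order>
  <order customer="Acme Widgets"><item>Flanges</item></order>
</orders>"""))

# "What have Acme Widgets ordered?" expressed against the generic tables.
rows = conn.execute("""
    SELECT item.text
    FROM node AS o
    JOIN attribute AS a ON a.node = o.id
    JOIN node AS item ON item.parent = o.id
    WHERE o.tag = 'order' AND item.tag = 'item'
      AND a.name = 'customer' AND a.value = 'Acme Widgets'
    ORDER BY o.id
""").fetchall()
print(rows)
```

The schema never changes when the DTD does, which is the attraction; the price is that even a simple business question needs several joins.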
Proposals are beginning to emerge for XML specific dynamic content tools:
Apache eXtensible Server Pages (XSP)
Existing dynamic content tools can also be used to generate XML content:
Server-side scripting languages:
Why would you want to do this? Because most users don't have XML-aware browsers yet.
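One way to serve such users is to keep the content as XML on the server and convert it to HTML on request. The following is a minimal sketch of that idea in Python, assuming the meeting markup used earlier in this presentation; the function name meeting_to_html and the choice of HTML layout are made up, and a production system would more likely use a stylesheet-driven transform.

```python
import xml.etree.ElementTree as ET
from html import escape

MEETING = """<meeting id="June Board Meeting">
  <venue>28 Forth Street, Edinburgh</venue>
  <invitees>
    <attendee attendance="required">
      <name>Simon Brooke</name>
      <position>Technical Director</position>
    </attendee>
  </invitees>
</meeting>"""


def meeting_to_html(xml_text):
    """Render a meeting document as a simple HTML fragment."""
    root = ET.fromstring(xml_text)
    items = "".join(
        "<li>%s (%s)</li>"
        % (
            escape(attendee.findtext("name", "").strip()),
            escape(attendee.findtext("position", "").strip()),
        )
        for attendee in root.iter("attendee")
    )
    return "<h1>%s</h1><p>%s</p><ul>%s</ul>" % (
        escape(root.get("id", "")),
        escape(root.findtext("venue", "").strip()),
        items,
    )


print(meeting_to_html(MEETING))
```

The XML stays the single source of truth; the HTML is a disposable view that can be dropped once clients can render XML themselves.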
Produced by the Apache project, free under Apache licence.
Written in Java, using the Servlet API, and abstract interfaces to the parser and the output formatter
Run it on the (Java 1.2 aware) platform of your choice
With the (Servlet aware) HTTPD of your choice
With the XML parser of your choice
And the output formatter of your choice
Most of the available XML browsers can't even render the demos distributed with the other XML browsers.
None of the purpose-built XML editing tools appears suitable for creating large or complex document structures. Most of them are unusable.
There are still very considerable market opportunities.
This presentation is available online.
Simon Brooke has been a technical consultant in advanced software applications for thirteen years. He advises on the development of software architectures and systems, primarily for Internet and Intranet applications.
As a consultant, Simon has advised many blue chip companies, primarily in the IT, Telecoms and Chemical industries, on the application and development of advanced software systems.