Wednesday, June 23, 2010

 

Who needs OASIS any more?

Is the day rapidly approaching when we have essentially made ourselves redundant?

I gave testimony today in New York on its proposed open government and open data legislation.  This whole world of Web 3.0 collaborative information and Gov 2.0 solutions is supposed to be self-describing, as per the W3C and its “open data” RDF syntax.

More telling is that the community is driven by “give us the data” and cares little about how it gets there: spreadsheets, CSV, text, XML, or RDF.  Injecting a structured approach into the conversation is therefore difficult, given the myriad agendas being advanced and the timelines desired.  Only once the data Tower of Babel becomes intractable under its own weight will some perspective on the long-term challenges become apparent.  So perhaps a parallel path is possible, where these data providers go off and get the data while someone else begins cataloguing the data entities and content models involved, looking to move toward an aligned future view.  This could be the opportunity for OASIS to develop “CityHallML”, providing the most common components of financial, infrastructure and performance data along with census and demographic data.
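To make that idea concrete, here is a minimal sketch of what a single “CityHallML” canonical component and an instance built from it might look like.  Everything below is hypothetical; no such dictionary exists yet, and the component name and fields are invented purely for illustration.

    import xml.etree.ElementTree as ET

    # Hypothetical "CityHallML" dictionary entry; name and fields are invented.
    BUDGET_LINE_ITEM = {
        "name": "BudgetLineItem",
        "fields": ["AgencyName", "FiscalYear", "Amount", "FundingSource"],
    }

    def build_instance(values):
        # Assemble a small XML instance from the canonical component definition.
        root = ET.Element(BUDGET_LINE_ITEM["name"])
        for field in BUDGET_LINE_ITEM["fields"]:
            ET.SubElement(root, field).text = str(values[field])
        return ET.tostring(root, encoding="unicode")

    print(build_instance({
        "AgencyName": "Department of Parks",
        "FiscalYear": 2010,
        "Amount": "1250000.00",
        "FundingSource": "City General Fund",
    }))

The point is that the definition lives once, in the shared dictionary, and every city instance is just values hung on those agreed names.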

The city, of course, is looking at the costs of getting the data and then supporting it with archives, updates and publication feeds.  A harmonized approach can significantly reduce deployment and sustainment costs, and opens the door to software development collaboration and cost savings for the cities themselves.  A common view also helps solution providers market to cities nationally, not just locally.  Perhaps the biggest challenge is the unspoken one of complexity.  The deeper one steps into data sharing, the more one sees the opportunity for people to interpose complexity.  Keeping things simple, yet consistent and transparent, requires constant vigilance and oversight to ensure that solution providers are not injecting their own self-serving complexity.  After all, complexity costs money to build and support and acts as a barrier to competitors, so vendors are naturally drawn to inject it.

Today I also published a paper on creating dictionaries of canonical components, aligned to the NIEM.gov approach and CEFACT core components model.   This would be a great foundation for CityHallML.

This contrasts with the W3C world view of self-describing data instances and RDF.  You either embed the RDF semantics, with all that overhead on each and every data item, or you take the OASIS-based approach of semantics referenced in domain dictionaries and information structure templates, which allows comparatively small, concise data instances in which the XML tags provide the link between the content and the semantics about that content.
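A rough illustration of the difference, with both fragments invented for the purpose (neither is a real city vocabulary): an RDF/XML-annotated data item versus a concise instance whose tag name simply points back at a definition held once in a shared dictionary.

    # Both fragments are invented, purely to show relative overhead per data item.
    rdf_style = """
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:ex="http://example.org/city/terms#">
      <rdf:Description rdf:about="http://example.org/city/contracts/1234">
        <ex:contractAmount
            rdf:datatype="http://www.w3.org/2001/XMLSchema#decimal">1250000.00</ex:contractAmount>
      </rdf:Description>
    </rdf:RDF>
    """

    # Dictionary-referenced style: the tag name is the hook back to the shared
    # definition, so the semantics are not repeated on every instance.
    dictionary_style = "<ContractAmount>1250000.00</ContractAmount>"

    print(len(rdf_style), "characters versus", len(dictionary_style))

Multiply that per-item overhead by millions of data items and the case for referencing semantics rather than embedding them becomes clear.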

Equally important is that the canonical components are built using naming and design rules (NDR) that drive consistency of approach and convergence on terms and meaning.
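As a trivial sketch of what an NDR buys you, assume a simplified rule that element names must be UpperCamelCase (real NDRs, such as the NIEM NDR, go far beyond this); a tool can then flag non-conforming names mechanically rather than relying on reviewers to spot them.

    import re

    # Simplified stand-in for one NDR rule: names are UpperCamelCase words,
    # no underscores, no all-caps abbreviations.
    UPPER_CAMEL = re.compile(r"^(?:[A-Z][a-z0-9]+)+$")

    def check_names(names):
        # Return the names that violate the convention.
        return [name for name in names if not UPPER_CAMEL.match(name)]

    print(check_names(["BudgetLineItem", "fiscal_year", "AgencyName", "AMT"]))
    # -> ['fiscal_year', 'AMT']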

This all contrasts with today’s approach of publishing a mega-structure as a schema that contains every possible exchange component for every facet of a business process.  Developers are then forced to unravel the puzzle of what each part of the business process needs from that mega-structure, often sending redundant or empty data elements, instead of using dynamic content assembly templates built from selected parts of a dictionary of canonical components.
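A small sketch of the dynamic-assembly alternative, with an invented five-entry dictionary standing in for a real domain dictionary: each exchange pulls only the components it actually needs instead of dragging the whole mega-structure along.

    import xml.etree.ElementTree as ET

    # Invented stand-in for a domain dictionary of canonical components.
    DICTIONARY = {
        "AgencyName":   {"type": "string"},
        "FiscalYear":   {"type": "gYear"},
        "Amount":       {"type": "decimal"},
        "ContactEmail": {"type": "string"},
        "GeoLocation":  {"type": "string"},
    }

    def assemble_template(exchange_name, wanted):
        # Build a minimal structure carrying only the components this exchange needs.
        root = ET.Element(exchange_name)
        for component in wanted:
            if component not in DICTIONARY:
                raise KeyError(component + " is not a canonical component")
            ET.SubElement(root, component)
        return ET.tostring(root, encoding="unicode")

    # A budget-publication exchange needs three of the five components, not all of them.
    print(assemble_template("BudgetPublication", ["AgencyName", "FiscalYear", "Amount"]))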

Now, let’s assume everyone drinks this OASIS Kool-Aid: they create domain dictionaries of canonical components and then use shared open source tooling to create their information structures dynamically, with the tooling taking care of all the plumbing, the templates, the extraction and creation of XML instances from backend data stores, and submission to XML online repositories for archiving and exposure through search and retrieval services.

Vendors and government collaborate to develop and deploy open source based portals that allow further sharing and open access to data. Additional niche services using collaborative social platform tools integrate into these and deliver a wealth of community facing solutions to citizens.

Life is good. 

This means that, technically, no one strictly needs to publish formal exchange structure schemas any more; exchanges are dynamically built to purpose by the communities.  We already saw this need arise during the Haiti relief effort, when OASIS Emergency EDXL had to be extended on the fly to support on-the-ground situations with hospitals and the services they could provide.

So then, who needs OASIS any more at this point?  How do we make ourselves relevant in this uber Web 3.0 world of data sharing dominated by XML-based services, driven by today’s technology underpinnings of SOAP, REST, RSS and HTTP, and by IETF and W3C speak with RDF?

Where is the long-term picture for OASIS in all this?  And is this necessarily a bad thing?  We could just decide that "mission accomplished" is a good answer.

After all, the recent demise of "shareware" shows that the world has got the message, in marketing terms, about building solutions that are open source and that publish and use openly released public standards.  These are then managed with peer review within a user community, with an authoritative stakeholder source providing stewardship.  Perhaps providing collaboration support services to such communities could be another OASIS facility.

We could envision the need for a triumvirate to manage and steward the way forward, where federal, state and local government stakeholders need independent oversight and technology guidance.  This is similar to what NIEM.gov is currently doing federally, and perhaps to what New York and other states will seek.

Of course many vendors are out there pitching their wares and setting out their stalls, figuring that if they can own a state’s data they essentially have a license to print money from those needing access to it, or can push targeted advertising content at them along with the data they seek.  In New York we heard testimony along the lines of “It’s only a small monthly fee or one-time subscription for a week’s access to what you need, and we have analysts to help you”.  Notice also that Microsoft has created OData to publish RSS-driven feeds that also link into SharePoint, and Google has its own open data APIs and associated search tools.

So for OASIS this could mean positioning its long-term objectives around keeping data sharing truly open, without the dominance of particular solution providers at the expense of smaller community-based services, or even of the community itself.  Information empowers democracy, but it can also be used to track and restrict the freedoms of those who would seek that truth and equality.  Asking suspicious questions can incur penalties or allow law enforcement to track potential suspects.

The traditional areas of formal, legislated transactional information exchanges for secure B2B use may also provide OASIS with continued niche work, although the gap there will continue to blur as the use case for open data encroaches on transactional data and network speeds continue to erode what is thought of as optimized high-volume exchange of small transactions.

That line is blurred further by building contextual, business-process-driven data exchanges from components drawn from canonical dictionary collections, with embedded links to open data sources.  For example, I send the city a price quote for items and embed reference links to my public company profile, my digital certificate public key registered with the city, and the item descriptions published by the city for the RFP.  On contract award, the city can then simply publish the same information that was submitted as the bid.
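A sketch of what such a bid might look like on the wire, with every URL and element name invented for illustration: the quote carries links to already-published open data rather than restating that content inline.

    import xml.etree.ElementTree as ET

    # Invented element names and URLs; the point is the reference-by-link pattern.
    quote = ET.Element("PriceQuote")
    ET.SubElement(quote, "CompanyProfileRef").set(
        "href", "https://data.example.gov/vendors/acme-profile")
    ET.SubElement(quote, "PublicKeyRef").set(
        "href", "https://data.example.gov/certs/acme-2010")
    item = ET.SubElement(quote, "QuotedItem")
    ET.SubElement(item, "ItemDescriptionRef").set(
        "href", "https://data.example.gov/rfp/2010-17/items/42")
    ET.SubElement(item, "UnitPrice").text = "129.95"

    print(ET.tostring(quote, encoding="unicode"))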

This ushers in a very collaborative new world.

So perhaps another answer is augmenting the OASIS Kavi tooling with Web 3.0 enabled portals and services that can publish canonical dictionaries of component definitions, helping drive standardization out in the domain user communities, and thereby keeping OASIS central to conversations around authoritative sources of good, high-quality components for building collaborative spaces?

Then just maybe the answer lies beyond data and into rule sharing and systems.

If we have solved information sharing, then the next piece of the puzzle is open sharing of the underlying rules and trap doors that can snag the unwary.  Clearly rule-sharing systems are the next step up from mere data sharing, because they have to be built on top of consistent information representations.

Back in the day, in 1998, when we started the XML/edi work, we talked about "The Fusion of Five": XML, EDI, repositories, templates and agents.

What each of these represents is described here:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.100.3149&rep=rep1&type=pdf

Checking off the first four as closed chapters in the brave new Web 3.0 world, is the agent piece the next great frontier?  We are already seeing related work, such as the OASIS SET TC, which is providing a framework for information mapping automation.

Clearly the world is redefining what is perceived as possible and what requires better solutions and standard representations.

 

Open Government Open Data - Testimony to New York City hearing

In the context of open data, web page data is what people traditionally think of, but that is unstructured data: it is not predictable, and it is typically served via portals.  This makes it difficult for search engines to harvest, index and link, and hence harder to find.

Structured data is used instead to publish data sets that can be freely accessed via a data mine.  As data is added to the data mine however, it becomes important to align to information vocabulary standards so that information can be catalogued and discovered consistently.

Not all data is created equally.  Without a vocabulary, lexicon and approach you end up with a lot of noise in the system that actually inhibits access to data as the mine grows.

Classification by cost to implement and value to the public casts light on what the underlying challenges are.  The technical challenge, therefore, is making data much more readily available at lower cost, not only to deploy the data but also to update it periodically and archive older datasets.

Future-proofing the data is another lesson learned, as technology moves extremely rapidly, as we know.

What this teaches is a paradox.  On the one hand we want to do data right and standardize how that is done; on the other we do not want that to be so prescriptive that it blocks out new technology or innovation.  We all know the risks of vendor or community lock-in, where only a limited set of providers can play or you need their special software.  Using open public standards and open source friendly technology are therefore key needs.

Coupled to this is how the information can be accessed.  Again, avoiding proprietary access methods is vital, as is using open public services that the internet infrastructure can readily reach.  So rather than adopting “open” vendor APIs, which may be initially alluring, it is equally important to ensure those do not become the sole sources of the data.  Other public means of directly accessing the complete raw data are a key need.  Notice also that vendor APIs are two-way, so they can harvest who is requesting what types of information, and that may or may not be a concern for the public.

Perhaps the biggest challenge is the unspoken one of complexity.  The deeper one steps into data sharing, the more one sees the opportunity for people to interpose complexity.  Keeping things simple, yet consistent and transparent, requires constant vigilance and oversight to ensure that solution providers are not injecting their own self-serving complexity.  After all, complexity costs money to build and support, so vendors are naturally drawn to inject it.

What this teaches is that New York could usefully invest in developing open source software resources and in tapping into existing open source projects, leveraging what is there today and acting as a lead for data development.  Focusing resources on this facilitates programming and development both internally and externally, rather than paying over and over for the same vendors to develop proprietary solutions for the city.

One area that is instructive is the work of the National Information Exchange Model (NIEM.gov) and the challenges and opportunities this illustrates for federal government.

NIEM provides a forum where government and industry can work together, so that government is not handing over the keys to its data to a select few vendor companies.

Similarly, it is vital to reduce the cost of interoperable information exchange by providing pre-built common vocabularies for domains, establishing a common reference point.  This allows agencies to map their local data to the common lexicon.
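The mapping step itself can be very modest.  A minimal sketch, with invented local column names and invented common terms, to show the shape of it:

    # Map one agency's local column names onto shared vocabulary terms so
    # different sources can be catalogued consistently. All names invented.
    LOCAL_TO_COMMON = {
        "dept":    "AgencyName",
        "fy":      "FiscalYear",
        "amt_usd": "Amount",
    }

    def to_common(record):
        # Unknown local fields pass through unchanged for later review.
        return {LOCAL_TO_COMMON.get(key, key): value for key, value in record.items()}

    print(to_common({"dept": "Parks", "fy": 2010, "amt_usd": "1250000.00"}))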

This universal standard for federal government can provide a blueprint for state and local use.  What is more, open source tools are available that help accelerate development of NIEM-based applications.

NIEM is built on XML data representation techniques and open public standards for modelling and presenting data.  Following these guidelines is essential to having consistent information.  Allied to this, using software tools to check the integrity and consistency of the data is important, because as data grows it becomes increasingly difficult to manage and check by hand.  Currently the City has published 200 datasets.  How consistent are these today?  Since the city is publishing data in different formats (databases, spreadsheets, text and XML), it is likely not very consistent at all.  By the way, it should be mentioned that it is easy to create XML data that is also spreadsheet compatible.  Not many people know that nugget, and it shows some of the technical challenges and the knowledge sharing needed.
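For instance, one way to do it (a sketch, with invented data) is to emit the rows in the XML Spreadsheet 2003 layout, which Excel and other spreadsheet tools that support that format can open directly:

    # Sketch: write tabular data in the XML Spreadsheet 2003 layout. Data invented.
    rows = [
        ("AgencyName", "FiscalYear", "Amount"),
        ("Department of Parks", "2010", "1250000.00"),
    ]

    def cells(row):
        return "".join(
            '<Cell><Data ss:Type="String">%s</Data></Cell>' % value for value in row)

    workbook = (
        '<?xml version="1.0"?>'
        '<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" '
        'xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">'
        '<Worksheet ss:Name="Budget"><Table>'
        + "".join("<Row>%s</Row>" % cells(row) for row in rows)
        + "</Table></Worksheet></Workbook>"
    )

    with open("budget.xml", "w") as handle:
        handle.write(workbook)

The same rows remain plain, well-formed XML that any other tool can parse, which is exactly the point.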

Last but not least, an enlightening example is the State of California’s real-time election results reporting data, broadcast on election day.  It took a lot to get the State there: selecting the right technology, getting the data into an open OASIS XML public election data reporting format, and then signing up media channels to consume and report this information to the public.  Over the past three years, news channels across the state have quietly integrated this into their live website feeds.

Notice the challenges this shows for legislators and implementers.  Who is the authoritative source, and how does the public verify that the data they are being shown is accredited and trustable?  How do you select or foster the development of proper public standards for use with critical information feeds?  How do you facilitate the good aspects of data sharing, such as auditing and verification of the data being provided, including having consistent data from which historical trends can be drawn?

By the way, it should be noted that California found it technically easy to dump their SQL database election data directly into the OASIS XML reporting format.  These kinds of simple script approaches are open techniques that New York could learn from and leverage.
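The shape of such a script is roughly the following.  This is a sketch only: the table layout and element names are invented and are not the actual OASIS election reporting elements, and it assumes a local database file exists with the invented table.

    import sqlite3
    import xml.etree.ElementTree as ET

    # Assumes a local results.db with a "results" table; both invented for this sketch.
    conn = sqlite3.connect("results.db")
    cursor = conn.execute("SELECT contest, candidate, votes FROM results")

    report = ET.Element("ElectionResults")  # illustrative, not the real element names
    for contest, candidate, votes in cursor:
        result = ET.SubElement(report, "Result")
        ET.SubElement(result, "Contest").text = contest
        ET.SubElement(result, "Candidate").text = candidate
        ET.SubElement(result, "Votes").text = str(votes)

    print(ET.tostring(report, encoding="unicode"))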

So what California has done is good on the one hand (making media sources the custodians, using internal election standards for presenting the information) but weak on the other, in that they restrict public access to the original source and to an archive of the data.  Having procedures and methods in place so that the public can request access to information, giving control over access to more sensitive types of data, and understanding who may serve as public sources of trust, are important to making more data available.

So in summary: using open public standards and resources, open source software, and open internet communication and information access methods is crucial, as is not slipping into a situation where a limited set of industry vendors essentially own your data, with increasing levels of complexity around accessing it.
