Wednesday, June 23, 2010

 

Open Government Open Data - Testimony to New York City hearing

In the context of open data, web page data is what people traditionally think of, but that is unstructured data: it is not predictable, and it is typically served via portals. This makes it difficult for search engines to harvest, index, and link, and hence harder to find.

Structured data is used instead to publish datasets that can be freely accessed via a data mine. As data is added to the data mine, however, it becomes important to align with information vocabulary standards so that information can be catalogued and discovered consistently.

Not all data is created equal. Without a vocabulary, a lexicon and a consistent approach, you end up with a lot of noise in the system that actually inhibits access to the data as the mine grows.
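
To make the vocabulary point concrete, here is a minimal sketch in Python. The field names and the dataset are illustrative assumptions, not a mandated City vocabulary; the idea is simply that every dataset entering the mine carries the same agreed catalogue fields, so records can be indexed and discovered consistently and inconsistent ones can be flagged automatically.

# A minimal catalogue record for one (hypothetical) dataset, using a small,
# Dublin Core-style set of fields. The field names are illustrative only.
catalog_entry = {
    "title": "Street Tree Census",
    "description": "Location and species of street trees.",
    "publisher": "Department of Parks",
    "modified": "2010-06-01",
    "format": "XML",
    "license": "Public domain",
}

# Consistency check: a record is only discoverable in a predictable way
# if it carries all of the agreed fields.
REQUIRED_FIELDS = {"title", "description", "publisher", "modified", "format", "license"}

def is_consistent(entry):
    return REQUIRED_FIELDS.issubset(entry)

assert is_consistent(catalog_entry)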

Classifying datasets by cost to implement and value to the public casts light on what the underlying challenges are. The technical challenge, therefore, is making data much more readily available at lower cost, not only to deploy the data in the first place but also to update it periodically and to archive older datasets.

Future-proofing this data is another lesson learned, since, as we know, technology moves extremely rapidly.

What this teaches is a paradox. On the one hand we want to do data right and standardize how that is done; on the other hand we do not want that standardization to be so prescriptive that it blocks out new technology or innovation. We all know the risks of vendor or community lock-in, where only a limited set of providers can play or you need their special software. Using open public standards and open-source-friendly technology is therefore a key need.

Coupled to this is how the information can be accessed. Again, it is vital to avoid proprietary access methods and to use open public services that the internet infrastructure can readily reach. So while “open” vendor APIs may be initially alluring, it is equally important to ensure they do not become the sole sources of the data. Other public means of directly accessing the complete raw data are a key need. Notice also that vendor APIs are two-way: the vendor can harvest who is requesting what types of information, and that may or may not be a concern for the public.

Perhaps the biggest challenge is the unspoken one of complexity. The further one steps into data sharing, the more one sees the opportunity for people to interpose complexity. Keeping things simple, yet consistent and transparent, requires constant vigilance and oversight to ensure that solution providers are not injecting their own self-serving complexity. After all, complexity costs money to build and support, and hence vendors are naturally drawn to inject it.

What this teaches is that New York could usefully invest in the development of open source software resources, tapping into existing open source projects to leverage what is there today and acting as a lead for data development. Focusing resources on this aspect facilitates programming and development both internally and externally, rather than paying the same vendors over and over to develop proprietary solutions for the city.

One area that is instructive is the work of the National Information Exchange Model (NIEM.gov) and the challenges and opportunities it illustrates for the federal government.

NIEM provides a forum where government and industry can work together, so that government is not handing the keys to its data to a select few vendor companies.

Similarly, it is vital to reduce the cost of interoperable information exchanges by providing pre-built common vocabularies for domains that establish a common reference point. This allows agencies to map their local data to the common lexicon.
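
As an illustration of what that mapping looks like in practice, here is a minimal sketch in Python. The local column names and the common names are placeholders chosen for this example, not actual NIEM element names; the point is that each agency keeps its local schema and translates it to the shared lexicon at the point of exchange.

# Map an agency's local field names onto a shared lexicon, NIEM-style.
# Both the local names and the "common" names below are illustrative placeholders.
LOCAL_TO_COMMON = {
    "perp_last_nm": "PersonSurName",
    "perp_first_nm": "PersonGivenName",
    "incident_dt": "IncidentDate",
}

def to_common(local_record):
    # Fields with no mapping are kept under their local name so nothing is lost.
    return {LOCAL_TO_COMMON.get(field, field): value
            for field, value in local_record.items()}

print(to_common({"perp_last_nm": "Doe", "perp_first_nm": "Jane", "incident_dt": "2010-06-23"}))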

This federal standard can provide a blueprint for state and local use. What is more, open source tools are available that help accelerate the development of NIEM-based applications.

NIEM is built on XML data representation techniques and open public standards for modelling and presenting data. Following these guidelines is essential to having consistent information. Allied to this, using software tools to check the integrity and consistency of the data is important, because as the data grows it becomes increasingly difficult to manage and check by hand. The City has currently published 200 datasets. How consistent are these today? Since the City is publishing data in different formats, databases, spreadsheets, text and XML, it is likely not very consistent at all. It should be mentioned, by the way, that it is easy to create XML data that is also spreadsheet compatible. Not many people know that nugget, and it shows some of the technical challenges and knowledge sharing needed.
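
On the spreadsheet-compatible XML point, here is a minimal sketch in Python using only the standard library. The element names and the data are illustrative assumptions, not a City schema; the observation is that a flat, repeating-element XML file of this kind can typically be pulled straight into spreadsheet applications that support XML import, while remaining perfectly good structured data for other tools.

# Write a dataset as flat, repeating-element XML (illustrative element names).
import xml.etree.ElementTree as ET

rows = [
    {"facility_name": "Branch Library A", "borough": "Brooklyn", "zip": "11201"},
    {"facility_name": "Branch Library B", "borough": "Queens", "zip": "11354"},
]

root = ET.Element("dataset")
for row in rows:
    record = ET.SubElement(root, "record")
    for field, value in row.items():
        ET.SubElement(record, field).text = value

ET.ElementTree(root).write("facilities.xml", encoding="utf-8", xml_declaration=True)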

Last but not least, an enlightening example is the State of California's real-time election results reporting data, broadcast on election day. It took a lot to get the State there: selecting the right technology, getting the data into the open OASIS XML public election data reporting format, and then signing up media channels to consume the information and report it to the public. Over the past three years, news channels across the state have quietly integrated this into their live website feeds.

Notice the challenges this shows for legislators and implementers. Who is the authoritative source, and how does the public verify that the data they are being shown is accreditable and trustworthy? How do you select, or foster the development of, proper public standards for use with critical information feeds? How do you facilitate the good aspects of data sharing, such as auditing and verification of the data being provided, including having consistent data from which historical trends can be drawn?
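
One possible verification mechanism, offered here purely as a hedged sketch and not something the State or City currently does, is for the authoritative source to publish a checksum alongside each data feed so that anyone downstream can confirm the copy they received is the copy that was published. In Python:

# Recompute a SHA-256 checksum over a downloaded dataset and compare it to the
# value published by the authoritative source (the file name and checksum below
# are placeholders for this sketch).
import hashlib

def sha256_of(path):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

published_checksum = "..."  # value the authoritative source would publish
if sha256_of("results.xml") == published_checksum:
    print("Downloaded dataset matches the published checksum.")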

It should be noted, by the way, that California found it technically easy to dump its SQL database of election data directly into the OASIS XML reporting format. Simple scripting approaches of this kind are open techniques that New York could learn from and leverage.
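
As a rough illustration of how small such a script can be, here is a hedged sketch in Python. The database, table, and XML element names are simplified placeholders for this example; a real export would target the actual OASIS election reporting schema rather than the generic elements shown here.

# Dump rows from a (hypothetical) SQL results table into an XML report.
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect("election.db")  # placeholder database
root = ET.Element("ElectionReport")
for county, contest, candidate, votes in conn.execute(
        "SELECT county, contest, candidate, votes FROM results"):
    result = ET.SubElement(root, "Result", county=str(county),
                           contest=str(contest), candidate=str(candidate))
    result.text = str(votes)

ET.ElementTree(root).write("results.xml", encoding="utf-8", xml_declaration=True)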

So what California has done is on the one hand good, making media sources the custodians and using open election data standards for presenting the information, but on the other hand weak, in that the State restricts public access to the original source and to an archive of such data. Having procedures and methods in place so that the public can request access to information, so that there is control over access to the more sensitive types of data, and understanding who may serve as public sources of trust, are important to making more data available.

So, in summary, it is crucial to use open public standards, resources built with open source software, and open internet communication and information access methods, and not to slip into a situation where a limited set of industry vendors essentially own your data, with increasing levels of complexity in accessing that data.
