Why Carry the Cost of Linked Data?

In his ongoing series of niggles about Linked Data, Rob McKinnon claims that “mandating RDF [for publication of government data] may be premature and costly”. The claim is made in reference to Francis Maude’s parliamentary answer to a question from Tom Watson. Personally I see nothing in the statement from Francis Maude that implies the mandating of RDF or Linked Data, only that “Where possible we will use recognised open standards including Linked Data standards”. Note the “where possible”. However, that’s not the point of this post.

There’s nothing premature about publishing government data as Linked Data – it’s happening on a large scale in the UK, US and elsewhere. Where I do agree with Rob (perhaps for the first time ;)) is that it comes at a cost. However, this isn’t the interesting question, as the same applies to any investment in a nation’s infrastructure. The interesting questions are who bears that cost, and who benefits?

Let’s make a direct comparison between publishing a data set in raw CSV format (probably exported from a database or spreadsheet) and making the extra effort to publish it in RDF according to the Linked Data principles.

Assuming that your spreadsheet doesn’t contain formulas or merged cells that would make the data irregularly shaped, or alternatively that you can create a nice database view that denormalises your relational database tables into one, the cost of publishing data in CSV basically amounts to running the appropriate export of the data and hosting the static file somewhere on the Web. Dead cheap, right?
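
Just to make the “dead cheap” claim concrete, the whole publishing step can be as small as the following sketch (the database, table and column names are invented for illustration):

```python
# Minimal sketch of the CSV publishing route described above.
# The database, table and column names are purely illustrative.
import csv
import sqlite3

conn = sqlite3.connect("schools.db")  # hypothetical source database
rows = conn.execute("SELECT school_id, name, town, pupil_count FROM schools")

with open("schools.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["school_id", "name", "town", "pupil_count"])  # header row
    writer.writerows(rows)

# ...then copy schools.csv to any static file host on the Web.
```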

Oh wait, you’ll need to write some documentation explaining what each of the columns in the CSV file means, and what types of data people should expect to find in each one. You’ll also need to create and maintain some kind of directory so people can discover your data in the crazy haystack that is the Web. Not quite so cheap after all.

So what are the comparable processes and costs in the RDF and Linked Data scenario? One option is to use a tool like D2R Server to expose data from your relational database to the Web as RDF, but let’s stick with the CSV example to demonstrate the lo-fi approach.

This is not the place to reproduce an entire guide to publishing Linked Data, but in a nutshell, you’ll need to decide on the format of the URIs you’ll assign to the things described in your data set, select one or more RDF schemata with which to describe your data (analogous to defining what the columns in your CSV file mean and how their contents relate to each other), and then write some code to convert the data in your CSV file to RDF, according to your URI format and the chosen schemata. Last of all, for it to be proper Linked Data, you’ll need to find a related Linked Data set on the Web and create some RDF that links (some of) the things in your data set to things in the other. Just as with conventional Web sites, if people find your data useful or interesting they’ll create some RDF that links the things in their data to the things in yours, gradually creating an unbounded Web of data.
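
To make that concrete, here is a rough sketch of the conversion step in Python using RDFlib, reusing the hypothetical schools.csv from above. The URI pattern, vocabulary terms and the DBpedia link are illustrative assumptions, not a recommendation:

```python
# Sketch: convert the hypothetical schools.csv to RDF and link it to DBpedia.
# The URI pattern and vocabulary terms are made up for illustration.
import csv
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS, XSD

EX = Namespace("http://data.example.gov.uk/id/school/")       # chosen URI pattern
SCHEMA = Namespace("http://data.example.gov.uk/def/school#")  # chosen schema terms

g = Graph()
g.bind("school", SCHEMA)

with open("schools.csv", newline="") as f:
    for row in csv.DictReader(f):
        school = EX[row["school_id"]]                 # mint a URI for the thing itself
        g.add((school, RDF.type, SCHEMA.School))
        g.add((school, RDFS.label, Literal(row["name"])))
        g.add((school, SCHEMA.pupilCount,
               Literal(row["pupil_count"], datatype=XSD.integer)))
        # Outgoing link into the wider Web of data, e.g. to the town in DBpedia
        g.add((school, SCHEMA.locatedIn,
               URIRef("http://dbpedia.org/resource/" + row["town"])))

g.serialize(destination="schools.ttl", format="turtle")
```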

Clearly these extra steps come at a cost compared to publishing raw CSV files. So why bear these costs?

There are two main reasons: discoverability and reusability.

Anyone (deliberately) publishing data on the Web presumably does so because they want other people to be able to find and reuse that data. The beauty of Linked Data is that discoverability is baked into the combination of RDF and the Linked Data principles. Incoming links to an RDF data set put that data set “into the Web” and outgoing links increase the interconnectivity further.

Yes, you can create an HTML link to a CSV file, but you can’t link to specific things described in the data or say how they relate to each other. Linked Data enables this. Yes, you can publish some documentation alongside a CSV file explaining what each of the columns means, but that description can’t be interlinked with the data itself to make it self-describing. Linked Data does this. Yes, you can include URIs in the data itself, but CSV provides no mechanism for indicating that the content of a particular cell is a link to be followed. Linked Data does this. Yes, you can create directories or catalogues that describe the data sets available from a particular publisher, but this doesn’t scale to the Web. Remember what the arrival of Google did to the Yahoo! directory? What we need is a mechanism that supports arbitrary discovery of data sets by bots roaming the Web and building searchable indices of the data they find. Linked Data enables this.
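
To illustrate the self-describing point with a minimal sketch (assuming a vocabulary that publishes its term definitions as dereferenceable RDF, as FOAF does): because the “column meanings” are themselves URIs, a program can fetch and read their definitions directly, rather than relying on out-of-band documentation.

```python
# Sketch: predicates are URIs, so their documentation can be fetched and read
# by machine. Here we look up what foaf:name means (requires network access).
from rdflib import Graph, URIRef
from rdflib.namespace import RDFS

FOAF_NAME = URIRef("http://xmlns.com/foaf/0.1/name")

vocab = Graph()
vocab.parse("http://xmlns.com/foaf/0.1/")  # the FOAF vocabulary describes itself in RDF

for comment in vocab.objects(FOAF_NAME, RDFS.comment):
    print(comment)  # e.g. "A name for some thing."
```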

Assuming that a particular data set has been discovered, what is the cost of any one party using that data in a new application? Perhaps this application only needs one data set, in which case all the developer must do is read the documentation to understand the structure of the data and get on with writing code. A much more likely scenario is that the application requires integration of two or more data sets. If each of these data sets is just a CSV file then every application developer must incur the cost of integrating them, i.e. linking together the elements common to both data sets, and must do this for every new data set they want to use in their application. In this scenario the integration cost of using these data sets is proportional to their use. There are no economies of scale. It always costs the same amount, to every consumer.
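
To see what that per-consumer cost looks like, here is the kind of ad hoc matching code (file names, column names and the crude normalisation are all invented) that every developer ends up writing when joining two raw CSV files:

```python
# Sketch of the integration work two raw CSV files force on every consumer:
# guess which columns correspond and hand-match the values.
import csv

def load(path, key_column):
    """Index a CSV file by a (crudely normalised) key column."""
    with open(path, newline="") as f:
        return {row[key_column].strip().lower(): row for row in csv.DictReader(f)}

schools = load("schools.csv", "town")        # one publisher calls it "town"
census = load("census.csv", "area_name")     # another calls it "area_name"

for town, school_row in schools.items():
    match = census.get(town)                 # and hope the spellings line up...
    if match:
        print(school_row["name"], match["population"])
```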

Not so with Linked Data, which enables the data publisher to identify links between their data and third party data sets, and make these links available to every consumer of that data set by publishing them as RDF along with the data itself. Yes, there is a one-off cost to the publisher in creating the links that are most likely to be useful to data consumers, but that’s a one-off. It doesn’t increase every time a developer uses the data set, and each developer doesn’t have to pay that cost for each data set they use.
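
Such a linkset can be as simple as a handful of triples published alongside the data; a sketch, with illustrative URIs:

```python
# Sketch of a publisher's one-off linkset: owl:sameAs triples published alongside
# the data, so no consumer has to rediscover these correspondences themselves.
# Both URIs are illustrative.
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

g = Graph()
g.add((URIRef("http://data.example.gov.uk/id/school/12345"),
       OWL.sameAs,
       URIRef("http://education.data.gov.uk/id/school/12345")))
g.serialize(destination="school-links.ttl", format="turtle")
```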

If data publishers are seriously interested in promoting the use of their data then this is a cost worth bearing. Why constantly reinvent the wheel by creating new sets of links for every application that uses a certain combination of data sets? Certainly, as a UK taxpayer, I would rather the UK Government made this one-off investment in publishing and linking RDF data, thereby lowering the cost for everyone who wants to use it. This is the way to build a vibrant economy around open data.

11 Responses to “Why Carry the Cost of Linked Data?”


  1. Tweets that mention Why Carry the Cost of Linked Data? – Tom Heath’s Displacement Activities -- Topsy.com

    [...] This post was mentioned on Twitter by Tom Heath and Paul Geraghty, Monika Lechner. Monika Lechner said: RT @tommyh: "Why Carry the Cost of Linked Data?" http://is.gd/cRouh (new blog post) #linkeddata #semanticweb #datagovuk @delineator @tom … [...]

  2. Kingsley Idehen

    Tom,

    To cut a long story short: RDF formats provide one of several vehicles for publishing Linked Data on public or private HTTP-based networks.

    RDF isn’t accepted as a Data Model; it’s a Data Representation mechanism (Markup). The base Data Model is EAV (with EAV/CR when going into the realms of TBox and ABox partitioning). Linked Data adds the use of HTTP-based URIs for Names in the Entity, Entity Attribute, and Entity Attribute Value slots of the base model’s 3-tuple structure.

    EAV model data can be expressed in a myriad of representation formats (which includes the RDF family).

    RDF has its own merits with respect to real object modeling fidelity. Thus, we don’t need to conflate Data Representation and Data Model in order for RDF to succeed, based on its own tangible virtues.

    Important Note:
    TimBL’s original design issues doc never made mention of RDF or SPARQL as mandatory. Even today, I strongly believe his current edition references RDF and SPARQL (in parentheses) as examples of relevant standards that aid publication of Linked Data.

    Links:

    1. http://bit.ly/cA0zxw — Data 3.0 Manifesto (decouples RDF from Linked Data).

  3. John S. Erickson, Ph.D.

    Great post, Tom!

    In my view the question is analogous to that of (web) API design: ultimately, to be successful, it must be usable!

    In his great Javapolis keynote, “How to Design a Good API and Why it Matters”, Josh Bloch of Google laid out these characteristics of a “Good API”:

    • Easy to learn
    • Easy to use, even without documentation
    • Hard to misuse
    • Easy to read and maintain code that uses it
    • Sufficiently powerful to satisfy requirements
    • Easy to extend
    • Appropriate to audience

    Bloch argues that with these properties in mind, a public API can be an organization’s “greatest asset.” I would argue that for data-intensive organizations, datasets published as linked data with these same characteristics could well be their most valuable asset!

  4. Tony Hirst

    I think one of the things you get in a CSV data set is a view over a set of data.
    When I discover a CSV data set, via Google or wherever (;-), using human search terms, I may or may not get some of the following:

    - the data I was looking for;
    - an identifiable source;
    - related data (so e.g. I search for the population of London and get a dataset that contains the population of other places too).

    To the extent that a CSV table is a full-text searchable file, if it has been indexed I can throw in my arbitrary and possibly not very good search terms, and maybe the index will throw a dataset back (e.g. as a CSV file) that contains the datum I want.

    In terms of discovering the data, this is really hit and miss, really flaky, and dead simple.

    Cf. how many people would use Google books or Amazon’s one search box rather than enter the right things in the right boxes in a Library OPAC (or hack an even purer Z39.50 query from a terminal command line (can you even do this?!))

    I know I keep coming across as anti-SPARQL and anti-Linked Data, but I’m not. I just think that the barriers to entry are too many at the moment for enough people to see how they might benefit…

    IMHO, if the datastores started publishing interesting and maybe frequently requested data views as full-text searchable and discoverable CSV, linked alongside to a SPARQL query that would return that same data in a richer form, it would help…

    Then I as a novice can stumble across the thing I was maybe looking clumsily for, do what I want with it, maybe realise it’s not quite what I want, see the SPARQL, not understand a word of what it says but see one part of it that I can guess at changing (hmmm… /bluerghhh London wibble/… I wonder if /bluerghhh Manchester wibble/ works too…? Hmmm, looks like between two dates? What happens if I change those dates… and so on…)

    I think in terms of weird surface areas relating to discoverability and usability of this stuff. CSV is discoverable over a large area (search engines can handle it, local datastores, etc.); it’s usable over a wide area (lots of tools import it, lots of people use tools that can import it).

    You’re a scientist, I’m a pragmatist. But you’re maybe thinking a little bit too much like the librarians (we’ll be having no tags here, we just want proper controlled vocabularies) as they upheld the proper way of doing things while their users just Googled?

    Doh! I don’t mean to say Linked Data is not the way forward, I just don’t think it’s yet a representation that large numbers of people would feel comfortable with or capable of working with, given what they currently know, what they currently do, and how they culturally currently do it…

  5. John S. Erickson, Ph.D.

    Regarding Kingsley’s comment concerning EAV: I think we all get that EAV has historical precedence and is a more fundamental model than RDF, but for practical purposes why should adopters concern themselves with it?

    To my knowledge EAV has no equivalent community like OpenRDF.org supporting an open source reference implementation; there is no Python library like RDFlib for working with EAV; there is no standardized set of attributes like RDFa for embedding assertions within Web documents…

    You’ve stated above,

    …EAV model data can be expressed in a myriad of representation formats (which includes the RDF family)…

    From a practical standpoint, what are those formats and why should adopters care?

  6. Kingsley Idehen

    As per my response on Twitter, my focus is demystification of Linked Data. To achieve this goal I am rewinding the Linked Data story back to the very beginning. Positioning RDF as a Data Model (rather than Markup for a Data Model) is problematic. Our 12-year comprehension odyssey is living proof.

    Inserting EAV into the Linked Data conversation is about creating a foundation for understanding the roots of the model that drives RDF data representation formats. Basically, how it’s been enhanced by the addition of URIs in the case of basic RDF, and generic HTTP URIs in the case of RDF-based Linked Data.

    Kingsley

  7. Matthias Samwald

    Somehow I am in the mood for playing devil’s advocate today…

    *) You take into account that choosing linked data over CSV comes with added costs for the publisher. You omit that in the current state of the world, it comes with added costs for the consumers as well. Most developers don’t know much about RDF and surrounding tools and standards, so they have to learn about it in order to consume your dataset. These costs can easily outweigh potential benefits. Of course, the mission of the linked data community is to change that fact by popularizing RDF technologies and standards, so that might not be true anymore 5 years from now.
    *) I guess interlinking could also be partially tackled by adopting simple conventions for adding URLs and URIs as values in CSV rows. For example, referencing Wikipedia URLs. CSV files and rows in these files could then be related to each other via shared Wikipedia references. This would not be that different from the current Linked Data cloud, where DBpedia acts as a central hub for many other datasets.

    Cheers,
    Matthias

  8. Kingsley Idehen

    Just realized this post would benefit from a link to the Wikipedia article about: EAV/CR [1].

    Link:

    1. http://en.wikipedia.org/wiki/Entity-attribute-value_model

    Kingsley

  9. Scott Banwart's Blog » Blog Archive » Distributed Weekly 55

    [...] Why Carry the Cost of Linked Data? [...]

  10. Confluence: Área de Tecnología

    Linked Data talking points…

    Talking points: APIs vs. Linked Data…

  11. Ed Summers

    Not sure if this was covered in any of the comments already … but there are also costs associated with the consumer: CSV loads pretty nicely into Excel. What does the average Joe load RDF into?