Archive for the 'Linked Data' Category

Bebo White Reviews the Linked Data Book for Journal of Web Engineering

I recently had an email giving advance notice that a review of the Linked Data Book (aka “Linked Data: Evolving the Web into a Global Data Space”) would appear in Volume 11(2) of the Journal of Web Engineering, published by Rinton Press (ISSN: 1540-9589). As some people won’t have easy access to the journal, the review is republished here, with permission. It’s by Bebo White of Stanford University and beyond — thank you Bebo for the thoughtful review, and to Rinton Press for allowing it to be republished here.

Web Engineering has been described as encompassing those “technologies, methodologies, tools, and techniques used to develop and maintain Web-based applications leading to better systems, [thus to] enabling and improving the dissemination and use of content and services through the Web.” (Source: International Conference on Web Engineering)

An especially interesting aspect of this description is “dissemination and use of content.” Semantic Web technologies and particularly the Linked Data paradigm have evolved as powerful enablers for the transition of the current document-oriented Web into a Web of interlinked data/content and, ultimately, into the Semantic Web.

To facilitate this transition many aspects of distributed data and information management need to be adapted, advanced and integrated. Of particular importance are approaches for (1) extracting semantics from unstructured, semi-structured and existing structured sources, (2) management of large volumes of RDF data, (3) techniques for efficient automatic and semi-automatic data linking, (4) algorithms, tools, and inference techniques for repairing and enriching Linked Data with conceptual knowledge, (5) the collaborative authoring and creation of data on the Web, (6) the establishment of trust by preserving provenance and tracing lineage, (7) user-friendly means for browsing, exploration and search of large, federated Linked Data spaces. Particularly promising might be the synergistic combination of approaches and techniques touching upon several of these aspects at once.

For Web Engineering practitioners interested in being a part of this Web transition, Linked Data – Evolving the Web into a Global Data Space by Heath and Bizer will provide a valuable resource. The authors have done an excellent job of addressing the subject in a logical sequence of well-written chapters reflecting technical fundamentals, coverage of existing applications and tools, and the challenges for future development and research. The seven important approaches mentioned earlier are described in a consistent way and illustrated by means of a hypothetical scenario that evolves over the course of the book. The size of this book (122 pages) is deceiving in that it does not reflect the quality and density of its content. The authors have succeeded in presenting a complex topic both succinctly and clearly. It is not a “quick read,” but rather a volume to be used for references, definitions, and meaningful and instructive code examples.

This book is available in digital format (PDF). It is the first in a planned series of books/lectures. The quality of this book should make the reader/practitioner look forward to the upcoming series volumes that promise to further explain the exciting future of this topic.

The Linked Data Book: Draft Table of Contents

Update 2011-02-25: the book is now published and available for download and in hard copy:

Original Post

Chris Bizer and I have been working over the last few months on a book capturing the state of the art in Linked Data. The book will be published shortly as an e-book and in hard copy by Morgan & Claypool, as part of the series Synthesis Lectures in Web Engineering, edited by Jim Hendler and Frank van Harmelen. There will also be an HTML version available free of charge on the Web.

I’ve been asked about the contents, so thought I’d reproduce the table of contents here. This is the structure as we sent it to the publisher — the final structure may vary a little but changes will likely be superficial. Register at Amazon to receive an update when the book is released.

  • Overview
  • Contents
  • List of Figures
  • Acknowledgements
  • Introduction
    • The Data Deluge
    • The Rationale for Linked Data
      • Structure Enables Sophisticated Processing
      • Hyperlinks Connect Distributed Data
    • From Data Islands to a Global Data Space
    • Structure of this book
    • Intended Audience
    • Introducing Big Lynx Productions
  • Principles of Linked Data
    • The Principles in a Nutshell
    • Naming Things with URIs
    • Making URIs Dereferenceable
      • 303 URIs
      • Hash URIs
      • Hash versus 303
    • Providing Useful RDF Information
      • The RDF Data Model
        • Benefits of using the RDF Data Model in the Linked Data Context
        • RDF Features Best Avoided in the Linked Data Context
      • RDF Serialization Formats
        • RDF/XML
        • RDFa
        • Turtle
        • N-Triples
        • RDF/JSON
    • Including Links to other Things
      • Relationship Links
      • Identity Links
      • Vocabulary Links
    • Conclusions
  • The Web of Data
    • Bootstrapping the Web of Data
    • Topology of the Web of Data
      • Cross-Domain Data
      • Geographic Data
      • Media
      • Government Data
      • Libraries and Education
      • Life Sciences
      • Retail and Commerce
      • User Generated Content and Social Media
    • Conclusions
  • Linked Data Design Considerations
    • Using URIs as Names for Things
      • Minting HTTP URIs
      • Guidelines for Creating Cool URIs
        • Keep out of namespaces you do not control
        • Abstract away from implementation details
        • Use Natural Keys within URIs
      • Example URIs
    • Describing Things with RDF
      • Literal Triples and Outgoing Links
      • Incoming Links
      • Triples that Describe Related Resources
      • Triples that Describe the Description
    • Publishing Data about Data
      • Describing a Data Set
        • Semantic Sitemaps
        • voiD Descriptions
      • Provenance Metadata
      • Licenses, Waivers and Norms for Data
        • Licenses vs. Waivers
        • Applying Licenses to Copyrightable Material
        • Non-copyrightable Material
    • Choosing and Using Vocabularies
      • SKOS, RDFS and OWL
      • RDFS Basics
        • Annotations in RDFS
        • Relating Classes and Properties
      • A Little OWL
      • Reusing Existing Terms
      • Selecting Vocabularies
      • Defining Terms
    • Making Links with RDF
      • Making Links within a Data Set
        • Publishing Incoming and Outgoing Links
      • Making Links with External Data Sources
        • Choosing External Linking Targets
        • Choosing Predicates for Linking
      • Setting RDF Links Manually
      • Auto-generating RDF Links
        • Key-based Approaches
        • Similarity-based Approaches
  • Recipes for Publishing Linked Data
    • Linked Data Publishing Patterns
      • Patterns in a Nutshell
        • From Queryable Structured Data to Linked Data
        • From Static Structured Data to Linked Data
        • From Text Documents to Linked Data
      • Additional Considerations
        • Data Volume: How much data needs to be served?
        • Data Dynamism: How often does the data change?
    • The Recipes
      • Serving Linked Data as Static RDF/XML Files
        • Hosting and Naming Static RDF Files
        • Server-Side Configuration: MIME Types
        • Making RDF Discoverable from HTML
      • Serving Linked Data as RDF Embedded in HTML Files
      • Serving RDF and HTML with Custom Server-Side Scripts
      • Serving Linked Data from Relational Databases
      • Serving Linked Data from RDF Triple Stores
      • Serving RDF by Wrapping Existing Application or Web APIs
    • Additional Approaches to Publishing Linked Data
    • Testing and Debugging Linked Data
    • Linked Data Publishing Checklist
  • Consuming Linked Data
    • Deployed Linked Data Applications
      • Generic Applications
        • Linked Data Browsers
        • Linked Data Search Engines
      • Domain-specific Applications
    • Developing a Linked Data Mashup
      • Software Requirements
      • Accessing Linked Data URIs
      • Representing Data Locally using Named Graphs
      • Querying Local Data with SPARQL
    • Architecture of Linked Data Applications
      • Accessing the Web of Data
      • Vocabulary Mapping
      • Identity Resolution
      • Provenance Tracking
      • Data Quality Assessment
      • Caching Web Data Locally
      • Using Web Data in the Application Context
    • Effort Distribution between Publishers, Consumers and Third Parties
  • Summary and Outlook
  • Bibliography

Arguments about HTTP 303 Considered Harmful

Ian recently published a blog post that he’d finally got around to writing, several months after a fierce internal debate at Talis about whether the Web of Data needs HTTP 303 redirects. I can top that. Ian’s post unleashed a flood of anti-303 sentiment that has prompted me to finish a blog post I started in February 2008.

Picture the scene: six geeks sit around a table in the bar of a Holiday Inn, somewhere in West London. It’s late, we’re drunk, and debating 303 redirects and the distinction between information and non-information resources. Three of the geeks exit stage left, leaving me to thrash it out with Dan and Damian. Some time shortly afterwards Dan calls me a “303 fascist”, presumably for advocating the use of HTTP 303 redirects when serving Linked Data, as per the W3C TAG’s finding on httpRange-14.

I never got to the bottom of Dan’s objection – technical? philosophical? historical? – but there is seemingly no end to the hand-wringing that we as a community seem willing to engage in about this issue.

Ian’s post lists nine objections to the 303 redirect pattern, most of which don’t stand up to closer scrutiny. Let’s take them one at a time:

1. it requires an extra round-trip to the server for every request

For whom is this an issue? Users? Data publishers? Both?

If it’s the former then the argument doesn’t wash. Consider a typical request for a Web page. The browser requests the page, the server sends the HTML document in response. (Wait, should that be “an HTML representation of the resource denoted by the URI”, or whatever? If we want to get picky about the terminology of Web architecture then let’s start with the resource/representation minefield. I would bet hard cash that the typical Web user or developer is better able to understand the distinction between information resources and non-information resources than between resources and representations of resources).

Anyway, back to our typical request for a Web page… The browser parses the HTML document, finds references to images and stylesheets hosted on the same server, and quite likely some JavaScript hosted elsewhere. Each of the former requires another request to the original server, while the latter triggers requests to other domains. In the worst case scenario these other domains aren’t in the client’s (or their ISP’s) DNS cache, thereby requiring a DNS lookup on the hostname and increasing the overall time cost of the request.

In this context, is a 303 redirect and the resulting round-trip really an issue for users of the HTML interfaces to Linked Data applications? I doubt it.

Perhaps it’s an issue for data publishers. Perhaps those serving (or planning to serve) Linked Data are worried about whether their Web servers can handle the extra requests/second that 303s entail. If that’s the case, presumably the same data publishers insist that their development teams chuck all their CSS into a single stylesheet, in order to prevent any unnecessary requests stemming from using multiple stylesheets per HTML document. I doubt it.

My take home message is this: in the grand scheme of things, the extra round-trip stemming from a 303 redirect is of very little significance to users or data publishers. Eyal Oren raised the very valid question some time ago of whether 303s should be cached. Redefining this in the HTTP spec seems eminently sensible. So why hasn’t it happened? If just a fraction of the time spent debating 303s and IR vs. NIR was spent lobbying to get that change made then we would have some progress to report. Instead we just have hand-wringing and FUD for potential Linked Data adopters.

2. only one description can be linked from the toucan’s URI

Do people actually want to link to more than one description of a resource? Perhaps there are multiple documents on a site that describe the same thing, and it would be useful to point to them all. (Wait, we have a mechanism for that! It’s called a hypertext/hyperdata link). But maybe someone has two static files on the same server that are both equally valid descriptions of the same resource. Yes, in that case it would be useful to be able to point to both; so just create an RDF document that sits behind a 303 redirect and contains some rdfs:seeAlso statements pointing to the more extensive descriptions, or serve up your data from an RDF store that can pull out all statements describing the resource, and return them as one document.
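As a sketch of that pattern (in Python with rdflib; the URIs are placeholders, not a recommendation for any particular scheme), the description document served after the 303 might be built like this:

    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import RDFS

    # Placeholder URIs: the toucan itself, plus two further documents describing it.
    toucan = URIRef("http://example.org/id/toucan")
    overview_doc = URIRef("http://example.org/doc/toucan-overview")
    habitat_doc = URIRef("http://example.org/doc/toucan-habitat")

    g = Graph()
    g.add((toucan, RDFS.label, Literal("Toucan")))  # the core description...
    g.add((toucan, RDFS.seeAlso, overview_doc))     # ...plus pointers to the
    g.add((toucan, RDFS.seeAlso, habitat_doc))      # other, fuller descriptions

    print(g.serialize(format="turtle"))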

I don’t buy the idea that people actually want to point to multiple descriptions apart from in the data itself. If there are other equivalent resources out there on the Web then state their equivalence, don’t just link to their descriptions. There may be 10 or 100 or 1000 equivalent resources referenced in an RDF document. 303 redirects make it very clear which is the authoritative description of a specific resource.

3. the user enters one URI into their browser and ends up at a different one, causing confusion when they want to reuse the URI of the toucan. Often they use the document URI by mistake.

OK, let’s break this issue down into two distinct scenarios: Joe Public, who wants to bookmark something, and Joe Developer, who wants to hand-craft some RDF (using the correct URI to identify the toucan).

Again, I would bet hard cash that Joe Public doesn’t want to reuse the URI of the toucan in his bookmarks, emails, tweets etc. I would bet that he wants to reuse the URI of the document describing the toucan. No one sends emails saying “hey, read this toucan“. People say “hey, read this document about a toucan“. In this case it doesn’t matter one bit that the document URI is being used.

Things can get a bit more complicated in the Joe Developer scenario, and the awful URI pattern used in DBpedia, where it’s visually hard to notice the change from /resource to /data or /page, doesn’t help at all. So change it. Or agree to never use that pattern again. If documents describing things in DBpedia ended .rdf or .html would we even be having this debate?

Joe Developer also has to take a bit of responsibility for writing sensible RDF statements. Unfortunately, people like Ed, who seems to conflate himself with his homepage (and his router with its admin console), don’t help the general level of understanding. I’ve tried many times to explain to someone that I am not my homepage, and as far as I know I’ve never failed. In all this frantic debate about the 303 mechanism, let’s not abandon certain basic principles that just make sense.

I don’t think Ian was suggesting in his posts that he is his homepage, so let’s be very, very explicit about what we’re debating here — 303 redirects — and not muddy the waters by bringing other topics into the discussion.

4. it’s non-trivial to configure a web server to issue the correct redirect and only to do so for the things that are not information resources.

Ian claims this is non-trivial. Neither is running a Drupal installation; I know, because one powers linkeddata.org, and maintaining it is a PITA. That doesn’t stop thousands of people doing it. Let’s be honest, very little in Web technology is trivial. Running a Web server in your basement isn’t trivial – that’s why people created wordpress.com, Flickr, MySpace, etc., bringing Web publishing to the masses, and why most of us would rather pay Web hosting companies to do the hard stuff for us. If people really see this configuration issue as a barrier then they should get on with implementing a server that makes it trivial, or teach people how to make the necessary configuration changes.
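For what it’s worth, here is a rough sketch of what that configuration amounts to, written as a tiny Python/Flask application rather than Apache rewrite rules; the /id/, /doc/ and /page/ paths are just an illustrative URI layout, not a prescription:

    from flask import Flask, redirect, request

    app = Flask(__name__)

    # Illustrative layout: /id/<name> identifies a thing (a non-information
    # resource); /doc/<name> and /page/<name> are documents describing it.
    @app.route("/id/<name>")
    def thing(name):
        best = request.accept_mimetypes.best_match(
            ["application/rdf+xml", "text/turtle", "text/html"],
            default="text/html")
        if best in ("application/rdf+xml", "text/turtle"):
            return redirect(f"/doc/{name}", code=303)  # 303 to the RDF description
        return redirect(f"/page/{name}", code=303)     # 303 to the HTML description

    if __name__ == "__main__":
        app.run()

A real deployment would more likely use a couple of rewrite rules, but the logic involved is no more complicated than this.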

5. the server operator has to decide which resources are information resources and which are not without any precise guidance on how to distinguish the two (the official definition speaks of things whose “essential characteristics can be conveyed in a message”). I enumerate some examples here but it’s easy to get to the absurd.

The original guidance from the TAG stated that a 200 indicated an information resource, whereas a 303 could indicate any type of resource. If in doubt, use a 303 and redirect to a description of the resource. Simple.

6. it cannot be implemented using a static web server setup, i.e. one that serves static RDF documents

In this case hash URIs are more suitable anyway. This has always been the case.
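To make the contrast concrete, here is a minimal sketch (Python with rdflib, made-up URIs) of a static file built around a hash URI: a client dereferencing the URI strips the fragment, fetches the document with an ordinary 200 response, and finds the description of the fragment-identified thing inside it, so no redirect is ever needed:

    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import RDFS

    # Made-up hash URI: the document is toucan.ttl, the bird itself is #this.
    toucan = URIRef("http://example.org/static/toucan.ttl#this")

    g = Graph()
    g.add((toucan, RDFS.label, Literal("Keel-billed Toucan")))
    g.serialize(destination="toucan.ttl", format="turtle")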

7. it mixes layers of responsibility – there is information a user cannot know without making a network request and inspecting the metadata about the response to that request. When the web server ceases to exist then that information is lost.

Can’t this be resolved by adding additional triples to the document that describes the resource, stating the relationship between a resource and its description?
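For example (a sketch with placeholder URIs, using FOAF terms), the description document could carry triples like these, so that the resource/description relationship is stated in the data itself rather than only in the HTTP exchange:

    from rdflib import Graph, Namespace, URIRef

    FOAF = Namespace("http://xmlns.com/foaf/0.1/")

    toucan = URIRef("http://example.org/id/toucan")       # the thing
    toucan_doc = URIRef("http://example.org/doc/toucan")  # the document about it

    g = Graph()
    g.bind("foaf", FOAF)
    g.add((toucan_doc, FOAF.primaryTopic, toucan))
    g.add((toucan, FOAF.isPrimaryTopicOf, toucan_doc))

    print(g.serialize(format="turtle"))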

8. the 303 response can really only be used with things that aren’t information resources. You can’t serve up an information resource (such as a spreadsheet) and 303 redirect to metadata about the spreadsheet at the same time.

Metadata about an RDF document can be included in the document itself. Perhaps a more Web-friendly alternative to Excel could allow for richer embeddable metadata.

9. having to explain the reasoning behind using 303 redirects to mainstream web developers simply reinforces the perception that the semantic web is baroque and irrelevant to their needs.

I fail to see how Ian’s proposal, when taken as a whole package, is any less confusing.

~~~

Having written this post I’m wondering whether the time would have been better spent on something more productive, which is precisely how I feel about the topic in general. As geeks I think we love obsessing about getting things “right”, but at what cost? Ian’s main objection seems to be about the barriers we put in the way of Linked Data adoption. From my own experience there is no better barrier than uncertainty. Arguments about HTTP 303s are far more harmful than 303s themselves. Let’s put the niggles aside and get on with making Linked Data the great success we all want it to be.

Why Carry the Cost of Linked Data?

In his ongoing series of niggles about Linked Data, Rob McKinnon claims that “mandating RDF [for publication of government data] may be premature and costly”. The claim is made in reference to Francis Maude’s parliamentary answer to a question from Tom Watson. Personally I see nothing in the statement from Francis Maude that implies the mandating of RDF or Linked Data, only that “Where possible we will use recognised open standards including Linked Data standards”. Note the “where possible”. However, that’s not the point of this post.

There’s nothing premature about publishing government data as Linked Data – it’s happening on a large scale in the UK, US and elsewhere. Where I do agree with Rob (perhaps for the first time ;)) is that it comes at a cost. However, this isn’t the interesting question, as the same applies to any investment in a nation’s infrastructure. The interesting questions are who bears that cost, and who benefits?

Let’s make a direct comparison between publishing a data set in raw CSV format (probably exported from a database or spreadsheet) and making the extra effort to publish it in RDF according to the Linked Data principles.

Assuming that your spreadsheet doesn’t contain formulas or merged cells that would make the data irregularly shaped, or that you can create a nice database view that denormalises your relational database tables into one, then the cost of publishing data in CSV basically amounts to running the appropriate export of the data and hosting the static file somewhere on the Web. Dead cheap, right?

Oh wait, you’ll need to write some documentation explaining what each of the columns in the CSV file means, and what types of data people should expect to find in each of them. You’ll also need to create and maintain some kind of directory so people can discover your data in the crazy haystack that is the Web. Not quite so cheap after all.

So what are the comparable processes and costs in the RDF and Linked Data scenario? One option is to use a tool like D2R Server to expose data from your relational database to the Web as RDF, but let’s stick with the CSV example to demonstrate the lo-fi approach.

This is not the place to reproduce an entire guide to publishing Linked Data, but in a nutshell, you’ll need to decide on the format of the URIs you’ll assign to the things described in your data set, select one or more RDF schemata with which to describe your data (analogous to defining what the columns in your CSV file mean and how their contents relate to each other), and then write some code to convert the data in your CSV file to RDF, according to your URI format and the chosen schemata. Last of all, for it to be proper Linked Data, you’ll need to find a related Linked Data set on the Web and create some RDF that links (some of) the things in your data set to things in the other. Just as with conventional Web sites, if people find your data useful or interesting they’ll create some RDF that links the things in their data to the things in yours, gradually creating an unbounded Web of data.
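As a very rough sketch of what that conversion code might look like (Python with rdflib; the CSV columns, URI pattern, placeholder vocabulary and link target are all invented for illustration, and in practice you would reuse established vocabularies wherever possible):

    import csv
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF, RDFS, XSD

    BASE = "http://data.example.gov/id/school/"    # the chosen URI pattern
    EX = Namespace("http://example.org/schema/")   # stand-in for real vocabularies

    g = Graph()
    g.bind("ex", EX)

    with open("schools.csv", newline="") as f:
        for row in csv.DictReader(f):              # columns: id, name, pupils
            school = URIRef(BASE + row["id"])      # mint a URI per thing described
            g.add((school, RDF.type, EX.School))
            g.add((school, RDFS.label, Literal(row["name"])))
            g.add((school, EX.pupilCount, Literal(row["pupils"], datatype=XSD.integer)))
            # The linking step: connect this thing to a related resource in an
            # external data set (the target URI here is only a placeholder).
            g.add((school, EX.locatedIn, URIRef("http://dbpedia.org/resource/Birmingham")))

    g.serialize(destination="schools.ttl", format="turtle")

The final statement in the loop is the step that turns an isolated RDF dump into Linked Data.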

Clearly these extra steps come at a cost compared to publishing raw CSV files. So why bear these costs?

There are two main reasons: discoverability and reusability.

Anyone (deliberately) publishing data on the Web presumably does so because they want other people to be able to find and reuse that data. The beauty of Linked Data is that discoverability is baked in to the combination of RDF and the Linked Data principles. Incoming links to an RDF data set put that data set “into the Web” and outgoing links increase the interconnectivity further.

Yes, you can create an HTML link to a CSV file, but you can’t link to specific things described in the data or say how they relate to each other. Linked Data enables this. Yes, you can publish some documentation alongside a CSV file explaining what each of the columns means, but that documentation can’t be interlinked with the data itself to make it self-describing. Linked Data does this. Yes, you can include URIs in the data itself, but CSV provides no mechanism for indicating that the content of a particular cell is a link to be followed. Linked Data does this. Yes, you can create directories or catalogues that describe the data sets available from a particular publisher, but this doesn’t scale to the Web. Remember what the arrival of Google did to the Yahoo! directory? What we need is a mechanism that supports arbitrary discovery of data sets by bots roaming the Web and building searchable indices of the data they find. Linked Data enables this.

Assuming that a particular data set has been discovered, what is the cost of any one party using that data in a new application? Perhaps this application only needs one data set, in which case all the developer must do is read the documentation to understand the structure of the data and get on with writing code. A much more likely scenario is that the application requires integration of two or more data sets. If each of these data sets is just a CSV file then every application developer must incur the cost of integrating them, i.e. linking together the elements common to both data sets, and must do this for every new data set they want to use in their application. In this scenario the integration cost of using these data sets is proportional to their use. There are no economies of scale. It always costs the same amount, to every consumer.

Not so with Linked Data, which enables the data publisher to identify links between their data and third party data sets, and make these links available to every consumer of that data set by publishing them as RDF along with the data itself. Yes, there is a one-off cost to the publisher in creating the links that are most likely to be useful to data consumers, but that’s a one-off. It doesn’t increase every time a developer uses the data set, and each developer doesn’t have to pay that cost for each data set they use.

If data publishers are seriously interested in promoting the use of their data then this is a cost worth bearing. Why constantly reinvent the wheel by creating new sets of links for every application that uses a certain combination of data sets? Certainly as a UK taxpayer, I would rather the UK Government made this one-off investment in publishing and linking RDF data, thereby lowering the cost for everyone who wants to use the data. This is the way to build a vibrant economy around open data.

Putting a Conference into the Semantic Web

Chris Gutteridge asked this question about semantically enabling conference Web sites, which is a subject close to my heart. It’s hard to give a meaningful response in 140 characters, so I decided to get some headline thoughts down for posterity. If you want a fuller account of some first-hand experiences, then the following papers are a good place to start:

Top Five Tips for Semantic Web-enabling a Conference

1. Exploit Existing Workflows

Conferences are incredibly data-rich, but much of this richness is bound up in systems for e.g. paper submission, delegate registration, and scheduling, that aren’t native to the Semantic Web. Recognise this in advance and plan for how you intend to get the data from these systems out into the Web. The good news is that scripts now exist to handle dumps from submission systems such as EasyChair, but you may need to ensure that the conference instance of these systems is configured correctly for your needs. For example, getting dumps from these systems often comes at a price, and if you’re using one instance per track rather than the multi-track options, you may be in for a shock when you ask for the dumps. Speak to the Programme Chairs about this as soon as possible.

In my experience, delegate registration opens months in advance of a conference and often uses a proprietary, one-off system. As early as possible make contact with the person who will be developing and/or running this system, and agree how the registration system can be extended to collect data about the delegates and their affiliations, for example. Obviously there needs to be an opt-in process before this data is published on the public Web.

Collecting these types of data from existing workflows is monumentally easier than asking people to submit it later through some dedicated means. With this in mind, have modest expectations (in terms of degree of participation) for any system you hope to deploy for people to use before, during and after the conference, whether this is a personalised schedule planner, paper annotation system or rating system for local restaurants. People always have massive demands on their time, especially at a conference, so any system that isn’t already part of a workflow they are engaged with is likely to get limited uptake.

2. Publish Data Early then Incrementally Improve

Perhaps your goal in publishing RDF data about your conference is simply to do the right thing by eating your own dog food and providing an archival record of the event in machine-readable form. This is fine, but ideally you want people to use the published data before and during the event, not just afterwards. In an ideal world, people will use the data you publish as a foundation for demos of their applications and services at the conference, as a means to enhance the event and also to promote their own work. To maximise the chances of this happening you need to make it clear in advance that you will be publishing this data, and give an indication of what the scope of this will be. The RDF available from previous events in the ESWC and ISWC series can give an impression of the shape of the data you will publish (assuming you follow the same modelling patterns), but get samples out early and basic structures in place so people have the chance to prepare. Better to incrementally enhance something than save it all up for a big bang just one week before the conference.

3. Attend to the Details

Many of the recent ESWC and ISWC events have done a great job of publishing conference data, and have certainly streamlined the process considerably. However, along the way we’ve lost (or failed to attend to) some of the small but significant facts that relate to a conference, such as the location, venue, sponsors and keynote speakers. This stuff matters, and is the kind of data that probably doesn’t get recorded elsewhere. Obviously publishing data about the conference papers is important, but from an archival point of view this information is at least recorded by the publishers of the proceedings. The more tacit, historical knowledge about a conference series may be of great interest in the future, but is at risk of slipping away.

4. Piggy-back on Existing Infrastructure

As I discovered while coordinating the deployment of Semantic Web technologies for ESWC2006, deploying event-specific services is simply making a rod for your own back. Who is going to ensure these stay alive after the event is over and everyone moves on to the next thing? The answer is probably no-one. The domain registration will lapse, the server will get hacked or develop a fault, the person who once knew why that site mattered will take a job elsewhere, and the data will disappear in the process. Therefore it’s critical that every event uses infrastructure that is already embedded in everyday usage and also/therefore has a future. The best example of this is data.semanticweb.org, the de facto home for Linked Data from Semantic Web-related events. This service has support from SWSA, and enough buy-in from the community, to minimise the risk that it will ever go away. By all means host the data on the conference Web site if you must, but don’t dream of not mirroring it at data.semanticweb.org, with owl:sameAs links to equivalent URIs in that namespace for all entities in your data set.

5. Put Your Data in the Web

Remember that while putting your data on the Web for others to use is a great start, it’s going to be of greatest use to people if it’s also *in* the Web. This is a frequently overlooked distinction, but it really matters. No one in their right mind would dream of having a Web site with no incoming or outgoing links, and the same applies to data. Wherever possible the entities in your data set need to be linked to related entities in other data sets. This could be as simple as linking the conference venue to the town in which it is located, where the URI for the town comes from Geonames. Linking in this way ensures that consumers of the data can discover related information, and avoids you having to publish redundant information that already exists somewhere else on the Web. The really great news is that data.semanticweb.org already provides URIs for many people who have published in the Semantic Web field, and (aside from some complexities with special characters in names) linking to these really can be achieved in one line of code. When it’s this easy there really are no excuses.
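For illustration, that one line is essentially the final statement here (Python with rdflib; the venue URI and the Geonames URI are placeholders, and foaf:based_near is just one possible linking property):

    from rdflib import Graph, Namespace, URIRef

    FOAF = Namespace("http://xmlns.com/foaf/0.1/")

    venue = URIRef("http://data.example.org/conference/2011/venue")  # placeholder
    town = URIRef("http://sws.geonames.org/2654675/")                # placeholder Geonames URI

    g = Graph()
    g.add((venue, FOAF.based_near, town))  # the one linking triple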

Conclusions

Reading the above points back before I hit publish, I realise they focus on Semantic Web-enabling the conference as a whole, rather than specifically the conference Web site, which was the focus of Chris’s original question. I think we know a decent amount about publishing Linked Data on the Web, so hopefully these tips usefully address the process-oriented, rather than purely technical, aspects.

I’m not a lawyer, but… ISWC2009 Tutorial on Data Licensing

As the Linked Data community shifts its emphasis from publishing data on the Web to consuming it in applications, one question inevitably arises: “what are the terms under which different data sets can be reused?” There’s been a considerable amount of time and money invested by Talis and others in providing some clarity in this area; work that has evolved into the Open Data Commons. However, there remains a lot of confusion in this area, as evidenced by this thread on the Linking Open Data mailing list. Clearly there is more education and outreach work to be done about how licenses and waivers can be applied to data.

With this in mind Leigh Dodds, Jordan Hatcher, Kaitlin Thaney and I submitted a tutorial proposal to this year’s International Semantic Web Conference addressing exactly these kinds of issues. The good news is that our proposal has been accepted, and therefore there will be a half-day tutorial on “Legal and Social Frameworks for Sharing Data on the Web” at ISWC2009 in Washington DC in October (more details online soon). With Jordan present to provide the legal perspective there’ll finally be someone taking part in the discussion who can’t prefix their statements with “I’m not a lawyer, but…”

The Semantic Web is the Cake…but the Technologies are not the Layers

My last post about the relationship between Linked Data, the Semantic Web and the Semantic Web technology stack seemed to create more debate and disagreement than clarity. Not to be discouraged by this, I’ve been giving some more thought to analogies that may help to illuminate the relationship between different concepts in the Semantic Web space. This got me thinking about the Semantic Web layer cake.

The layer cake diagram is probably one of the most used and abused images associated with the Semantic Web vision. In the diagram, each technology or concept in the Semantic Web stack is a layer, with Crypto providing some kind of irregular icing down one side. I’d like to propose a different interpretation of the Semantic Web as a cake.

In my view, the technologies aren’t layers in the finished cake, they’re the raw ingredients that must be mixed and baked to make the cake that is the Semantic Web itself. URIs are the grains of flour; an ingredient that is essential but by itself rather bland, and lacking form and coherence. RDF triples are the egg that can bind together this URI flour. This cake is taking shape, but it’s lacking flavour. In the Semantic Web cake the flavourings are the vocabularies and ontologies, such as FOAF, SIOC, or the Programmes Ontology, concocted on a base of RDFS and OWL. Simple cakes based on one or two flavours can be very tasty, but for our Semantic Web cake to be truly delicious we want a wide range of flavours, with some dominating others in different parts of the cake.

Once we’ve baked our cake, by putting our RDF data online according to the Linked Data principles, we’ll probably want to decorate it. Perhaps some icing or cherries on top, in the form of inferred RDF triples, would make it even more delicious. With such an appealing data cake it’s inevitable that people will want to consume it, but we have to make sure that everyone can have a slice rather than letting a big data glutton run off with the cake and deprive everyone else of this treat. We need some sort of knife, preferably one like SPARQL, that allows people to help themselves to the parts of the cake they like best. With the cake baked and decorated, and with all the tools in place, it’s time to invite some friends round (maybe even some agents) and start consuming.

(I see that Jim Hendler’s keynote at ESWC2009 will talk about the layer cake; I’m intrigued to see how he chooses to serve up the analogy).

Linked Data? Web of Data? Semantic Web? WTF?

This post was prompted by this tweet from Tim O’Reilly

People learning about Linked Data frequently ask “what’s the relationship between Linked Data and the Semantic Web?”, which is a fair and good question. One of the responses that crops up relatively frequently is that Linked Data is just an attempt to rebrand the Semantic Web. In my experience these kinds of rebranding comments come mostly from people who have a certain impression of the Semantic Web vision (which may or may not be accurate), don’t like this vision, and therefore dismiss Linked Data on this basis without actually considering what it means (i.e. a means to dismantle data silos), and without necessarily rethinking their original view of the Semantic Web concept. I prefer to see it this way…

Think about HTML documents; when people started weaving these together with hyperlinks we got a Web of documents. Now think about data. When people started weaving individual bits of data together with RDF triples (that expressed the relationship between these bits of data) we saw the emergence of a Web of data. Linked Data is no more complex than this – connecting related data across the Web using URIs, HTTP and RDF. Of course there are many ways to have linked data, but in common usage Linked Data refers to the principles set out by Tim Berners-Lee in 2006.

So if we link data together using Web technologies, and according to these principles, the result is a Web of data. Personally I use the term Web of data largely interchangeably with the term Semantic Web, although not everyone in the Semantic Web world would agree with this. The precise term I use depends on the audience. With Semantic Web geeks I say Semantic Web, with others I tend to say Web of data – it’s not about rebranding, it’s about using terms that make sense to your audience, and Web of data speaks to people much more clearly than Semantic Web. Similarly, Linked Data isn’t about rebranding the Semantic Web, it’s about clarifying its fundamentals.

Tim Berners-Lee said several times last year, in public, that “Linked Data is the Semantic Web done right” (e.g. see these slides from Linked Data Planet in New York), and who am I to argue, it’s his vision. But to see this as a recent trend or a u-turn ignores the historical context. On page 191 of my copy of Weaving the Web (dated 2000, ISBN-13: 9781587990182) it says:

The first step is putting data on the Web in a form that machines can naturally understand, or converting it to that form. This creates what I call a Semantic Web – a web of data that can be processed directly or indirectly by machines.

I’m not sure this quote adequately captures the importance of links in the whole picture, but no one can claim that the Web of data label is recent marketing spin invented to make the Semantic Web palatable. This was always the deal. It’s certainly how I understood the concept (and what inspired me to do a PhD in the area).

If others have somehow diverted the Semantic Web vision down some side road since Weaving the Web was written, then that’s unfortunate. (In my experience the Linking Open Data project was an attempt to reconnect the Semantic Web community with some of the key aspects of the original vision that were being overlooked, like having a real Web of data as the basis for research). I certainly notice plenty of unjustified attempts at present to co-opt the term Semantic Web, now that it’s no longer a dirty word, and drive it off down some dodgy alleyway. Some of these products, services or companies may be applications or services that use some semantic technology and are delivered over the Web, but that doesn’t make them Semantic Web applications, services or companies. Anything claiming the Semantic Web label needs to get its hands dirty with Linked Data somewhere along the way. That’s just how it is.

So to return to Tim O’Reilly’s tweet, he’s not far wrong about the lack of difference between Linked Data, Semantic Web and RDF (we’ll ignore the means vs end vs technology distinction), but I’d love to know who he’s quoting about the explicit rebranding.

Linked Data Tutorials at Semantic Web Austin

I spent a few days last week in Austin, Texas, running two one-day tutorials about Linked Data. Juan was a great host, and the tutorials themselves were great fun (and significantly enhanced by the post-tutorial beers supplied by the very kind folk at The Guardian).  I was incredibly impressed by the energy, enthusiasm and foresight of the members of the Semantic Web Austin interest group that Juan and John De Oliveira kick-started. The city itself has an amazing can-do attitude, and this was reflected in the diverse group of attendees at the tutorials. It was great to see so many completely new faces, and see first hand that Linked Data appeals well beyond the traditional Semantic Web community. If Juan’s energy is anything to go by I wouldn’t be at all surprised if Austin storms to the position of Semantic Web capital of the USA by the end of 2009. The slides from the tutorial are online on my site (PDF, 2.8M), and the photos of work and beer are on Flickr.

On Snake Oil

Greetings, from the shady corner of the marketplace, where dubious characters tell tales of substances with mystical properties, and push their wares on unsuspecting passersby…

Today I had the dubious privilege of being branded a snake oil salesman, on the grounds that my “boosting” of the Semantic Web isn’t backed up by adequate eating of my own Semantic Web dog food. Apparently neither my publications page nor any of the other pages on my site has any “intelligent content tagging”, whatever that is (I assume this means RDFa).

If it does mean RDFa, then true, but this does completely overlook the RDF/XML on my site, which as a whole is built according to Linked Data principles. More galling is that the claim completely overlooks the work I’ve done in the Semantic Web community that kick-started a lot of the ongoing dog food activity at ESWC and ISWC (this was not a lone effort by any means: Knud Moeller, Sean Bechhofer, Chris Bizer, Richard Cyganiak and many, many others deserve as much credit as I do, or more, for ensuring it continued, as will Michael Hausenblas and Harith Alani at ESWC2009). Just to rub salt into the wound, these inaccuracies are being propagated across the Web in other people’s blog comments, e.g. here.

I’d like to end with some insightful meta analysis or reflection on this, but unfortunately I need to get ready for a trip to the US to run a workshop on Visual Interfaces to the Social and Semantic Web and give two days worth of Linked Data tutorials. Hope I don’t get stopped at US customs with that consignment of snake oil ;) So, no great insights for now, just a copy of my response in case it ever disappears from the original site in a puff of smoke.

——–

Seth,

I would be the first to agree with you that the Semantic Web community has not always eaten its own dogfood to the extent that it should have. It was for exactly this reason that in 2006 I produced RDF descriptions of almost all aspects of the European Semantic Web Conference (http://www.eswc2006.org/rdf/) and coordinated the deployment of numerous Semantic Web technologies at the conference (http://www.eswc2006.org/technologies/). My aim was to learn about deploying these technologies in the wild, and feed back my findings (positive or negative) to the community. The results of my evaluation were published here: http://swui.semanticweb.org/swui06/papers/Heath/Heath.pdf

Regarding the production of RDF to describe Semantic Web conferences, there had been some small efforts in this direction at previous events, but nothing comprehensive. ESWC2006 changed that for good, and there have been RDF descriptions of all European and International Semantic Web conferences published ever since. This data has been published using an ontology that derives largely from the one I created for ESWC2006, with significant contributions along the way from others. There is now a regular position on the organising committee of these conferences for people charged with coordinating this effort for the event. Knud Moeller and I shared this role at ISWC2007, where we also reported back to the community on our efforts up to that point: http://iswc2007.semanticweb.org/papers/795.pdf. Many other people have contributed significantly along the way, and this combined effort has produced the repository of data at http://data.semanticweb.org/ to which RDF descriptions of ESWC2009 will also be added.

But as you point out, the institutions that promote the Semantic Web also need to put their money where their mouth is. Agreed. While I was a PhD student at The Open University’s Knowledge Media Institute, I argued for developer time to add RDF descriptions about all KMi members to the institute’s People pages (http://kmi.open.ac.uk/people/), and tutored the developers in how to apply their existing Web development skills to exposing Semantic Markup.

My PhD work included development of the reviewing and rating site Revyu.com (http://revyu.com/), which won first prize in the 2007 Semantic Web Challenge. I can’t speak for the judges, but my hunch is that a major factor in Revyu’s success in the Challenge stemmed from its strict adherence to the Linked Data principles (http://www.w3.org/DesignIssues/LinkedData.html), which have done so much to help people make the Semantic Web a reality. Revyu publishes human-readable (i.e. HTML) and machine-readable (i.e. RDF) content side by side, but humans won’t see this RDF (I assume this is what you mean by “intelligent content tagging”) unless they know where to look; this is the intended behaviour, and works according to the techniques described in the How to Publish Linked Data on the Web tutorial (http://linkeddata.org/docs/how-to-publish) that I co-authored with Chris Bizer and Richard Cyganiak.

My personal Web site follows the same principles and uses the same techniques. If you view the source of my homepage you will see a link tag in the header that looks like this:
<link rel="meta" type="application/rdf+xml" title="RDF" href="http://tomheath.com/home/rdf" />
This is the link that tells Semantic Web crawlers to look elsewhere for the semantic markup on my site, not in the human-readable HTML page where it might get broken if I tweak the layout. If we ever meet in person I will give you one of my business cards, which doesn’t give the address of my homepage – it gives my Web URI (http://tomheath.com/id/me); humans and machines can look up this URI and retrieve information about me in a form that suits them (i.e. HTML or RDF), and follow links in that HTML or RDF to other related information. In the words of Tim Berners-Lee, this setup is “the Semantic Web done right, and the Web done right”.
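To make that concrete, here is a rough sketch of how a client might ask for each form (Python with the requests library; exactly which documents come back depends on how the server is configured at the time):

    import requests

    uri = "http://tomheath.com/id/me"

    # A Linked Data-aware client asks for RDF...
    rdf_resp = requests.get(uri, headers={"Accept": "application/rdf+xml"})
    # ...while a browser effectively asks for HTML.
    html_resp = requests.get(uri, headers={"Accept": "text/html"})

    # requests follows any redirects; .url is the document we ended up at,
    # while the original URI continues to identify the person, not a document.
    print(rdf_resp.url, rdf_resp.headers.get("Content-Type"))
    print(html_resp.url, html_resp.headers.get("Content-Type"))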

Yes, it’s unfortunate that my publications page doesn’t have an RDF equivalent; perhaps I’ve been too busy investing time and energy in initiatives that will have an impact beyond the scope of my own Web site? But either way, your comment that “nor any of his other pages (that I saw) uses any form of intelligent content tagging” just doesn’t stack up. Before you make these sorts of claims I would ask, reasonably and politely, that you show due diligence in looking thoroughly, and in the right places, for the semantic markup on my site. For anyone who is in any doubt that it’s there, click on the small “RDF META” tile on the right hand side of pages on my site.

I think it’s also reasonable to expect, if you’re truly interested in how the Semantic Web community is tackling this issue, that I might be given the chance to respond to your queries in advance of this article going out, as Adrian Paschke and Alexander Wahler were. I can only hope that this response helps provide a fuller picture of the situation, with respect to my efforts and those of the community at large.

Lastly, a technical point. We need to remember that the Semantic Web allows anyone to say anything, anywhere (I’m borrowing from Dean Allemang and Jim Hendler here). So, while RDF data about my publications may not be available on my own site yet, you can find pieces of the jigsaw at data.semanticweb.org, and if all conferences and journals published their proceedings/tables of contents in RDF, then my job would simply be to join the pieces together, and I wouldn’t be faced with manually updating my list of publications. OK, so we’re not there quite yet. Yes, there’s work to be done, but we’re trying.

Tom.
——–