Friday, January 11, 2013

Oh, the Semantic Web... wait, what?

In the last decade, a new term popped up on the Web, apparently related to the future of the Internet: the Semantic Web. However, it's still difficult to grasp this concept without going deeper than we generally have the time to go. That opened up space for many myths and misconceptions about what exactly the Semantic Web is, whether it is indeed a good idea, whether it's feasible, and whether one should really care about it.

As I already mentioned here on the blog, Nepomuk aims to offer the user a Semantic Desktop, and is closely related to the Semantic Web. Understanding the Semantic Web turned out to be of paramount importance for me to understand Nepomuk better, and thus to be more useful to the project. So I turned to the literature on the topic, and here I'll tell you what I found.

First of all, let's review and demystify some phrases that you have very likely heard before about the Semantic Web:

1- "The Semantic Web will more than certainly be the standard within X years!"

Many say the Semantic Web and the technologies related to it will someday reach an idyllic moment of great popularity. That's possible, as the solutions the Semantic Web tries to offer are widely applicable. But nowadays the fraction of the Web that is semantically enriched is minimal. The Semantic Web is still a vision, far from being a reality.

2- "The Semantic Web solves all the problems of the human race! You should start using RDF and OWL in all your projects now!"

The applicability of Semantic Web concepts is often exaggerated, especially when the pitch is aimed at executives and managers. The Semantic Web has interesting solutions to many problems, but not to all of them, and it also has its own drawbacks.

3- "Semantic Web works like magic: just use it and your application will become intelligent in ways you could never imagine!"

As with everything in Computer Science, modelling a good solution to a problem is always a lot more important than using good tools. The best tool is useless if used in the wrong way. The Semantic Web is a great tool for a lot of things, but it can't solve a problem if it plays the wrong role in the solution plan.

Putting all this aside, let's point out some issues we face with the current Web architecture.

Data sharing

In general, Web sites are means of dealing with data. And, obviously, many Web sites deal with very similar types of data. For example, social networks all deal with people, who have names, email addresses, pictures, and so on. But that is the user's perspective. From the machines' point of view, a person to one social network is nothing like a person to another one. Applications (like Social Plus) that try to integrate different social networks have to treat each one separately and specially, as there is no canonical definition of what a "Person" is, or of how "Person" entities can relate to each other.

Fine, that sounds like a good thing when you don't want to share the data you have. Surely, social networks want users to use them as exclusively as possible, and also want to own the data left in them (friendships, for example). But even in the sphere of social networks, it isn't necessarily good to keep all your data private. While that means others are not going to access information you own, it also means that you can't make much use of the data available outside. For example, it's not possible for Facebook to infer that one of its users is also the author of a blog or book, or a contributor to an open source project, because "Person" means different things for Facebook, your blog, GitHub, and every other Web site. That is, Facebook can download this page and read that "Gabriel" wrote this post. But how can it tell for sure that this is also one of its users? In the Semantic Web, entities have strong identifiers, which allow you to recognize entities in any context. Facebook would surely benefit from knowing where else its users go. If it knew you are a GitHub user, for example, it would be able to show ads or suggest pages and groups related to software development even if you never mentioned anything related to it.
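To make the idea of strong identifiers a bit more concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration: the records, the field names and the example.org identifiers are assumptions of mine, not what any real site publishes. It only shows why matching on a display name is a guess, while matching on a shared global identifier is not.

```python
# Hypothetical records, as two unrelated sites might publish them.
# The names, field names and example.org URIs are invented for this sketch.
blog_authors = [
    {"name": "Gabriel", "id": "http://example.org/people/gabriel-p#me"},
]
social_network_users = [
    {"name": "Gabriel", "id": "http://example.org/people/gabriel-s#me"},
    {"name": "Gabriel", "id": "http://example.org/people/gabriel-p#me"},
]


def match_by_name(author, users):
    """Guess: every user with the same display name is a candidate."""
    return [u for u in users if u["name"] == author["name"]]


def match_by_id(author, users):
    """With a shared global identifier, only the same entity matches."""
    return [u for u in users if u["id"] == author["id"]]


author = blog_authors[0]
name_matches = match_by_name(author, social_network_users)
id_matches = match_by_id(author, social_network_users)

print(len(name_matches))  # 2: the name alone is ambiguous
print(len(id_matches))    # 1: the identifier picks out one person
```

The name-based lookup returns both "Gabriel" accounts, so the site can only guess which one wrote the post. The identifier-based lookup returns exactly one.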

Showing yourself to the world

In other situations, you want to make your data as accessible as possible. A simple example would be online stores. If you are the owner of a Web site of that kind, you want everyone out there to know you are selling your products, the prices you offer, the locations you ship to. But you can't do much more than offer a graphical interface for users to browse and buy your products, and hope search engines index your site well. The ideal situation would be if you could yell in a universal language (that machines, not only humans, can understand): "Hey! We have the latest and coolest tablets in stock! And we have a physical store in your city, 5 minutes from your home!".

Interoperability in an ecosystem of applications

Of course, sometimes you really don't want the data you have to leave your playground. But even in those cases, you don't want information confined to one particular application if other software you control could benefit from it. Let's take Google for example: they run a social network and an email service, and the two systems deal with people. Wouldn't they want Gmail and Google+ to communicate with each other, exchanging information about people in a standardized way that both (and possibly other existing and future Google services) can understand? That's also Nepomuk's case: you don't want KAddressBook to publish your contacts list on the Web, but it would be awesome if other applications, like KMail, could know what a Contact is and use the knowledge that you know some people who have email addresses.

Semantic Data

The problem should be clear enough by now. The Semantic Web approaches it by introducing means of serializing data in uniform ways, sharing vocabularies and meanings, and representing knowledge, not only raw data.

Natural language is awesome for humans, but terrible for computers. We can read and understand (almost) anything other people write. But the same is not true for machines. Surely, search engines process and do a lot with pure natural language documents. You can answer many questions by doing a simple Google search and looking at the first two or three results. But the search engine is not processing semantics: the search results are the product of lots of guesses based on keywords, popularity, user behavior, and so on, that turn out to work. That task of answering questions by searching the Web would be much simpler and more effective in the Semantic Web. Semantic data is like natural language for computers: a program that can process semantic information can use virtually any source of semantic data without needing to be tweaked for it. Of course, that doesn't mean any program can do useful things with any data. The same happens with us: I can read and understand an article about Pygmy Marmosets, but that doesn't mean I can do much with it, or that it will interest me. On the other hand, a biologist could possibly see a research topic in it.

Nowadays, semantics are generally in applications, not in data. Relational databases store dumb data, while applications (tailored to use that specific database schema) intelligently use that data and show results to the user (in people's language). If you give an application an arbitrary relational database it was not designed to use, it won't be able to do anything with the data. Of course, a human reading the table and field names could probably tell what the database is about and know how to reason about it. But machines can't capture that meaning on their own. Using Semantic Web standards, data is made available together with a system of meanings and knowledge about its domain, which makes it easily readable by machines, regardless of details like table and field names, which are designed for humans, after all.
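Here is a rough sketch of what "meaning travelling with the data" could look like, again in plain Python. I'm assuming statements are written as simple (subject, property, value) tuples instead of using a real RDF library, and the two data sources and their example.org identifiers are invented. The property names come from FOAF, an existing public vocabulary for describing people. The point is only that one generic function can read both sources because they share a vocabulary, not a table layout.

```python
# Shared vocabulary: FOAF terms and the standard RDF "type" property.
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Two made-up sources that were never designed to work together.
address_book = [
    ("http://example.org/contacts/1", RDF_TYPE, FOAF + "Person"),
    ("http://example.org/contacts/1", FOAF + "name", "Alice"),
    ("http://example.org/contacts/1", FOAF + "mbox", "mailto:alice@example.org"),
]
project_site = [
    ("http://example.org/devs/bob", RDF_TYPE, FOAF + "Person"),
    ("http://example.org/devs/bob", FOAF + "name", "Bob"),
    ("http://example.org/devs/bob", FOAF + "mbox", "mailto:bob@example.org"),
    # Not a person, so it must not show up in the result.
    ("http://example.org/projects/x", FOAF + "name", "Project X"),
]


def people_with_email(statements):
    """Map each person's name to their mailbox, whatever the source."""
    people = {s for s, p, o in statements
              if p == RDF_TYPE and o == FOAF + "Person"}
    names = {s: o for s, p, o in statements if p == FOAF + "name"}
    result = {}
    for s, p, o in statements:
        if s in people and p == FOAF + "mbox":
            # Fall back to the identifier if the person has no name.
            result[names.get(s, s)] = o
    return result


contacts = people_with_email(address_book + project_site)
print(contacts)
```

Notice that the function never mentions "address book" or "project site". It only relies on the shared meaning of "Person", "name" and "mbox", which is exactly what a relational schema with arbitrary table and field names can't offer.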

At a glance

The Semantic Web is about publishing and using data on the Web in standard forms that embed precise definitions of meaning for machines. It makes it possible to have canonical identifiers for entities like people, books or products across the whole Web, creating an unambiguous vocabulary for machines to communicate with each other. It allows applications to use any data source without having been customized for a specific one. Searching in the Semantic Web is definitely easier, as machines can access knowledge and meaning about content, in addition to natural language representations, and thus make inferences and deductions that can't be more than wild guesses if all you have is an algorithm processing natural language.

Stated that way, the Semantic Web seems like a good idea. However, it will only be useful when a significant number of Web applications publish data in semantic forms. Before that, Tim Berners-Lee's dream will remain a dream. There are some other difficulties as well, which I didn't discuss here.

In the next post, I'll talk about the main concepts and technologies behind the Semantic Web. That includes strong identifiers (URIs), graphs and triples (RDF) and Ontologies (OWL). I hope it will then become clear how this set of abstract concepts like meaning and knowledge can be interpreted by machines.
