
Saturday, January 26, 2013

Making the Semantic Web work


In the last post, we saw what the Semantic Web is and what it aims to achieve. But we used abstract concepts like "meaning" and "knowledge", and hard-to-imagine scenarios like mining the Web and extracting meaningful information from it without tweaking applications to parse the specific HTML hierarchies of individual Web sites. Now we're going to see how the Semantic Web vision is being implemented, and hopefully get a better grasp of what it will accomplish. By the end, you'll notice a strong similarity between the technologies and concepts presented here and two very recent advances in widely used search engines: Google's Knowledge Graph and Facebook's Graph Search (announced but not yet released). Although we can't tell whether they internally use RDF and ontologies - two concepts we'll see in a moment - they are surely using, at the very least, something very similar. Without further ado, let's get started.

Triples


The most primitive piece of information in the Semantic Web is a triple, with the following form:

(subject, predicate, object)

Informal examples of triples would be (Ben, has spouse, Anne) and (Ben, has name, "David Benjamin"). Each triple adds some information to the thing that we call "Ben". Predicates are relationships between a thing and another thing or some primitive value. In the examples, the relation, or predicate, "has spouse" links Ben to Anne, and "has name" links Ben to his full name. We can think of a collection of many triples as a directed graph, where subjects and objects are vertices and triples are edges from the subject to the object having the predicate as a label.
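
To make the graph reading concrete, here is a toy sketch in Python - plain tuples, no RDF library, and all names invented for illustration:

```python
# Triples as plain Python tuples: (subject, predicate, object).
triples = [
    ("Ben", "hasSpouse", "Anne"),
    ("Ben", "hasName", "David Benjamin"),
]

# Read as a directed graph: subjects and objects are vertices,
# and each triple is an edge labeled with its predicate.
def edges_from(graph, vertex):
    """All outgoing (label, target) edges of a vertex."""
    return [(p, o) for (s, p, o) in graph if s == vertex]

print(edges_from(triples, "Ben"))
# [('hasSpouse', 'Anne'), ('hasName', 'David Benjamin')]
```

Everything we know about "Ben" is just the set of edges leaving his vertex; adding a new fact is adding a new triple.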



In this example, "Ben" is the internal name we gave to an entity. Obviously, that's not practical, for a reason we
criticized in the last post: it's not canonical. If we want to share our triples or use triples coming from outside, how will others know our Ben is the same as their Ben? Conversely, how will we mine information about our Ben on the Web without confusing it with data related to other Bens? The solution proposed by the Semantic Web is to have each node of that graph identified by a URI, preferably a URI that is also a URL. So, instead of "Ben", we could have "http://www.example.com/ben_kennedy", or a URL pointing to his Facebook/Google+ profile. We can use any URI that's certainly not being used to identify anyone else. The recommendation goes even further: if it's a URL, then it had better point to a file that describes the entity identified by it (in RDF, a format we'll see later in this post). Generally, this URI will either be provided by Ben himself or be extracted from an outside source. That's what makes it a canonical identifier for him.

So, your URI is your signature in the Semantic Web. In the future, Web sites may ask for your URI when you create an account, so they can recognize you in semantic terms. One issue that may arise here is: what if two systems use different URIs to identify the same thing in the real world? Well, there are two cases: when you know the two URIs mean the same thing, and when you don't. In the first situation, there is an easy fix: in an ontology (we'll get there in a moment), you can explicitly state that two URIs are synonyms, and then everything works. On the other hand, when you don't know that two URIs are in fact synonyms, there isn't much to do. But it is possible that someone out there knows the relation between the URIs and has explicitly stated they are synonyms in their public data, and you can use that information seamlessly.
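
A rough sketch of how declared synonyms can be resolved, again in toy Python (the URIs and the mapping are invented; in OWL this kind of statement is what the owl:sameAs property expresses):

```python
# Hypothetical synonym table, in the spirit of owl:sameAs declarations.
same_as = {
    "http://other.example/people/ben": "http://www.example.com/ben_kennedy",
}

def canonical(uri):
    """Follow sameAs links until a canonical URI is reached."""
    while uri in same_as:
        uri = same_as[uri]
    return uri

print(canonical("http://other.example/people/ben"))
# http://www.example.com/ben_kennedy
```

Once both URIs resolve to the same canonical one, triples about either of them can be merged into a single description of the entity.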

Resource Description Framework: A standard for triples


Obviously, if you want to exchange triples with Web systems you don't know in advance, a common communication language is needed. That's exactly what RDF is: a specification for representing triples. An RDF file is, in the final analysis, a list of triples, which makes it represent a semantic graph (or part of one). There are several file formats for serializing RDF: N-Triples, N3, Turtle, RDF/XML, and some others. More generally, systems store triples in some implementation-specific form and export them to the outside world as RDF files. The storage system for the triples may even be (and usually is) a traditional SQL database.
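
As a taste of how simple some of these serializations are, here is a minimal sketch that emits N-Triples-style lines from Python tuples. It glosses over many details of the real format (escaping, datatypes, language tags) and the URIs are made up:

```python
def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples as N-Triples-style lines.
    Objects starting with "http" are treated as URIs, everything else as literals."""
    lines = []
    for s, p, o in triples:
        obj = f"<{o}>" if o.startswith("http") else f'"{o}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)

doc = to_ntriples([
    ("http://www.example.com/ben_kennedy",
     "http://www.example.com/hasName",
     "David Benjamin"),
])
print(doc)
# <http://www.example.com/ben_kennedy> <http://www.example.com/hasName> "David Benjamin" .
```

One triple per line, period-terminated: that's essentially all there is to N-Triples, which is why it's a popular dump format for large datasets.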

To find information in an RDF graph, one generally uses SPARQL, a standardized query language for RDF. SPARQL queries offer a level of freedom and flexibility that SQL doesn't. For example, suppose you need to see a doctor, but would like a recommendation. One way to go is to query your favorite social network for friends of yours who have doctors as friends. This translates easily to a SPARQL query:

SELECT ?friend ?doctor ?doctorPhoneNumber
WHERE {
    me isFriendOf ?friend .
    ?friend isFriendOf ?doctor .
    ?doctor a Doctor .
    OPTIONAL { ?doctor hasPhoneNumber ?doctorPhoneNumber . }
}

This would show all friends of yours who know doctors, and the doctors' phone numbers for those who have one available in the database. Of course, "me" should be your URI, and the predicates isFriendOf and hasPhoneNumber and the class Doctor are not standard, but just a simple example. With some clever natural language processing, one could think of translating a user query to a SPARQL query to achieve what Facebook's Graph Search supposedly will do. Take a look at their examples and see how SPARQL is way more natural than SQL to find information. The example query above could easily come from a user query "doctors that are friends of my friends".
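
To see what such a query does under the hood, here is a toy evaluation of the same pattern over a list of tuples in Python. All the data and predicate names are invented, and a real SPARQL engine does this far more generally and efficiently:

```python
# A tiny toy graph; names and predicates are made up for illustration.
graph = [
    ("me", "isFriendOf", "carol"),
    ("carol", "isFriendOf", "dave"),
    ("dave", "a", "Doctor"),
    ("dave", "hasPhoneNumber", "555-0101"),
    ("me", "isFriendOf", "erin"),
    ("erin", "isFriendOf", "frank"),
    ("frank", "a", "Doctor"),  # frank has no phone number (the OPTIONAL part)
]

def doctors_my_friends_know(graph):
    """Mimics the SPARQL query above: friends of mine who know doctors,
    with the doctor's phone number when one is available."""
    results = []
    for s1, p1, friend in graph:
        if (s1, p1) != ("me", "isFriendOf"):
            continue
        for s2, p2, doctor in graph:
            if (s2, p2) != (friend, "isFriendOf"):
                continue
            if (doctor, "a", "Doctor") not in graph:
                continue
            phones = [o for (s, p, o) in graph
                      if (s, p) == (doctor, "hasPhoneNumber")]
            results.append((friend, doctor, phones[0] if phones else None))
    return results

print(doctors_my_friends_know(graph))
# [('carol', 'dave', '555-0101'), ('erin', 'frank', None)]
```

Each nested loop corresponds to one triple pattern in the WHERE clause, and the phone lookup that may come back empty plays the role of OPTIONAL.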

Ontologies: triples specified


The presented SPARQL query for finding doctors your friends know has a simple yet critical problem: it rests on the weak assumption that people in the graph are related using "isFriendOf" predicates, that there is a class of people called "Doctor", and that people are connected to their phone numbers using the predicate "hasPhoneNumber". While those names are reasonable for people, they don't tell machines much. Besides that, if you want to share information and use data from the Web without modifying your code for each data source, you'll run into the same standardization problem that would prevent you from knowing that people on the Web are the same people in your database. What would prevent other sites from representing friendships with a predicate called "hasFriend"?

One way to solve such a problem would be a public agreement that everyone representing people uses some standard. That's basically what ontologies are: specifications of how to represent and relate things in an RDF graph. Nowadays, the most widely used ontology is FOAF (Friend of a Friend), an ontology for describing people and friendships. The idea is to have an RDF file describing you and your friends, relating you to things you do, to your interests, etc., independently of any social network. Surely, users are not supposed to write an RDF file manually; there should be user-friendly ways of getting one (for example, some social networks support exporting public user data as FOAF files).

So, whenever you are about to represent people, in order to use and share information on the Web, you should use FOAF. What if you want to represent something about people that is outside FOAF's scope? For example, you could be developing a professional network Web site, like LinkedIn, and want to represent corporations and employment relations. Well, nothing hinders the simultaneous use of more than one ontology. Thus, for professional relations and companies, you can create your own ontology (or use an existing one) and use FOAF for people. Because ontologies use URIs as identifiers, you'll know that a "thing" (in this case, a person) appearing in two domains is still the same entity in the real world.
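
Here is a sketch of what mixing vocabularies looks like at the triple level, in the same toy Python style. The FOAF namespace is the real one; the "pro" professional-network namespace and all the data are invented:

```python
FOAF = "http://xmlns.com/foaf/0.1/"           # real, widely used ontology
PRO = "http://www.example.com/pro-ontology/"  # hypothetical custom ontology

me = "http://www.example.com/gabriel"
triples = [
    (me, FOAF + "name", "Gabriel"),                # person data: FOAF terms
    (me, FOAF + "knows", "http://www.example.com/ben_kennedy"),
    (me, PRO + "worksFor", PRO + "SomeCompany"),   # professional data: custom terms
]

# The same subject URI ties both vocabularies to one real-world entity.
predicates_about_me = sorted(p for (s, p, o) in triples if s == me)
print(predicates_about_me)
```

Because every predicate is a full URI, a consumer can tell at a glance which statements come from FOAF and which come from the custom ontology, while still seeing they all describe the same person.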

Now, if a team is developing a Semantic Web application, why would it refuse to conform to an existing ontology? Well, it may be that the ontology is insufficient for the developers' specific needs. In that situation, the team will create its own ontology. Why not publish it, together with some data, so that others can benefit from it and also help you by publishing their data in a way you can understand?

Ontologies are generally represented in the Web Ontology Language (OWL). An OWL file is, in fact, also an RDF file. The OWL specification has enough vocabulary to model ontologies regardless of the domain they serve.

The Giant Global Graph


Today, the Internet contains a World Wide Web of documents that link to each other. In the vision of the Semantic Web, we will have a Giant Global Graph instead. This graph would have semantic information about everything that is now only written for humans in natural language. Can you imagine what we could do with that?

At a glance


The Semantic Web is not just an abstract vision of the future: the tools needed for it to become a reality are all available and in use. It depends a lot on data sharing and publishing, which most of today's companies aren't willing to do. However, that doesn't ruin it all, as there's already a significant amount of semantic data available on the Web (Freebase is an example). Also, Google's Knowledge Graph and Facebook's Graph Search are clearly benefiting from Semantic Web concepts and ideas (and, who knows, standards and technologies as well), which shows that the Semantic Web, although still a dream, is by no means dead.

Friday, January 11, 2013

Oh, the Semantic Web... wait, what?


In the last decade, a new term popped up on the Web, apparently related to the future of the Internet: the Semantic Web. However, it's still difficult to grasp this concept without going deeper than we generally have time to go. That opened up space for many myths and misconceptions about what exactly the Semantic Web is, whether it is indeed a good idea, whether it's feasible, and whether one should really care about it.

As I've already mentioned here on the blog, Nepomuk aims to offer the user a Semantic Desktop, and is closely related to the Semantic Web. Understanding the Semantic Web turned out to be of paramount importance for me to understand Nepomuk better, and thus be more useful to the project. So, I resorted to the literature on the topic, and here I'll tell you what I found.

First of all, let's review and demystify some phrases that you have very likely heard before about the Semantic Web:

1- "The Semantic Web will more than certainly be the standard within X years!"

Many say the Semantic Web and its related technologies will someday reach an idyllic moment of great popularity. That's possible, as the solutions the Semantic Web tries to offer are widely applicable. But nowadays the fraction of the Web that is semantically enriched is minimal. The Semantic Web is still a vision, far from being a reality.

2- "The Semantic Web solves all the problems of the human race! You should start using RDF and OWL in all your projects now!"

The applicability of Semantic Web concepts is often exaggerated, especially when the propaganda is aimed at executives and managers. The Semantic Web has interesting solutions to many problems, but not all, and it also has its own drawbacks.

3- "Semantic Web works like magic: just use it and your application will become intelligent in ways you could never imagine!"

Like everything in Computer Science, modelling a good solution to a problem is always a lot more important than using good tools. The best tool is useless if used in the wrong way. The Semantic Web is a great tool to do a lot of things, but it can't solve a problem if it plays the wrong role in the solution plan.

Putting all this aside, let's point out some issues we face with the current Web architecture.

Data sharing

In general, Web sites are means of dealing with data. And, obviously, many Web sites deal with very similar types of data. For example, social networks all deal with people, who have names, email addresses, pictures, and so on. But that is the user's perspective. From the machines' point of view, a person in one social network is nothing like a person in another one. Applications (like Social Plus) that try to integrate different social networks have to treat each one separately and specially, as there is no canonical definition of what a "Person" is or how "Person" entities can relate to each other.

Fine, that sounds like a good thing when you don't want to share the data you have. Surely, social networks want users to use them as exclusively as possible, and also to own the data left in them (friendships, for example). But even in the sphere of social networks, it isn't necessarily good to keep all your data private. While that means others are not going to access information you own, it also means that you can't make much use of the data available outside. For example, it's not possible for Facebook to infer that one of its users is also the author of a blog or book, or a contributor to an open source project, because "Person" means different things to Facebook, your blog, Github, and every other Web site. That is, Facebook can download this page and read that "Gabriel" wrote this post. But how will it tell for sure that this is also one of its users? In the Semantic Web, entities have strong identifiers, which allow you to recognize them in any context. Facebook would surely benefit from knowing where else its users go. If it knew you were a Github user, for example, it would be able to show ads or suggest pages and groups related to software development even if you never mentioned anything related to it.

Showing yourself to the world

In other situations, you want to make your data as accessible as possible. A simple example would be online stores. If you are the owner of such a Web site, you want everyone out there to know you are selling your products, the prices you offer, and the locations you ship to. But you can't do much more than offer a graphical interface for users to browse and buy your products and hope search engines index your site well. The ideal situation would be if you could yell in a universal language (one that machines, not only humans, can understand): "Hey! We have the latest and coolest tablets in stock! And we have a physical store in your city, 5 minutes from your home!".

Interoperability in an ecosystem of applications

Of course, sometimes you really don't want your data to leave your playground. But even then, you don't want information confined in one particular application if other software you control could benefit from it. Take Google, for example: they run a social network and an email service, and both systems deal with people. Wouldn't they want Gmail and Google+ to communicate with each other, exchanging information about people in a standardized way that both (and possibly other existing and future Google services) can understand? That's also Nepomuk's case: you don't want KAddressBook to publish your contact list on the Web, but it would be awesome if other applications, like KMail, could know what a Contact is and use the knowledge that you know some people who have email addresses.

Semantic Data

The problem is already clear enough. The Semantic Web approaches it by introducing means of serializing data in uniform ways, sharing vocabularies and meanings and representing knowledge, not only raw data.

Natural language is awesome for humans, but terrible for computers. We can read and understand (almost) anything other people write. But the same is not true for machines. Surely, search engines process and do a lot with pure natural language documents. You can answer many questions by doing a simple Google search and looking at the first two or three results. But that's not processing semantics: the search results are the product of lots of guesses based on keywords, popularity, user behavior, and so on, that happen to work. The task of answering questions by searching the Web would be much simpler and more effective in the Semantic Web. Semantic data is like natural language for computers: a program that can process semantic information can use virtually any source of semantic data without being tweaked for it. Of course, that doesn't mean any program can do useful things with any data. The same happens with us: I can read and understand an article about Pygmy Marmosets, but that doesn't mean I can do much with it, or that it will interest me. A biologist, on the other hand, could see a research topic in it.

Nowadays, semantics generally lives in applications, not in data. Relational databases store dumb data; applications (tailored to a specific database schema) intelligently use that data and show results to the user (in people's language). If you give an application an arbitrary relational database it was not designed to use, it won't be able to do anything with the data. Of course, a human reading the table and field names could certainly tell what the database is about and reason about it. But machines can't capture that meaning on their own. Using Semantic Web standards, data is made available together with a system of meanings and knowledge about its domain, which makes it easily readable by machines, regardless of details like table and field names, which are designed for humans, after all.
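
The contrast can be sketched in toy Python: a consumer that only knows a shared vocabulary (here a single invented predicate URI), not any particular database schema, can use data from any source that publishes with that vocabulary:

```python
# A hypothetical shared vocabulary that both sources agree on.
NAME = "http://www.example.com/vocab/hasName"

# Two independent data sources; internal layouts differ, the vocabulary doesn't.
source_a = [("http://a.example/p1", NAME, "Alice")]
source_b = [("http://b.example/users/7", NAME, "Bob")]

def all_names(*sources):
    """Works on any source using the shared vocabulary - no per-source code."""
    return [o for src in sources for (s, p, o) in src if p == NAME]

print(all_names(source_a, source_b))
# ['Alice', 'Bob']
```

No code was written for source_a or source_b specifically; the meaning travels with the data, in the predicate URI, instead of being hardcoded in the application.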

At a glance

The Semantic Web is about publishing and using data on the Web in standard forms that embed precise definitions of meaning for machines. It makes it possible to have canonical identifiers for entities like people, books or products across the whole Web, creating an unambiguous vocabulary for machines to communicate with each other. It allows applications to use any data source without having been customized for a specific one. Searching the Semantic Web is definitely easier, as machines can access knowledge and meaning about content, beyond natural language representations, and thus make inferences and deductions that can't be more than wild guesses when all you have is an algorithm processing natural language.

Stated that way, the Semantic Web seems like a good idea. However, it will only be useful when a significant number of Web applications publish data in semantic forms. Until then, Tim Berners-Lee's dream will remain a dream. There are some other difficulties as well, which I didn't discuss here.

In the next post, I'll talk about the main concepts and technologies behind the Semantic Web. That includes strong identifiers (URIs), graphs and triples (RDF) and Ontologies (OWL). I hope it will then become clear how this set of abstract concepts like meaning and knowledge can be interpreted by machines.

Monday, December 10, 2012

Developer account!

With Vishesh Handa's support, I got a KDE Developer Account! Using it, I pushed the commit that implements the refactoring described in the last post. "With great power comes great responsibility".

By the way, in January, when I'll have time, I plan to read some books on semantic technologies, like ontologies, RDF and OWL. Fortunately, many books on these subjects are available in my university's libraries. I'll go to the Faculty of Information Science library and choose two from:

1- OWL: representing information using the web ontology language
2- Programming the Semantic Web
3- Semantic Web Programming
4- Semantic Web: a guide to the future of XML, Web services and knowledge management

And who knows, this knowledge may prove useful outside Nepomuk as well. Although Google Trends shows a steady decrease in the number of searches for "Semantic Web" and related terms, Oracle, IBM and even Google have done work and research on the subject, with some positive results. Also, Semantic Web technologies don't need to be applied to the Web - Nepomuk proves it.