Saturday, January 26, 2013

Making the Semantic Web work


In the last post, we saw what the Semantic Web is and what its objectives are. But we relied on abstract concepts like "meaning" and "knowledge", and on hard-to-imagine scenarios like mining the Web and extracting meaningful information from it without tweaking applications to parse the specific HTML hierarchies of Web sites. Now, we're going to see how the Semantic Web vision is being implemented, and hopefully get a better grasp of what it will accomplish. By the end, you'll notice a strong similarity between the technologies and concepts presented here and two very recent advances in widely used search engines: Google's Knowledge Graph and Facebook's Graph Search, the latter announced but not yet released. Although we can't tell whether they indeed use RDF and ontologies internally (two concepts we'll see in a moment), they are surely using, at the very least, something very similar. Without further ado, let's get to it.

Triples


The most primitive piece of information in the Semantic Web is a triple, with the following form:

(subject, predicate, object)

Informal examples of triples would be (Ben, has spouse, Anne) and (Ben, has name, "David Benjamin"). Each triple adds some information to the thing that we call "Ben". Predicates are relationships between a thing and another thing or some primitive value. In the examples, the relation, or predicate, "has spouse" links Ben to Anne, and "has name" links Ben to his full name. We can think of a collection of many triples as a directed graph, where subjects and objects are vertices and triples are edges from the subject to the object having the predicate as a label.
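To make the graph view concrete, here is a minimal sketch in Python. It uses the informal names from the examples above (not real identifiers), and the build_graph helper is just an illustrative name: it groups triples by subject, so each entry is a labeled edge leaving that vertex.

```python
from collections import defaultdict

# Each triple is (subject, predicate, object). These are the informal
# names from the text, not canonical identifiers.
triples = [
    ("Ben", "has spouse", "Anne"),
    ("Ben", "has name", "David Benjamin"),
]


def build_graph(triples):
    """Return an adjacency map: subject -> list of (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)


graph = build_graph(triples)

# Walk the outgoing edges of the vertex "Ben".
for predicate, obj in graph["Ben"]:
    print(f"Ben --[{predicate}]--> {obj}")
```

Note that "Anne" only appears as the target of an edge here, since no triple has her as a subject yet.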



In this example, "Ben" is the internal name we gave to an entity. Obviously, that's not practical, for a reason we criticized in the last post: it's not canonical. If we want to share our triples or use triples coming from outside, how will others know our Ben is the same as their Ben? Conversely, how will we mine information about our Ben on the Web without confusing it with data related to other Bens? The solution proposed by the Semantic Web is to identify each node of that graph by a URI, preferably a URI that is also a URL. So, instead of "Ben", we could have "http://www.example.com/ben_kennedy", or a URL pointing to his Facebook/Google+ profile. We can use any URI, as long as it is certainly not being used to identify anything else. The recommendation goes even further: if it's a URL, it should point to a file that describes the entity identified by that URL (in RDF, a format we'll see later in this post). Generally, either this URI will be provided by Ben himself or it will be extracted from an external source. That's what makes it a canonical identifier for him.

So, your URI is your signature in the Semantic Web. In the future, it's possible that Web sites will ask for your URI when you create an account on them, so they can recognize you in semantic terms. One issue that may arise here is: what if two systems use different URIs to identify the same thing in the real world? Well, then, there are two cases: when you know the two URIs mean the same thing, and when you don't. In the first situation, there is an easy fix: in an ontology (we'll get there in a moment), you can explicitly state that the two URIs are synonyms, and then everything works. On the other hand, when you don't know that two URIs are in fact synonyms, there isn't much to do. But it is possible that someone out there knows the relation between the URIs and explicitly states that they are synonyms in their public data, and you can then use that information seamlessly.
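As a rough illustration of the "easy fix", OWL provides the owl:sameAs property for stating that two URIs denote the same thing. The sketch below is only one possible way to use such statements: it merges synonym URIs with a small union-find and rewrites every triple to a single representative. The example.com and example.org URIs and the hasSpouse/hasName predicates are made up for the example.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def canonicalize(triples):
    """Merge URIs linked by owl:sameAs and rewrite triples to one representative."""
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for s, p, o in triples:
        if p == OWL_SAME_AS:
            # Keep the lexicographically smaller URI so the result is deterministic.
            low, high = sorted((find(s), find(o)))
            parent[high] = low

    # Drop the sameAs statements themselves and rewrite everything else.
    return {(find(s), p, find(o)) for s, p, o in triples if p != OWL_SAME_AS}


# Hypothetical data: two systems describe the same person under different URIs.
triples = [
    ("http://www.example.com/ben_kennedy",
     "http://www.example.com/hasSpouse",
     "http://www.example.com/anne"),
    ("http://social.example.org/users/ben",
     "http://www.example.com/hasName",
     "David Benjamin"),
    ("http://social.example.org/users/ben",
     OWL_SAME_AS,
     "http://www.example.com/ben_kennedy"),
]

merged = canonicalize(triples)
for triple in sorted(merged):
    print(triple)
```

After merging, both facts hang off the same node, so a query about one URI also sees what was said about the other.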

Resource Description Framework: A standard for triples


Obviously, if you want to exchange triples with Web systems you don't know in advance, a common communication language is needed. That's exactly what RDF is: a specification for representing triples. An RDF file is, ultimately, a list of triples, so it represents a semantic graph (or a part of one). There are several file formats for serializing RDF: N-Triples, N3, Turtle, RDF/XML, and some others. More generally, systems store triples in some implementation-specific form and export them to the outside world as RDF files. The storage system for the triples may even be (and usually is) a traditional SQL database.
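To give an idea of what "exporting triples as RDF" can look like, here is a small sketch that writes triples in the N-Triples format, the simplest of the serializations listed above: one triple per line, URIs in angle brackets, literals in double quotes, and a final dot. It makes simplifying assumptions that a real serializer would not: anything starting with http:// or https:// is treated as a URI, everything else as a plain string literal, and language tags, datatypes and blank nodes are ignored. The example URIs are placeholders.

```python
def is_uri(term):
    # Simplifying assumption: only http(s) URIs; anything else is a literal.
    return term.startswith(("http://", "https://"))


def format_object(term):
    """Format the object of a triple as a URI or as an escaped string literal."""
    if is_uri(term):
        return f"<{term}>"
    escaped = (
        term.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples, one N-Triples line each."""
    lines = []
    for s, p, o in triples:
        if not (is_uri(s) and is_uri(p)):
            raise ValueError("subject and predicate must be URIs in this sketch")
        lines.append(f"<{s}> <{p}> {format_object(o)} .")
    return "\n".join(lines)


# Placeholder URIs, in the spirit of the example.com URI used earlier.
triples = [
    ("http://www.example.com/ben_kennedy",
     "http://www.example.com/hasSpouse",
     "http://www.example.com/anne"),
    ("http://www.example.com/ben_kennedy",
     "http://www.example.com/hasName",
     "David Benjamin"),
]

print(to_ntriples(triples))
```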

To find information in an RDF graph, one generally uses SPARQL, a standardized query language for RDF. SPARQL queries are written as graph patterns, which gives them a level of freedom and flexibility that is hard to match in SQL when the data is a web of relationships. For example, suppose you need to see a doctor, but would like a recommendation. One way to go is to query your favorite social network for friends of yours who have doctors as friends, ideally doctors who live in your home town. Leaving the home town restriction aside for brevity, this translates easily to a SPARQL query:

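# ex: is a placeholder namespace; ex:me stands for your own URI
PREFIX ex: <http://www.example.com/>

SELECT ?friend ?doctor ?doctorPhoneNumber
WHERE {
    ex:me ex:isFriendOf ?friend .
    ?friend ex:isFriendOf ?doctor .
    ?doctor a ex:Doctor .
    OPTIONAL { ?doctor ex:hasPhoneNumber ?doctorPhoneNumber . }
}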

This would show all friends of yours who know doctors, and the doctors' phone numbers for those who have one available in the database. Of course, "me" should be your URI, and the predicates isFriendOf and hasPhoneNumber and the class Doctor are not standard, but just a simple example. With some clever natural language processing, one could think of translating a user query to a SPARQL query to achieve what Facebook's Graph Search supposedly will do. Take a look at their examples and see how SPARQL is way more natural than SQL to find information. The example query above could easily come from a user query "doctors that are friends of my friends".
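To see what such a query does under the hood, here is a toy evaluator in Python. It is not how a real SPARQL engine is implemented, just a sketch of the idea: each pattern is matched against every triple, variables (the terms starting with "?") get bound along the way, and OPTIONAL keeps a result even when the extra pattern finds nothing. Short names like "me" and "alice" stand in for full URIs, and the data is invented.

```python
def match(pattern, triple, binding):
    """Unify one pattern with one triple; return the extended binding or None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was first bound to.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def solve(patterns, triples, binding=None):
    """Yield every binding that satisfies all patterns at once."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        extended = match(first, triple, binding)
        if extended is not None:
            yield from solve(rest, triples, extended)


def optional(bindings, patterns, triples):
    """Extend each binding with the optional patterns, keeping it if they fail."""
    for binding in bindings:
        extensions = list(solve(patterns, triples, binding))
        yield from extensions or [binding]


# Invented data; short names stand in for full URIs.
triples = [
    ("me", "isFriendOf", "alice"),
    ("me", "isFriendOf", "bob"),
    ("alice", "isFriendOf", "dr_carol"),
    ("bob", "isFriendOf", "dr_erin"),
    ("dr_carol", "a", "Doctor"),
    ("dr_erin", "a", "Doctor"),
    ("dr_carol", "hasPhoneNumber", "555-0100"),
]

required = [
    ("me", "isFriendOf", "?friend"),
    ("?friend", "isFriendOf", "?doctor"),
    ("?doctor", "a", "Doctor"),
]
phone = [("?doctor", "hasPhoneNumber", "?doctorPhoneNumber")]

rows = list(optional(solve(required, triples), phone, triples))
for row in rows:
    print(row)
```

The doctor without a phone number still shows up in the results, only without a binding for ?doctorPhoneNumber, which is the behavior OPTIONAL is meant to give.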

Ontologies: triples specified


The SPARQL query above for finding doctors your friends know has a simple yet critical problem: it rests on the fragile assumption that people in the graph are related using "isFriendOf" predicates, that there is a class of people called "Doctor", and that people are connected to their phone numbers using the predicate "hasPhoneNumber". While those names are reasonable for people, they don't tell machines much. Besides that, if you want to share information and use data from the Web without modifying your code for each data source, you'll run into the same standardization problem that would prevent you from knowing whether people on the Web are the same people in your database. What would prevent other sites from representing friendships with a predicate called "hasFriend"?

One way to solve such a problem is to have a public agreement that everyone representing people will use some standard. That's basically what ontologies are: specifications of how to represent and relate things in an RDF graph. One of the most widely used ontologies today is FOAF (Friend of a Friend), which describes people and their relationships. The idea is to have an RDF file describing you and your friends, relating you to things you do, to your interests, etc., independently of any social network. Of course, users are not supposed to write an RDF file manually, as there should be user-friendly ways of getting one (for example, some social networks support exporting public user data as FOAF files).
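As a sketch of what agreeing on a vocabulary buys you, suppose two sites used their own friendship predicates. Rewriting both to the shared FOAF term foaf:knows (the FOAF property closest to friendship) lets the same code handle data from either source. The site-a and site-b URIs below are hypothetical, and a hand-written mapping table like this is only one possible approach.

```python
FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"

# Hypothetical site-specific predicates, mapped to the shared FOAF term.
PREDICATE_MAP = {
    "http://site-a.example.com/isFriendOf": FOAF_KNOWS,
    "http://site-b.example.org/hasFriend": FOAF_KNOWS,
}


def normalize(triples):
    """Rewrite known site-specific predicates; leave everything else untouched."""
    return [(s, PREDICATE_MAP.get(p, p), o) for s, p, o in triples]


# Invented data coming from two different sources.
triples = [
    ("http://www.example.com/ben_kennedy",
     "http://site-a.example.com/isFriendOf",
     "http://www.example.com/anne"),
    ("http://www.example.com/anne",
     "http://site-b.example.org/hasFriend",
     "http://www.example.com/carol"),
]

normalized = normalize(triples)
for triple in normalized:
    print(triple)
```

If both sites had published FOAF in the first place, the mapping table would not be needed at all, which is the whole point of the public agreement.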

So, whenever you are about to represent people, in order to use and share information on the Web, you should use FOAF. What if you want to represent something about people that is outside FOAF's scope? For example, you could be developing a professional network Web site, like LinkedIn, and want to represent corporations and employment relations. Well, nothing prevents the simultaneous use of more than one ontology. Thus, for professional relations and companies, you can create your own ontology (or use an existing one) and use FOAF for people. Because ontologies use URIs as identifiers, you'll know that the same "thing" (in this case, a person) appears in two domains, but is still the same entity in the real world.
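Here is a small sketch of that idea: one dataset uses FOAF to give a person's name, another uses a made-up professional ontology to say where he works, and because both talk about the same URI, their facts can simply be collected together. The jobs.example.com ontology and the company URI are hypothetical; foaf:name is a real FOAF property.

```python
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
# Hypothetical property from a home-made professional ontology.
WORKS_FOR = "http://jobs.example.com/ontology#worksFor"

BEN = "http://www.example.com/ben_kennedy"

# Two independent datasets that happen to describe the same URI.
social_data = [(BEN, FOAF_NAME, "David Benjamin")]
professional_data = [(BEN, WORKS_FOR, "http://jobs.example.com/companies/acme")]


def describe(uri, *datasets):
    """Collect predicate -> set of values for one URI across several datasets."""
    description = {}
    for dataset in datasets:
        for s, p, o in dataset:
            if s == uri:
                description.setdefault(p, set()).add(o)
    return description


profile = describe(BEN, social_data, professional_data)
for predicate, values in profile.items():
    print(predicate, sorted(values))
```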

Now, if a team is developing a Semantic Web application, why would it refuse to conform to an existing ontology? Well, the ontology may be insufficient to meet the developers' specific needs. In this situation, the team will create its own ontology. Why not publish it, together with some data, so that others can benefit from it and also help you by publishing their data in a way you can understand?

Ontologies are generally represented in the Web Ontology Language (OWL). An OWL file is, in fact, also an RDF file. The OWL specification has enough vocabulary to model ontologies regardless of the domain they cover.
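Since an ontology is itself just triples, a tiny one can be written down the same way as data. The sketch below declares a hypothetical Doctor class as a subclass of foaf:Person, using the standard rdf:type, rdfs:subClassOf and owl:Class terms, and then applies one very simple inference rule: an instance of a class is also an instance of its superclasses. Real OWL reasoners do far more than this; the example.com URIs are placeholders.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"

# Hypothetical class and individual, for illustration only.
DOCTOR = "http://www.example.com/ontology#Doctor"
CAROL = "http://www.example.com/dr_carol"

ontology = [
    (DOCTOR, RDF_TYPE, OWL_CLASS),
    (DOCTOR, RDFS_SUBCLASS_OF, FOAF_PERSON),
]
data = [(CAROL, RDF_TYPE, DOCTOR)]


def infer_types(triples):
    """Add rdf:type triples implied by rdfs:subClassOf until nothing new appears."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        subclass_pairs = [(s, o) for s, p, o in inferred if p == RDFS_SUBCLASS_OF]
        for s, p, o in list(inferred):
            if p != RDF_TYPE:
                continue
            for sub, sup in subclass_pairs:
                new_triple = (s, RDF_TYPE, sup)
                if o == sub and new_triple not in inferred:
                    inferred.add(new_triple)
                    changed = True
    return inferred


result = infer_types(ontology + data)
print((CAROL, RDF_TYPE, FOAF_PERSON) in result)
```

With this in place, a query asking for people would also find doctors, even though nobody stated explicitly that Carol is a person.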

The Giant Global Graph


Today, the Internet contains a World Wide Web of documents that link to each other. In the vision of the Semantic Web, we will have a Giant Global Graph instead. This graph would have semantic information about everything that is now only written for humans in natural language. Can you imagine what we could do with that?

At a glance


The Semantic Web is not just an abstract vision of the future: the tools needed for it to become a reality are all available and being used. It depends a lot on data sharing and publishing, which is something that most companies today aren't willing to do. However, that doesn't ruin it all, as there's already a significant amount of semantic data available on the Web (Freebase is an example). Also, Google's Knowledge Graph and Facebook's Graph Search are clearly benefiting from Semantic Web concepts and ideas (and, who knows, standards and technologies as well), which shows that the Semantic Web, although still a dream, is by no means dead.
