Sunday, April 14, 2013

Google Summer of Code

Hey, there!

It has been quite some time since my last post here. I'm having very little "free" time, as my CS courses are demanding a lot this semester. Two cool programming assignments I'm currently working on are:
  • In the Operating Systems course, we have to make some non-trivial modifications to the Linux kernel source, in an area of our choice. I'm planning to do something related to file system monitoring, as this knowledge can be useful for the Nepomuk File Watch service. One specific thing I may do is implement recursive watches in inotify. Of course, it probably won't be production-ready after that; the most important contribution to Nepomuk will be a better understanding of the inner workings of inotify (and maybe the reason it doesn't support recursive watches will become clear).
  • In the Code Analysis and Optimization class, our first two programming assignments involve writing LLVM passes. This has nothing to do with Nepomuk, but it's still very cool to hack on a real-world compiler (a minimal pass skeleton is sketched right after this list).
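Just to give a flavor of it, here is a minimal function pass skeleton using the legacy pass manager API (the pass name, its registration string and the exact header paths are illustrative and vary between LLVM versions):

// A minimal LLVM function pass skeleton (legacy pass manager API).
// "CountBlocks" and the "countblocks" registration name are just examples.
#include "llvm/Pass.h"
#include "llvm/Function.h"            // "llvm/IR/Function.h" in newer LLVM versions
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

namespace {
struct CountBlocks : public FunctionPass {
    static char ID;                   // pass identification, used by LLVM's machinery
    CountBlocks() : FunctionPass(ID) {}

    // Called once per function; return true only if the IR was modified.
    virtual bool runOnFunction(Function &F) {
        errs() << F.getName() << " has " << F.size() << " basic blocks\n";
        return false;                 // purely an analysis: nothing was changed
    }
};
}

char CountBlocks::ID = 0;
static RegisterPass<CountBlocks> X("countblocks", "Count basic blocks per function");

A pass like this is typically compiled as a shared library and loaded into the opt tool with its -load option.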
Besides doing lots of programming assignments and training for programming competitions, I'm also writing my proposal for Google Summer of Code! I plan to take on the only Nepomuk-related idea on KDE's ideas page, which is to rewrite the Nepomuk Query Parser, basing the query syntax on a formal grammar, and to add support for brand-new features. If everything goes well, in the end you'll be able to write some very nice queries in KRunner, like:
  • "music yesterday" would return the musics you heard the day before
  • "(music or document) tagged classes" would search for files with the tag "Classes"
  • and others I haven't thought of yet.
Moreover, queries can also be made internally by other applications. That means users may see improvements even when they aren't searching directly through KRunner or Dolphin (although I don't know of a specific use case in which an application makes a query directly).

Also, the project includes adding auto-completion to queries, using both Nepomuk keywords and previous queries. That means if you always search for that beloved song or work on some documents on a daily basis, Nepomuk will remember that and get you to them quickly :)

That's it for now. I'll probably finish the first version of the proposal by the middle of this week and then get it reviewed. I hope I'll have some news soon!

Wednesday, February 27, 2013

Hello, Planet!


Hello, PlanetKDE

I'm Gabriel Poesia, a Brazilian Computer Science undergrad at UFMG who is passionate about algorithms (I compete in programming contests, like the ICPC) and free software. At the end of last year, I started contributing to Nepomuk-KDE. I created this blog primarily to describe my first steps in the world of free software and to talk about my work on Nepomuk. Both writing here and contributing to KDE have been very interesting experiences for me, and I hope this blog can be useful and motivating to others, too. KDE is great, but its community is even better. There are very talented people working on it, and getting involved is a valuable opportunity to learn a lot about real-world software development. The community is open and welcoming; that's easy to see while hanging out on IRC!

Until last month, I was mainly tackling minor issues in Nepomuk while getting to understand it better. Today, I pushed NepomukCtl, a tool for controlling Nepomuk services, similar to akonadictl. Besides that, I'm working on making the FileWatcher service support different, independent back-ends. I'll probably participate in this year's Google Summer of Code, too! None of this, of course, is solo work: help from fellow developers is always involved when working on KDE.

Nepomuk borrows many ideas and technologies from the Semantic Web (the name "Semantic Desktop" has a reason). Understanding the Semantic Web is very helpful when trying to grasp Nepomuk. So, I wrote a series of two posts about the Semantic Web. The first post (Oh, the Semantic Web... wait, what?) talks about its fundamental ideas, concepts and objectives. The second one (Making the Semantic Web work) goes a little deeper and shows how one would really implement those apparently abstract ideas. Many technologies cited there are indeed used in Nepomuk.

Well, that's it! I hope you enjoy my blog. You can find me on IRC, mainly in #kde-devel and #nepomuk-kde (my nickname is gpoesia there). 

Wednesday, February 20, 2013

More Nepomuk File Watcher backends


After attending an ICPC training camp here in Brazil, I'm back to work on Nepomuk. I'll now work on implementing support for more back-end options in the Nepomuk File Watcher service.

Nepomuk has a service called File Watcher, which monitors the file system, waiting for changes to files (content changed, file deleted, moved, renamed, etc.). When that happens, the changed file has to be reindexed so that searches use its up-to-date contents.

The Linux kernel has a subsystem called inotify that allows you to do that efficiently. You tell inotify which folders you want to monitor, and it notifies you when it spots an event you're interested in. Currently, Nepomuk uses inotify on Linux to watch for changes. But it has its limitations. For example, the number of watches you can create in a default installation is small, which may be a problem. Fortunately, there are some alternatives. KDE itself has a mechanism for doing that (KDirWatch), and Linux has the more recent fanotify. Each one has its advantages and disadvantages. What I'll be doing is making the File Watcher support these two additional back-ends and use any subset of the three simultaneously (each independently enabled or disabled by the user). With a lot of help from Vishesh, of course.
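To make that watch-and-notify loop concrete, here's a minimal sketch of the raw inotify pattern (the path and event mask are just examples, error handling is omitted, and this is not the File Watcher's actual code):

// Minimal inotify sketch (Linux): watch one directory and print incoming events.
#include <sys/inotify.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = inotify_init();                        // create an inotify instance
    int wd = inotify_add_watch(fd, "/home/user/Documents",
                               IN_CLOSE_WRITE | IN_CREATE | IN_DELETE | IN_MOVE);

    char buf[4096];
    ssize_t len = read(fd, buf, sizeof(buf));       // blocks until events arrive
    for (char *p = buf; p < buf + len; ) {
        struct inotify_event *ev = reinterpret_cast<struct inotify_event *>(p);
        if (ev->len > 0)                            // events on files carry the file name
            std::printf("event mask 0x%x on %s\n", (unsigned) ev->mask, ev->name);
        p += sizeof(struct inotify_event) + ev->len;
    }

    inotify_rm_watch(fd, wd);
    close(fd);
    return 0;
}

Note that each watch covers a single directory, not its whole subtree, so watching a deep hierarchy means one watch per subdirectory, which is exactly why the default limit on the number of watches can become a problem.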

That's it. Time to learn and code!

Saturday, January 26, 2013

Making the Semantic Web work


In the last post, we saw what the Semantic Web is and what its objectives are. But we used abstract concepts like "meaning" and "knowledge", and hard-to-imagine scenarios like the ability to mine the Web and extract meaningful information from it without tweaking applications to parse the specific HTML hierarchies of Web sites. Now, we're going to see how the Semantic Web vision is being implemented, and hopefully get a better grasp of what it will accomplish. In the end, you'll notice a strong similarity between the technologies and concepts presented here and two very recent advances in widely used search engines: Google's Knowledge Graph and Facebook's Graph Search (announced but not yet released). Although we can't tell whether they are indeed using RDF and ontologies - two concepts we'll see in a moment - internally, they are surely using, at the very least, something very similar to them. Without further ado, let's get to it.

Triples


The most primitive piece of information in the Semantic Web is a triple, with the following form:

(subject, predicate, object)

Informal examples of triples would be (Ben, has spouse, Anne) and (Ben, has name, "David Benjamin"). Each triple adds some information to the thing we call "Ben". Predicates are relationships between a thing and another thing or some primitive value. In the examples, the relation, or predicate, "has spouse" links Ben to Anne, and "has name" links Ben to his full name. We can think of a collection of many triples as a directed graph, where subjects and objects are vertices and triples are edges from the subject to the object, labeled with the predicate.
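To make the graph view concrete, here's a toy sketch (the names and structure are mine, not any RDF library's API) of a few triples stored as labeled edges and queried by subject:

// A toy triple store: triples as labeled edges of a directed graph.
// All names here are illustrative, not part of any RDF library.
#include <iostream>
#include <string>
#include <vector>

struct Triple {
    std::string subject, predicate, object;
};

int main() {
    std::vector<Triple> graph = {
        {"Ben", "has spouse", "Anne"},
        {"Ben", "has name", "David Benjamin"},
        {"Anne", "has name", "Anne Kennedy"},
    };

    // "Which triples have Ben as subject?" -- i.e., Ben's outgoing edges.
    for (const Triple &t : graph)
        if (t.subject == "Ben")
            std::cout << t.subject << " -- " << t.predicate
                      << " --> " << t.object << '\n';
    return 0;
}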



In this example, "Ben" is the internal name we gave to an entity. Obviously, that's not practical, for a reason we
criticized in the last post: it's not canonical. If we want to share our triples or use triples coming from outside, how will others know our Ben is the same as their Ben? Conversely how will we mine information about our Ben on the Web and not confuse it with data related to other Bens? The solution proposed by the Semantic Web is that of having each node of that graph identified by an URI, and preferably an URI that is also an URL. So, instead of "Ben", we could have "http://www.example.com/ben_kennedy", or an URL pointing to his Facebook/Google+ profile. We can use any URI that's certainly not being used to identify others. The recommendation goes even further: if it's an URL, then it'd better point to a file that describes the entity identified by that URL (in RDF, a format we'll see later in this post). Generally, either this URI will be provided by John himself or it will be extracted from an outer source. That's what asserts it's in fact a canonical identifier for him.

So, your URI is your signature in the Semantic Web. In the future, it's possible that Web sites will ask for your URI when you create an account, so they can recognize you in semantic terms. One issue that may arise here is: what if two systems use different URIs to identify the same thing in the real world? Well, there are two cases: when you know the two URIs mean the same thing, and when you don't. In the first situation, there is an easy fix: in an ontology (we'll get there in a moment), you can explicitly state that the two are synonyms, and then everything works. On the other hand, when you don't know that two URIs are in fact synonyms, there isn't much to do. But it is possible that someone out there knows the relation between the URIs and explicitly states that they are synonyms in their public data, and you can use that information seamlessly.

Resource Description Framework: A standard for triples


Obviously, if you want to exchange triples with Web systems you don't know in advance, a common communication language is needed. That's exactly what RDF is: a specification for representing triples. An RDF file is, in the final analysis, a list of triples, which makes it represent a semantic graph (or part of one). There are several file formats for serializing RDF: N-Triples, N3, Turtle, RDF/XML, and some others. More generally, systems store triples in some implementation-specific form and export them to the outside world as RDF files. The storage system for the triples may even be (and usually is) a traditional SQL database.
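Just to illustrate how simple such a serialization can be, here's a small sketch that prints triples in the N-Triples format: one triple per line, URIs in angle brackets, literals quoted, each line ending with a dot (the URIs are invented for the example):

// Sketch: emitting triples as N-Triples lines ("<s> <p> <o> ." or "<s> <p> "literal" .").
// The URIs below are invented for the example.
#include <iostream>
#include <string>

// Wrap a URI in angle brackets, as N-Triples requires.
std::string uri(const std::string &u) { return "<" + u + ">"; }

// Quote a literal value.
std::string literal(const std::string &v) { return "\"" + v + "\""; }

int main() {
    const std::string ben       = "http://www.example.com/ben_kennedy";
    const std::string anne      = "http://www.example.com/anne_kennedy";
    const std::string hasSpouse = "http://www.example.com/vocab/hasSpouse";
    const std::string hasName   = "http://www.example.com/vocab/hasName";

    std::cout << uri(ben) << " " << uri(hasSpouse) << " " << uri(anne) << " .\n";
    std::cout << uri(ben) << " " << uri(hasName) << " "
              << literal("David Benjamin") << " .\n";
    return 0;
}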

To find information in an RDF graph, one generally uses SPARQL, a standardized query language for RDF. SPARQL queries offer a level of freedom and flexibility that SQL doesn't. For example, suppose you need to see a doctor, but would like a recommendation. One way to go is to query your favorite social network for friends of yours who have doctors as friends, where those doctors live in your home town. This translates easily into a SPARQL query:

SELECT ?friend ?doctor ?doctorPhoneNumber
WHERE {
    me isFriendOf ?friend .
    ?friend isFriendOf ?doctor .
    ?doctor a Doctor .
    OPTIONAL { ?doctor hasPhoneNumber ?doctorPhoneNumber . }
}

This would show all friends of yours who know doctors, along with the doctors' phone numbers for those who have one available in the database. Of course, "me" should be your URI, and the predicates isFriendOf and hasPhoneNumber and the class Doctor are not standard; they're just a simple example. With some clever natural language processing, one could imagine translating a user query into a SPARQL query to achieve what Facebook's Graph Search supposedly will do. Take a look at their examples and see how SPARQL is far more natural than SQL for finding information. The example query above could easily come from a user query like "doctors that are friends of my friends".

Ontologies: triples specified


The SPARQL query presented above for finding doctors your friends know has a simple yet critical problem: it rests on the weak assumption that people in the graph are related using "isFriendOf" predicates, that there is a class of people called "Doctor", and that people are connected to their phone numbers using the predicate "hasPhoneNumber". While those names are reasonable for people, they don't tell machines much. Besides that, if you want to share information and use data from the Web without modifying your code for each data source, you'll probably suffer from the same standardization problem that would prevent you from knowing that people on the Web are the same people in your database. What would prevent other sites from representing friendships with a predicate called "hasFriend"?

One way to solve such a problem would be to have a public agreement that everyone representing people will use some standard. That's basically what ontologies are: specifications of how to represent and relate things in an RDF graph. Nowadays, the most widely used ontology is FOAF (Friend of a Friend), an ontology for describing people and friendships. The idea is to have an RDF file describing you and your friends, relating you to things you do, to your interests, etc., independently of any social network. Of course, users are not supposed to write an RDF file manually; there should be user-friendly ways of getting one (for example, some social networks support exporting public user data as FOAF files).

So, whenever you are about to represent people, in order to use and share information on the Web, you should use FOAF. What if you want to represent something about people that is outside FOAF's scope? For example, you could be developing a professional network Web site, like LinkedIn, and want to represent corporations and employment relations. Well, nothing prevents the simultaneous use of more than one ontology. Thus, for professional relations and companies, you can create your own ontology (or use an existing one) and use FOAF for people. Because ontologies use URIs as identifiers, you'll know that the same "thing" (in this case, a person) appears in two domains but is still the same entity in the real world.
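As a tiny illustration of mixing vocabularies, here's a sketch describing the same person with FOAF terms plus a hypothetical professional-network ontology (the FOAF namespace is the real one; the "pro" namespace and the resource URIs are made up):

// Sketch: describing the same person with two vocabularies at once.
// FOAF's namespace is real; the "pro" ontology below is hypothetical.
#include <iostream>
#include <string>

const std::string FOAF = "http://xmlns.com/foaf/0.1/";   // people and their relationships
const std::string PRO  = "http://www.example.com/pro#";  // hypothetical work ontology

// Build an angle-bracketed term from a namespace and a local name.
std::string term(const std::string &ns, const std::string &local) {
    return "<" + ns + local + ">";
}

int main() {
    const std::string ben = "<http://www.example.com/ben_kennedy>";

    // FOAF covers the person and who he knows...
    std::cout << ben << " " << term(FOAF, "name") << " \"David Benjamin\" .\n";
    std::cout << ben << " " << term(FOAF, "knows")
              << " <http://www.example.com/anne_kennedy> .\n";

    // ...while the custom ontology covers what FOAF doesn't: employment.
    std::cout << ben << " " << term(PRO, "worksFor")
              << " <http://www.example.com/acme_inc> .\n";
    return 0;
}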

Now, if a team is developing a Semantic Web application, why would it refuse to conform to an ontology? Well, it may be that no existing ontology meets the developers' specific needs. In this situation, the team will create its own ontology. Why not publish it, together with some data, so that others can benefit from it and also help you by publishing their data in a way you can understand?

Ontologies are generally represented in the Web Ontology Language (OWL). An OWL file is, in fact, also an RDF file. The OWL specification has enough vocabulary to model ontologies regardless of the domain they are meant to cover.

The Giant Global Graph


Today, the Internet contains a World Wide Web of documents that link to each other. In the vision of the Semantic Web, we will have a Giant Global Graph instead. This graph would have semantic information about everything that is now only written for humans in natural language. Can you imagine what we could do with that?

In a glance


The Semantic Web is not just an abstract vision of the future: the tools needed for it to become a reality are all available and being used. It depends a lot on data sharing and publishing, which is something most companies today aren't willing to do. However, that doesn't ruin it all, as there's already a significant amount of semantic data available on the Web (Freebase is an example). Also, Google's Knowledge Graph and Facebook's Graph Search are clearly benefiting from Semantic Web concepts and ideas (and, who knows, maybe standards and technologies as well), which shows that the Semantic Web, although still a dream, is by no means dead.

Friday, January 11, 2013

Oh, the Semantic Web... wait, what?


In the last decade, a new term popped up on the Web, apparently related to the future of the Internet: the Semantic Web. However, it's still difficult to grasp this concept without going deeper than we generally have time to go. That opened up space for many myths and misconceptions regarding what exactly the Semantic Web is, whether it is indeed a good idea, whether it's feasible, and whether one should really care about it.

As I already mentioned here on the blog, Nepomuk aims to offer the user a Semantic Desktop, and it is closely related to the Semantic Web. Understanding the Semantic Web turned out to be of paramount importance for me to understand Nepomuk better, and thus to be more useful to the project. So, I resorted to the literature on the topic, and here I'll tell you what I found.

First of all, let's review and demystify some phrases that you've very likely heard before about the Semantic Web:

1- "The Semantic Web will more than certainly be the standard within X years!"

Many say the Semantic Web and the technologies related to it will reach an idyllic moment of great popularity someday. That's possible, as the solutions the Semantic Web tries to offer are widely applicable. But nowadays the fraction of the Web that is semantically enriched is minimal. The Semantic Web is still a vision, far from being a reality.

2- "The Semantic Web solves all the problems of the human race! You should start RDF and OWL in all your projects now!"

The applicability of Semantic Web concepts is often exaggerated, especially when the pitch is aimed at executives and managers. The Semantic Web has interesting solutions to many problems, but not all, and it also has its own drawbacks.

3- "Semantic Web works like magic: just use it and your application will become intelligent in ways you could never imagine!"

Like everything in Computer Science, modelling a good solution to a problem is always a lot more important than using good tools. The best tool is useless if used in the wrong way. The Semantic Web is a great tool to do a lot of things, but it can't solve a problem if it plays the wrong role in the solution plan.

Putting all this aside, let's point out some issues we face with the current Web architecture.

Data sharing

In general, Web sites are means of dealing with data. And, obviously, many Web sites deal with very similar types of data. For example, social networks all deal with people, who have names, email addresses, pictures, and so on. But that is the user's perspective. From the machines' point of view, a person in one social network is nothing like a person in another one. Applications (like Social Plus) that try to integrate different social networks have to treat each one separately and specially, as there is no canonical definition of what a "Person" is and of how "Person" entities can relate to each other.

Fine, that sounds like a good thing when you don't want to share the data you have. Surely, social networks want users to use them as exclusively as possible, and also to own the data left in them (friendships, for example). But even in the sphere of social networks, it isn't necessarily good to keep all your data private. While that means others are not going to access information you own, it also means that you can't make much use of the data available outside. For example, it's not possible for Facebook to infer that one of its users is also the author of a blog or book, or a contributor to an open source project, because "Person" means different things to Facebook, your blog, GitHub, and every other Web site. That is, Facebook can download this page and read that "Gabriel" wrote this post. But how can it tell for sure that this is also one of its users? In the Semantic Web, entities have strong identifiers, which allow you to recognize them in any context. Facebook would surely benefit from knowing where else its users go. If it knows you are a GitHub user, for example, it could show ads or suggest pages and groups related to software development even if you never mentioned anything about it.

Showing yourself to the world

In other situations, you want to make your data as accessible as possible. A simple example would be online stores. If you own a Web site in that category, you want everyone out there to know you are selling your products, the prices you offer, and the locations you ship to. But you can't do much more than offer a graphical interface for users to browse and buy your products and hope search engines index your site well. The ideal situation would be if you could yell in a universal language (that machines, not only humans, can understand): "Hey! We have the latest and coolest tablets in stock! And we have a physical store in your city, 5 minutes from your home!"

Interoperability in an ecosystem of applications

Of course, sometimes you really don't want the data you have to leave your playground. But even in those cases, you don't want information confined to one particular application if other software you control could benefit from it. Take Google, for example: they run a social network and an email service, and the two systems deal with people. Wouldn't they want Gmail and Google+ to communicate with each other, exchanging information about people in a standardized way that both (and possibly other existing and future Google services) can understand? That's also Nepomuk's case: you don't want KAddressBook to publish your contact list on the Web, but it would be awesome if other applications, like KMail, could know what a Contact is and use the knowledge that you know some people who have email addresses.

Semantic Data

The problem is already clear enough. The Semantic Web approaches it by introducing means of serializing data in uniform ways, sharing vocabularies and meanings and representing knowledge, not only raw data.

Natural language is awesome for humans, but terrible for computers. We can read and understand (almost) anything other people write. But the same is not true for machines. Surely, search engines process and do a lot with pure natural language documents. You can answer many questions by doing a simple Google search and looking at the first two or three results. But that's not processing semantics: the search results are the product of lots of guesses based on keywords, popularity, user behavior, and so on, that turn out to work. That task of answering questions by searching the Web would be much simpler and more effective in the Semantic Web. Semantic data is like natural language for computers: a program that can process semantic information can use virtually any source of semantic data without needing to be tweaked for it. Of course, that doesn't mean any program can do useful things with any data. The same happens with us: I can read and understand an article about pygmy marmosets, but that doesn't mean I can do much with it, or that it will interest me. On the other hand, a biologist could possibly see a research topic in it.

Nowadays, semantics generally live in applications, not in data. Relational databases store dumb data; applications (tailored to that specific database schema) use the data intelligently and show results to the user (in people's language). If you give an application an arbitrary relational database it was not designed to use, it won't be able to do anything with the data. Of course, a human reading the table and field names could certainly tell what the database is about and know how to reason about it. But machines can't capture that meaning on their own. Using Semantic Web standards, data is made available together with a system of meanings and knowledge about its domain, which makes it easily readable by machines, regardless of details like table and field names, which are designed for humans, after all.

In a glance

The Semantic Web is about publishing and using data on the Web in standard forms that embed precise definitions of meaning for machines. It makes it possible to have canonical identifiers for entities like people, books or products across the whole Web, creating an unambiguous vocabulary for machines to communicate with each other. It allows applications to use any data source without having been customized for a specific one. Searching in the Semantic Web is definitely easier, as machines can access knowledge and meaning about content, beyond natural language representations, and thus make inferences and deductions that can't be more than wild guesses when all you have is an algorithm processing natural language.

Stated that way, the Semantic Web seems like a good idea. However, it will only be useful when a significant number of Web applications publish data in semantic forms. Before that, Tim Berners-Lee's dream will remain a dream. There are some other difficulties as well, which I didn't discuss here.

In the next post, I'll talk about the main concepts and technologies behind the Semantic Web. That includes strong identifiers (URIs), graphs and triples (RDF) and Ontologies (OWL). I hope it will then become clear how this set of abstract concepts like meaning and knowledge can be interpreted by machines.

Wednesday, December 19, 2012

Qt 5 released!

The wait is over: Qt 5 is now officially the latest Qt version. And the guys at Qt Studios made a very "qt" video about it.

KDE is a giant ecosystem built on and heavily reliant on Qt. Many Qt developers are active KDE contributors, and many KDE contributors have gone on to work with Qt for a living (there's KDAB, for example). And, of course, a new Qt release affects KDE a lot. KDE's major version is the same as Qt's. We're in KDE 4 because it uses Qt 4. KDE 5 (which will still take some time) will need a whole port to Qt 5.

This may cause some fear and uncertainty for the KDE users who were there in the early stages of KDE 4. KDE 3.5 was amazingly stable. When Qt 4 was released, the obvious path of porting KDE to Qt 4 was followed. However, considering KDE's size and the many incompatibilities between Qt 3 and 4, that was terribly hard, and KDE suffered from it for a long time. Many bugs originating from the port survived until KDE 4.4 or 4.5, which was not long ago. A lot of users migrated to other desktop environments at the time, keeping a horrible image of KDE in mind.

Nevertheless, everything promises a much less painful port this time. From the Qt 5 Docs:

"Qt 5 is highly compatible with Qt 4. It is possible for developers of Qt 4 applications to seamlessly move on to Qt 5 with their current functionality and gradually develop new things leveraging all the great items Qt 5 makes possible."

And that doesn't look like just propaganda. The list of source-incompatible changes seems reasonable, and KDE Frameworks 5.0 is being worked on. This is a time when unit tests are most helpful. As of this year, Nepomuk has its own test framework. As Nepomuk is split into various services (File Watcher, File Indexer, Storage) that run in different processes, it's quite hard to test some changes in it. And what can be said about testing a whole port to Qt 5 without automated tests?

By the way, in Nepomuk I've been working on some minor tasks over the last two weeks. Briefly, I investigated an optimization in RegExpCache, a class that mainly checks whether a string matches any of a list of regular expressions. It turns out Qt's QRegExp class is implemented as a nondeterministic finite automaton, so matching is not done in linear time. Thus, the straightforward attempt of building a single regexp that is the union of many regexps didn't work. Now, I'm trying to fix another bug, related to the File Watcher. This has taught me a lot, and another post will probably follow just about that one :)
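For the curious, here's a rough sketch of the kind of check involved (my own simplification, not the actual RegExpCache code): matching a string against a list of QRegExp wildcard patterns, one by one.

// Rough sketch of a RegExpCache-style filter (a simplification, not Nepomuk's code):
// a string matches if it matches any of a list of wildcard patterns.
#include <QRegExp>
#include <QString>
#include <QStringList>
#include <QList>
#include <QDebug>

class SimpleRegExpCache {
public:
    explicit SimpleRegExpCache(const QStringList &wildcardPatterns) {
        foreach (const QString &pattern, wildcardPatterns)
            m_regexps.append(QRegExp(pattern, Qt::CaseSensitive, QRegExp::Wildcard));
    }

    // Linear scan over the compiled patterns; each exactMatch() may backtrack,
    // which is why simply OR-ing all patterns into one QRegExp didn't buy a speedup.
    bool exactMatch(const QString &s) const {
        foreach (const QRegExp &re, m_regexps)
            if (re.exactMatch(s))
                return true;
        return false;
    }

private:
    QList<QRegExp> m_regexps;
};

int main() {
    SimpleRegExpCache cache(QStringList() << "*.o" << "*~" << "*.part");
    qDebug() << cache.exactMatch("report.odt");   // false
    qDebug() << cache.exactMatch("core.o");       // true
    return 0;
}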

Monday, December 10, 2012

Developer account!

With Vishesh Handa's support, I got a KDE Developer Account! Using it, I pushed the commit that implements the refactoring described in the last post. "With great power comes great responsibility".

By the way, in January, when I'll have time, I plan to read some books on semantic technologies, like ontologies, RDF and OWL. Fortunately, many books on these subjects are available in my university's libraries. I'll go to the Faculty of Information Science library and choose two from:

1- OWL: representing information using the web ontology language
2- Programming the Semantic Web
3- Semantic Web Programming
4- Semantic Web: a guide to the future of XML, Web services and knowledge management

Who knows, that knowledge may also be useful outside Nepomuk. Although Google Trends shows a steady decrease in the number of searches for "Semantic Web" and related terms, Oracle, IBM and even Google have done work and research on the subject, with some positive results. Also, Semantic Web technologies don't need to be applied to the Web - Nepomuk proves it.

Saturday, December 8, 2012

Next step - refactoring

The next task I'll work on is a simple refactoring one, just like the first. I'll have to wait for the end of this semester to start working harder on Nepomuk, but meanwhile I can still do something useful. While the task itself doesn't warrant much discussion, the fact that it is a refactoring task deserves some attention.

Refactoring is not something an undergraduate student would typically care about. Why rewrite functionality that already works when you'll submit that programming assignment and never look at it again? This line of thought is perfectly applicable to projects of that nature - few lines of code, short life cycle, no prospect of growth.

However, when a code base starts to be measured in tens or hundreds of kilo lines of code (KLOCs), of which a good part was written years before, the situation changes drastically. Federico Mena Quintero, known for co-founding the GNOME project, describes a state of badly written software that refactoring practices try to prevent:

"When I was learning how to program, I noticed that I frequently hit the same problem over and over
again:  I would often write a program that worked reasonably well, and even had a good structure, but
after some time of modifying it and enhancing it, I could no longer tweak it any further.  Either its
complexity would overwhelm me, or it would be so tightly written that it allowed no room for
expansion, like a house where you cannot build up because it has a sloping roof, and you cannot build
to the sides because it has a wall all around it." [1]

This situation is perfectly conceivable. I have experienced it - like probably all programmers who started personal projects while learning to program. Continuous refactoring is a must to avoid that. I think this is even more true when the subject is a free software project. In an enterprise environment, people are paid to work, and avoiding that project with a messy code base and flawed architecture may simply not be an option. In the free software world, on the other hand, people get in and out of projects every day and as they wish - free software, free people. It's difficult for a newcomer to engage in a project where the simplest change can break everything, and where parts that were meant to be simple are in fact hard to understand and tweak. And a free software project that fails to attract new blood is unlikely to prosper. KDE has a lot of initiatives that help increase its bus factor (like the Techbase, Season of KDE, strong participation in Google Summer of Code, and mentoring programs), and it seems to work well. At least for me.

That's it for now! While refactoring is surely good for the health of the project, it will also help me learn more about Nepomuk. The two patches I wrote were against the kde-runtime repository, where the configuration manager lives, and this one will be the first inside nepomuk-core. Time to work!

[1]: Software that has Quality Without a Name (which is part of a great book, Open Advice - What We Wish We Had Known When We Started, available for free)

Wednesday, December 5, 2012

Starting to contribute

Last week, I made my first effective contribution to a free software project (Nepomuk-KDE). Although the commit message doesn't show it, the change was very simple: just a small refactoring. But before getting to that, I'd like to say a little about how KDE welcomes new developers.

First, there is the KDE Developers Guide, a concise e-book summarizing what you need to do to begin. I chose a cool project and tried to compile it, cloning Nepomuk's git repository and building it as usual (cmake + make). After solving some dependencies (I needed to install Soprano's development packages), I managed to build it. But then I wondered whether I would have to compile a whole KDE environment to develop and test it, because of the nature of the project. The KDE Guide tells you to ask the developers directly in this situation, and so I did. I went to the project's IRC channel and said I wanted to contribute. After a while, I got an answer from jEhrichs. And no, I didn't need to compile all of KDE :)

After a couple of days, I talked to Vishesh Handa (vHanda), one of the main Nepomuk developers, and he pointed me to how I should build Nepomuk - inside a build environment, in which I would install it so it doesn't interfere with my system's installation. The process is thoroughly explained in the Techbase. After that, I was ready.

Also, there's KDevelop. It seems to be a very neat IDE that integrates with many tools common in KDE development (Git, CMake, Make, even Review Board!). One nice feature is creating a project from a CMakeLists.txt file. The auto-completion works wonders, too. I haven't tried to integrate it with my build environment yet, so I'm currently compiling from a terminal and only using KDevelop to edit the source code. It's hard to beat the command line in practicality, so I'm not very tempted to do everything from the IDE, but I'll give it a try later on.

If all that documentation and kindness from people on IRC is not enough, KDE's Bugzilla has a tag called "junior-jobs", which is used to mark considerably easier tasks - those that are appropriate for a beginner to tackle. vHanda pointed me to this task. It was very simple indeed. I provided the patch, and it was accepted and pushed! Hurray!

The next issue I tackled was adding a button to make Nepomuk check for new and changed files. KDE has usability standards, and I've already learned a few things from them. Choosing where to put a button was harder than I imagined, especially because programmers rarely worry much about such decisions. Just by reading the Techbase and discussing my patch in the code review, besides learning some things, I realized how bad many of the user interfaces I had made before were. There are some simple tips, like not using an annoying dialog to tell the user something not that important and forcing them to click OK, that I had always missed... Usability is one more thing KDE will teach me :)

However, this last patch won't be pushed soon, because the KDE release schedule has already closed the period in which new strings can be added. That's fair, given that the UI needs to be translated and there isn't much time left to do that before the next release, KDE SC 4.10. So, it will have to wait until 4.11.

But before that there's plenty to do! There are new issues and junior jobs in the Bugzilla waiting for patches. Doing them is a concrete way to get more used to Nepomuk's architecture, source code and tools. As soon as I get the time, I'll code some!

Summarizing: it wasn't hard to start contributing. There is work for newcomers, and the community is open, just like the source code :)

Friday, November 30, 2012

Hello world!

Hi there! I'm Gabriel Poesia, a Brazilian Computer Science undergraduate student beginning a journey in free software while competing in programming contests. This is the first time I've posted to a blog. I'm starting this one to:

- Describe my first steps in contributing to free software, namely KDE and Nepomuk,
- Write about interesting computational problems and algorithms, mainly those related to programming contests, like ICPC and TopCoder,
- Write more! I rarely write in English when not programming, and I think it's a good opportunity to practice it.

For these three reasons, I'm sure this blog will be useful to me. I'd be happy if anyone else could benefit from it. I'll try to write here twice a week.

So, where did this idea come from? I thought it was already time to give something back to the free software community. We all use lots of free software, directly or indirectly. Just to start, even if the browser you're using to read this post isn't free software, you probably found this page using Google, which internally uses a lot of FOSS.

If you try to measure how your life is affected by the free software movement as a whole, you rapidly realize it's impossible to contribute back in the same amount. In spite of that, we can do something (little, but still something) to help. 

Gratitude is already a nice reason to contribute, but there are tons more. For instance, it's an invaluable opportunity to learn about API design, large software release cycles, and real-world technologies, and also, as an ICPC contestant, to apply non-trivial algorithms to real problems. Additionally, you get to know very experienced and knowledgeable developers from all over the world, and make something that impacts many users.

Nepomuk, the project I'm starting to get involved in, has all of that. It's a framework that intends to provide the whole KDE desktop with metadata and a powerful search that goes way beyond file names. That means it interacts with many other projects, and thus needs to offer a very well designed API. Also, given the amount of information we have on our hard disks, it needs to handle large sets of information efficiently, and as seamlessly as possible from the user's point of view, since you don't want a sluggish browser just because something is indexing your files.

If that's not enough, it also uses technologies that will probably be everywhere in the future, like ontologies. If you've heard of the Semantic Web before, then you have a pretty good idea of the philosophy behind Nepomuk, just targeting the desktop. The Semantic Web is based on ontologies and formats like RDF. And so is Nepomuk.

That seems enough for a first post. In the next ones, I want to talk a little more about my first steps in Nepomuk. This week, I managed to get a commit pushed for the first time. I'd find it very nice if more "first commits" from other people appeared after this one :)