Most of the courses I attended during my Bachelor were Java oriented, with some C#, C and Haskell programming. C++ was never really covered, and if it was, then only to illustrate theoretical concepts, for instance some C++ code used to describe a design pattern. Well, in the Software Engineering Project course we covered C++ pointers, references and how the stack and heap are managed, but again, just in theory. So you always hear about how hard it is to program in C++ and how awkward pointer management and memory allocation and deallocation are.
Now for the Master course "Data Mining and Data Warehousing" we really have to get in touch with C++. The task of my group is to develop a spectral clustering algorithm and to integrate it into a system called XVDM (developed by the DIS Centre at the Free University of Bolzano). We already presented the first prototype, where the different clusters are just printed on the command line. After starting with the real integration into the big system (XVDM) we faced quite a lot of difficulties, and that is when you really get a feeling for all this theoretical stuff. For instance, the integration seemed to work well on my colleague's Debian installation (running inside a virtual machine on top of Windows), but the SAME code produced a segmentation fault on my Ubuntu machine (installed natively).
Using "
valgrind" helped to identify the problem (thanks to Prof. Arturas Mazeika for this hint). It allows you to start your program similar as the following
valgrind ./SpectralClustering < ./data.csv
where "data.csv" contains the data to be clustered which is passed via redirection to our program. Valgrind then outputs the memory leaks and what may even be more important, the un-initialized - but used - variables which actually was the problem in our case. It turned out that our module integrated into the XVDM system did not initalize it in the right order meaning we had something like
init glut
create cluster objects
create DBScenery object
DBScenery->addAlgorithm(clustering)
initOpenGL()
(this is just some sort of pseudocode). The problem was that the addAlgorithm(...) method made use of variables which only got instantiated inside the initOpenGL() call, which of course produced a segmentation fault. That much seems clear, but how was the program then able to run on my colleague's machine while using a not yet instantiated variable?? My best guess is that reading uninitialized memory is undefined behaviour: whatever garbage happened to be in that memory on his machine apparently did not make the program crash, while on mine it did.
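To make this more concrete, here is a minimal, self-contained sketch of the same kind of bug. It is NOT the actual XVDM code: the Viewport member and the method bodies are made up for illustration, only the names addAlgorithm(...) and initOpenGL() mirror our situation.

#include <iostream>
#include <string>
#include <vector>

struct Algorithm {
    std::string name;
};

struct Viewport {
    int width;
    int height;
};

struct DBScenery {
    Viewport* viewport;                     // deliberately NOT initialized here
    std::vector<Algorithm*> algorithms;

    void initOpenGL() {
        viewport = new Viewport{800, 600};  // the real code would set up the GL state here
    }

    void addAlgorithm(Algorithm* a) {
        // Assumes initOpenGL() already ran; if not, "viewport" holds garbage and
        // dereferencing it is undefined behaviour (valgrind reports the use of an
        // uninitialised value here).
        std::cout << "registering " << a->name << " for a "
                  << viewport->width << "x" << viewport->height << " viewport\n";
        algorithms.push_back(a);
    }
};

int main() {
    Algorithm clustering{"spectral clustering"};
    DBScenery scenery;

    // Wrong order (our first attempt): may crash, may appear to work.
    // scenery.addAlgorithm(&clustering);
    // scenery.initOpenGL();

    // Correct order: initialize first, then register the algorithm.
    scenery.initOpenGL();
    scenery.addAlgorithm(&clustering);

    delete scenery.viewport;
    return 0;
}

Running the wrong-order variant under valgrind points straight at the read of the uninitialized pointer, which is exactly the kind of hint that led us to the fix.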
Anyway, the clustering integration works now as you can see :)
I suppose we now have to tune it so that it can process much larger datasets.
Questions? Thoughts? Hit me up on Twitter.