My personal blog
This is where I write about things that I find interesting! My topics of interest are Bioinformatics, Software Development and Machine Learning!
Who am I?
I am currently a Bioinformatics PhD student in Professor Andrew Martin’s lab, at UCL. I am also a Shogun.ML core developer, where I mainly optimise code, improve the API and do code reviews. In 2018/2019 I was a part-time software developer in the Research Engineering team at the Alan Turing Institute, where I worked on enabling reproducible machine learning pipelines with Shogun.
Things I work on
In a previous post I introduced the work that we have done at Shogun to develop a computational graph backend for our linalg module. The graph abstracts away all the computations and, when built/compiled, performs optimisations such as merging ops and allocations, after which it is ready to receive data. The problem with this approach is that the user is exposed to a two-step process, build and evaluate, rather than the single-step execution that most of us are used to from linear algebra libraries. The former method was largely popularised by TensorFlow 1.x (inspired by Theano), and the latter started appearing in later TensorFlow versions with eager execution, and now with JAX. I am not familiar with all the details, but from my understanding, eager execution forces each expression to be evaluated immediately. So, for example, in y = X.dot(w) + b I would get the value of y after executing the right-hand side expression. In JAX, however, y is a lazy expression and it will not be calculated until the very last moment. Execution would be triggered by serialisation, for example when doing something like print(y), or when looping through the values. The JAX approach can fully leverage the advantages of running graphs, because computation is deferred until after we have declared all the calculations we want to perform. In my opinion, this is the better tradeoff between user friendliness and efficiency.
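The difference between the two-step style and deferred execution can be sketched with a toy expression graph. This is only an illustration in plain Python: the names Node and const are hypothetical, scalars stand in for matrices, and nothing here reflects how Shogun, TensorFlow or JAX are actually implemented.

```python
class Node:
    """A node in a toy expression graph; nothing is computed at construction."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = inputs
        self.value = value
        self.evaluations = 0  # how many times this node was actually computed

    def __add__(self, other):
        return Node("add", (self, other))

    def __mul__(self, other):
        return Node("mul", (self, other))

    def evaluate(self):
        """Walk the graph and compute the result (the explicit evaluate step)."""
        if self.op == "const":
            return self.value
        self.evaluations += 1
        left, right = (n.evaluate() for n in self.inputs)
        return left + right if self.op == "add" else left * right

    def __repr__(self):
        # Printing forces evaluation, which mimics deferred execution.
        return repr(self.evaluate())


def const(value):
    return Node("const", value=value)


x, w, b = const(3.0), const(2.0), const(1.0)
y = x * w + b              # only builds the graph; no arithmetic has run yet
built_without_running = y.evaluations == 0
result = y.evaluate()      # two-step style: explicit evaluation
text = repr(y)             # deferred style: evaluation triggered by printing
```

In the two-step style the caller has to remember to call evaluate(), whereas in the deferred style the same graph is evaluated implicitly the first time its value is needed.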
At Shogun we try to come up with very efficient implementations of machine learning algorithms, such as SVMs and other kernel methods. The original approach was to write code with lots of calls to BLAS and LAPACK, sprinkled across the code base. These were then replaced with Eigen calls, which were eventually hidden behind the linalg module. However, things are changing fast in the machine learning world, both in terms of hardware and software. Nowadays, functionality such as automatic differentiation is frequently a requirement when working with kernels, see for example GPflow. In Shogun, however, we still rely on manual implementations of gradient calculations, which is both error-prone and time-consuming. In addition, most research setups have heterogeneous hardware, e.g. CPU+GPU/TPU or distributed clusters, and deciding at compile time where an algorithm will run is neither realistic nor practical. For example, if I want to run a DNN, I want to be able to choose at runtime where it runs, i.e. whether to use a GPU or a cluster. In Shogun, we currently have some GPU support through ViennaCL, but it is mostly outdated and requires us to keep track of an additional dependency. In conclusion, the requirements of machine learning and the available resources are very different now compared to when Shogun started in 1999!
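To illustrate why automatic differentiation is attractive for kernels, here is a minimal forward-mode sketch using dual numbers, compared against a hand-derived gradient of a scalar Gaussian kernel with respect to its width. The Dual class, gaussian_kernel and manual_gradient are hypothetical names made up for this example; this is not Shogun's or GPflow's API, and only the operations the example needs are implemented.

```python
import math


class Dual:
    """Number a + b*eps with eps**2 == 0; `deriv` carries the derivative."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(float(x))

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

    def __truediv__(self, other):
        other = Dual.lift(other)
        # Quotient rule.
        return Dual(self.value / other.value,
                    (self.deriv * other.value - self.value * other.deriv)
                    / other.value ** 2)

    def __rtruediv__(self, other):
        return Dual.lift(other) / self


def exp(x):
    x = Dual.lift(x)
    e = math.exp(x.value)
    return Dual(e, e * x.deriv)  # chain rule


def gaussian_kernel(x, y, width):
    """k(x, y) = exp(-(x - y)^2 / (2 * width^2)) for scalar x and y."""
    diff = x - y
    return exp(-(diff * diff) / (2.0 * width * width))


def manual_gradient(x, y, width):
    """Hand-derived dk/dwidth = k * (x - y)^2 / width^3."""
    k = math.exp(-((x - y) ** 2) / (2.0 * width ** 2))
    return k * (x - y) ** 2 / width ** 3


# Seeding the derivative of `width` with 1 differentiates with respect to it.
auto = gaussian_kernel(1.0, 2.5, Dual(0.8, 1.0))
manual = manual_gradient(1.0, 2.5, 0.8)
```

Here auto.deriv and manual agree up to floating-point error, but only manual_gradient has to be re-derived by hand whenever the kernel changes.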
I was recently working on optimising some C++ code that does antibody numbering (if you want to find out more about the topic, check out the web server for the original C code here). I did the usual analysis with Linux’s perf to find some hotspots. These included a call to std::log2, which I replaced with a low-precision version that I found here, and an argsort, which I rewrote using a heap data structure (more on this in a future post). After all this work, the executable spent around 27-28% of the time with calls to
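The excerpt does not say how the heap was used for the argsort, so the following is only a guess at the general idea, written in Python rather than C++ for brevity. The function names are hypothetical. A heap is particularly useful when only the k smallest entries are needed, since the rest of the array never has to be fully sorted.

```python
import heapq


def heap_argsort(values):
    """Return the indices that would sort `values` in ascending order."""
    heap = [(v, i) for i, v in enumerate(values)]
    heapq.heapify(heap)  # linear-time heap construction
    # Pop (value, index) pairs in ascending order; ties fall back to the index.
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


def heap_argsort_top_k(values, k):
    """Indices of the k smallest values, ordered by ascending value."""
    pairs = ((v, i) for i, v in enumerate(values))
    return [i for _, i in heapq.nsmallest(k, pairs)]


scores = [0.7, 0.1, 0.9, 0.4]
order = heap_argsort(scores)
smallest_two = heap_argsort_top_k(scores, 2)
```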
C++ is a statically typed language, meaning that at compile time the type of each variable is checked and the validity of operations on these variables is asserted. Python, on the other hand, is dynamically typed, and a variable can hold a value of any type.
This leads to an awkward syntax in Shogun, where the caller has to pick a getter that matches the type of the value being returned. In other words, if a class member is a float we need to use the get_real getter, if it is an int we would use
>>> import shogun as sg
>>> lr = sg.machine("LibLinearRegression")
>>> lr.get_int("max_iterations")
>>> # this will raise an error
>>> lr.get_real("max_iterations")
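The behaviour in the session above can be mimicked with a small stand-in class. ToyMachine and its methods are hypothetical and exist only for this sketch; they are not Shogun's implementation, and the untyped get method is an assumption about what a more Pythonic alternative could look like.

```python
class ToyMachine:
    """Hypothetical stand-in for an object exposing typed parameter getters."""

    def __init__(self, **params):
        self._params = dict(params)

    def _get_checked(self, name, expected_type):
        value = self._params[name]
        # Compare exact types, since bool is a subclass of int in Python.
        if type(value) is not expected_type:
            raise TypeError(
                f"parameter {name!r} holds {type(value).__name__}, "
                f"not {expected_type.__name__}"
            )
        return value

    def get_int(self, name):
        return self._get_checked(name, int)

    def get_real(self, name):
        return self._get_checked(name, float)

    def get(self, name):
        """Untyped getter: the caller does not need to know the type."""
        return self._params[name]


machine = ToyMachine(max_iterations=1000, epsilon=1e-5)
iterations = machine.get_int("max_iterations")
try:
    machine.get_real("max_iterations")  # wrong getter for an int parameter
    raised = False
except TypeError:
    raised = True
```

With the typed getters the caller must already know that max_iterations is an int, which is exactly the awkwardness described above; the untyped get simply returns whatever is stored.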
Shogun’s core library is written in C++, that is, everything from memory management and exception handling to the linear algebra framework that is required to write the machine learning algorithms. However, C++ is not the language of choice of data scientists, or even of machine learning engineers. This is despite the large effort that has been made in modern C++ to make manual memory management almost a thing of the past (use std::shared_ptr instead) and to have types deduced automatically, i.e. with auto. Most scientists do know how to write Python, statisticians in particular usually know R, most engineers prefer Java, and possibly C#, and languages such as Go are becoming more relevant. It is no coincidence that these languages, and more, are covered by SWIG.