Presented at Spark Summit 2015, with a focus on Spark performance.
The Spark community has a lot of experience using Spark for offline batch analysis tasks coming from a broad range of use cases. But creating an interactive web application that aims for sub-second response times using Spark as the computation backend is still somewhat unexplored territory. We wandered into this territory when we built LynxKite, our big graph analysis tool. The tool enables users to interactively explore graphs with billions of vertices and edges. Exploration includes global and local views of the graph, featuring visualization of attributes, connections and distributions.
This talk is about the technical challenges – general and domain specific – we faced while building this software, and about our solutions. We will talk about problems like scheduler delay, GC pauses, and interoperability with other Akka based libraries, and solutions like sorted RDDs, prefix sampling, and column based attribute representation.
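To give a taste of one of these solutions: the point of keeping RDDs sorted by key is that two co-partitioned, sorted partitions can be joined in a single linear merge pass, with no hash tables and no shuffle. Below is a plain-Scala sketch of that merge (not LynxKite's actual implementation), assuming unique keys on each side.

```scala
// Merge-join of two key-sorted sequences, standing in for two
// co-partitioned sorted RDD partitions. One linear pass, no hashing.
// Assumes keys are unique within each side.
def mergeJoin[K: Ordering, A, B](
    left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] = {
  val ord = implicitly[Ordering[K]]
  val result = Seq.newBuilder[(K, (A, B))]
  var i = 0
  var j = 0
  while (i < left.size && j < right.size) {
    val c = ord.compare(left(i)._1, right(j)._1)
    if (c < 0) i += 1          // left key too small: advance left
    else if (c > 0) j += 1     // right key too small: advance right
    else {                     // matching keys: emit a joined pair
      result += ((left(i)._1, (left(i)._2, right(j)._2)))
      i += 1
      j += 1
    }
  }
  result.result()
}
```

In the real setting the same idea pays off repeatedly, because once the data is sorted, every subsequent join against it can reuse the ordering.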
Presented at the Budapest Spark Meetup, May 2016. A practical introduction to using Spark SQL.
DataFrames in Apache Spark allow you to run good old SQL queries over terabytes of data distributed across a cluster of machines. Now you know roughly what a DataFrame is. But if you come to the talk you will get to see them in action and learn much more about how to use this API.
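For the impatient, here is a minimal sketch of the idea, with illustrative table and column names (not taken from the talk): you register a DataFrame as a view and then query it with plain SQL.

```scala
import org.apache.spark.sql.SparkSession

// Start a local session for experimentation; on a real cluster you would
// drop .master(...) and submit the application normally.
val spark = SparkSession.builder()
  .appName("sql-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A toy DataFrame standing in for terabytes of distributed data.
val people = Seq(("Alice", 34), ("Bob", 28), ("Carol", 45)).toDF("name", "age")
people.createOrReplaceTempView("people")

// Good old SQL, executed by Spark across the cluster.
val adults = spark.sql("SELECT name FROM people WHERE age > 30 ORDER BY name")
adults.show()
```

The same query runs unchanged whether `people` holds three rows or three billion; only the cluster behind the session changes.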
Presented at the Budapest Spark Meetup, May 2016. The story of how we integrated SQL in LynxKite.
This talk will introduce a complex real-world Spark application, the LynxKite big graph analytics system. Why is it built on Spark? What was straightforward to do and what did we have to spend significant effort on? You can ask any questions about life with Spark, but the focus will be on our integration of Spark SQL: the motivation, the implementation, and the results.
Presented at Big Data Universe, 2016. Advanced Scala techniques for compile-time-safe representation of computations.
LynxKite is a proprietary Spark-based graph analytics application. On the backend we represent graphs with a few fundamental types: vertex sets, edge sets, and attributes (which may belong to either and contain arbitrary Scala types). Operations, such as calculating PageRank, take some of these types as their inputs and return some as their outputs. They are much like functions, except for a few peculiarities: they are composed via the frontend UI, they are lazy, they are persistent, and we need to capture rich relationships among the inputs and outputs.
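To make the idea concrete, here is a simplified Scala sketch (not LynxKite's actual API, and restricted to vertex attributes only) of how typed entities and operations can capture input/output relationships at compile time while keeping execution deferred:

```scala
// Simplified stand-ins for the fundamental entity types.
case class VertexSet(name: String)
case class EdgeSet(name: String, src: VertexSet, dst: VertexSet)
case class Attribute[T](name: String, of: VertexSet)

// An operation declares typed inputs and outputs but computes nothing yet:
// like a function, except lazy and composable from a frontend UI.
trait Operation[Input, Output] {
  def outputs(input: Input): Output
}

// Example: PageRank consumes an edge set and yields a Double attribute
// that belongs to the source vertex set of those edges — a relationship
// between input and output that the types make explicit.
object PageRank extends Operation[EdgeSet, Attribute[Double]] {
  def outputs(edges: EdgeSet): Attribute[Double] =
    Attribute[Double]("pagerank", edges.src)
}

val users = VertexSet("users")
val follows = EdgeSet("follows", users, users)
val rank: Attribute[Double] = PageRank.outputs(follows)
```

The payoff is that nonsensical compositions — say, feeding a `Double` attribute to an operation expecting an `EdgeSet` — are rejected by the compiler rather than discovered at runtime.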
Presented at Spark Summit East, 2017.
Lynx Analytics develops a big graph analysis engine on top of Apache Spark. One of our recent developments is a recurrent neural network library that learns from the structure of the graph in order to predict missing features of vertices.
A real-life use case is demographic estimation where the task is to predict the age of different customers of a telco by exploring their connections to other people, the age of those people and other classical features like internet or phone usage patterns.
One of the main challenges we faced was developing a suitable training process. The usual way of training a supervised learning algorithm considers each vertex as an independent prediction problem. But because our algorithm uses the connections between vertices, we cannot treat them independently. On the other hand, if you consider the whole graph as one problem, then you do not have any separate training data at all. In this talk we will show some tricks that we used to perform both prediction and training on the same graph.
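As a generic illustration of this family of tricks (not necessarily the talk's exact method): on a single graph one can hide a random subset of the known labels, train while predicting them from their neighborhoods, and evaluate only on the hidden set. A plain-Scala sketch of the masking step:

```scala
import scala.util.Random

// A vertex with an optionally known label (e.g. a customer's age).
case class Vertex(id: Int, label: Option[Double])

// Hide a fraction of the known labels; return the masked graph plus the
// held-out ground truth for evaluation. Names here are illustrative.
def maskLabels(vertices: Seq[Vertex], holdOutFraction: Double, seed: Long)
    : (Seq[Vertex], Map[Int, Double]) = {
  val rnd = new Random(seed)
  val labeled = vertices.filter(_.label.isDefined)
  val heldOut = rnd.shuffle(labeled).take((labeled.size * holdOutFraction).toInt)
  val heldOutIds = heldOut.map(_.id).toSet
  val masked = vertices.map { v =>
    if (heldOutIds(v.id)) v.copy(label = None) else v
  }
  val truth = heldOut.map(v => v.id -> v.label.get).toMap
  (masked, truth)
}
```

The model then sees the held-out vertices only through their connections, which mirrors how truly unlabeled vertices will be predicted.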
The other main challenge is handling graphs so big that they do not fit into the memory of a single machine, while performing very resource-intensive computations on them. Tackling this requires storing the graph and computing on it in a distributed fashion. The difficulty is that we cannot simply cut the graph into smaller independent pieces, since the training process needs to propagate data along the edges.
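The following plain-Scala sketch (illustrative names, a toy hash partitioner) shows why naive cutting fails: in each propagation step every vertex sends its value along its outgoing edges, and any edge crossing the cut forces a message exchange between partitions.

```scala
case class Edge(src: Int, dst: Int)

// Toy hash partitioner: assign each vertex id to one of n partitions.
def partition(v: Int, numPartitions: Int): Int = v % numPartitions

// One propagation step: sum incoming values at each destination vertex,
// counting how many messages had to cross a partition boundary.
def propagate(values: Map[Int, Double], edges: Seq[Edge], numPartitions: Int)
    : (Map[Int, Double], Int) = {
  var crossPartitionMessages = 0
  val incoming = edges.groupBy(_.dst).map { case (dst, es) =>
    val sum = es.map { e =>
      if (partition(e.src, numPartitions) != partition(e.dst, numPartitions))
        crossPartitionMessages += 1  // this value must be shipped over the network
      values.getOrElse(e.src, 0.0)
    }.sum
    dst -> sum
  }
  (incoming, crossPartitionMessages)
}
```

However the graph is partitioned, the cross-partition messages cannot be avoided entirely, which is why propagation-style training must be designed for distribution from the start rather than bolted onto independent pieces.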
In the talk we will show core algorithmic ideas to tackle the above-mentioned problems and present some experimental results.