The Problems with Visual Programming Languages in Data Engineering

Recently I had a conversation about the value proposition of visual programming languages, especially in data engineering. Drag-and-drop ETL tools have been around for decades and so have visual tools for data modelling. Most arguments in favour of or against visual programming languages come down to personal preference. In my opinion there are, however, three fairly objective reasons why visual tools are a bad idea in data engineering.


A Quickie on Batch-Updating Non-Key Columns in Apache Phoenix

Apache Phoenix is a SQL skin for HBase, a distributed key-value store. Phoenix offers two flavours of UPSERT, UPSERT VALUES and UPSERT SELECT, but it may not be obvious from these how to batch-update non-key columns, since you need the whole key to add or modify records.

Shell Scripts to Ease Spark Application Development

When creating Apache Spark applications the basic structure is pretty much the same: for sbt you need the same build.sbt, the same imports, and the skeleton application looks the same. All that really changes is the main entry point, that is, the fully qualified class name. Since that’s easy to automate, I present a couple of shell scripts that help you create the basic building blocks to kick-start Spark application development and allow you to easily upgrade versions in the configuration.

A Quickie on Spark Actions, Laziness, and Caching

Time is important when thinking about what happens when executing operations on Spark’s RDDs. The documentation on actions is quite clear but it doesn’t hurt to look at a very simple example that may be somewhat counter-intuitive unless you are already familiar with transformations, actions, and laziness.
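
To give a flavour of the idea without a Spark installation, here is a minimal sketch in plain Python. It is only an analogy, not the Spark API: `LazyDataset`, `demo`, and `double` are hypothetical names made up for illustration. Transformations (`map`) merely record work, actions (`collect`) trigger it, and without caching every action recomputes the whole lineage.

```python
from typing import Callable, Iterable, Iterator, List, Optional


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical class, not the Spark API)."""

    def __init__(self, compute: Callable[[], Iterable[int]]) -> None:
        self._compute_fn = compute
        self._cache_requested = False
        self._cached: Optional[List[int]] = None

    def map(self, fn: Callable[[int], int]) -> "LazyDataset":
        # Transformation: only records what to do; nothing is evaluated here.
        return LazyDataset(lambda: (fn(x) for x in self._iterate()))

    def cache(self) -> "LazyDataset":
        # Only marks the dataset; the first action materializes it.
        self._cache_requested = True
        return self

    def _iterate(self) -> Iterator[int]:
        if self._cached is not None:
            return iter(self._cached)
        if self._cache_requested:
            self._cached = list(self._compute_fn())
            return iter(self._cached)
        return iter(self._compute_fn())

    def collect(self) -> List[int]:
        # Action: forces evaluation of the whole lineage.
        return list(self._iterate())


def demo() -> List[int]:
    """Return how often `double` has run at four points in time."""
    calls: List[int] = []

    def double(x: int) -> int:
        calls.append(x)  # record every evaluation
        return 2 * x

    counts: List[int] = []

    doubled = LazyDataset(lambda: range(3)).map(double)
    counts.append(len(calls))  # after the transformation only
    doubled.collect()
    doubled.collect()
    counts.append(len(calls))  # after two actions, no caching

    calls.clear()
    cached = LazyDataset(lambda: range(3)).map(double).cache()
    counts.append(len(calls))  # cache() alone evaluates nothing
    cached.collect()
    cached.collect()
    counts.append(len(calls))  # after two actions, with caching
    return counts


if __name__ == "__main__":
    print(demo())
```

In this toy model the function runs once per element per action unless the result is cached, and neither `map` nor `cache` does any work by itself. Real Spark adds partitioning, storage levels, and eviction on top, so treat this purely as a mental model.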


An Overview of Apache Streaming Technologies

There are many technologies for streaming data: simple event processors, stream processors, and complex event processors. Even within the open-source community there is a bewildering number of options, and the differences between them are sometimes minor, poorly documented, or hard to find. That’s why I’ve decided to create an overview of Apache streaming technologies, including Flume, NiFi, Gearpump, Apex, Kafka Streams, Spark Streaming, Storm (and Trident), Flink, Samza, Ignite, and Beam.

ETL: A Simple Package to Load Data from Views

A common, native way to load data into tables in Oracle is to create a view to load the data from. Depending on how the view is built, you can either refresh (i.e. overwrite) the data in a table or append fresh data to the table. Here, I present a simple package, ETL, that only requires you to maintain a configuration table and, obviously, the source views (or tables) and target tables.

Read That For Me

You’re a busy (wo)man. You cannot keep abreast of all the goings-on in the world of data, technology, and science, even though you would like to.

So, allow me to recap articles and posts I think are interesting or thought-provoking.

The Case for Industrial Data Science

It has — perhaps somewhat prematurely — been called the sexiest job of the twenty-first century, but whether you buy into the Big Data hype or not, data science is here to stay.

The available literature, the majority of courses in both the virtual and real world, and the media all paint the picture of the data science ‘artiste’: a data bohemian who lives among free, like-minded spirits in lofty surroundings, who receives sacks of money in exchange for genuine works of art created with any possible ‘cool’ tool that flutters by in whatever direction the wind is blowing that day.

The reality for many in the field is quite different. Corporations rarely grant anyone unfettered access to all data, and similarly they are not willing to try and buy every new tool that hits the market, simply to satisfy someone’s curiosity. Furthermore, industrial data science has requirements that are much stricter than what is commonly taught in programmes around the world, and it’s time to make the case for industrial data science.

Setting up Scala for Spark App Development

Apache Spark is a popular framework for distributed computing, both within and without the Hadoop ecosystem. Spark offers interactive shells for Scala as well as Python. Applications can be written in any language for which there is an API: Scala, Python, Java, or R. Since it can be daunting to set up your environment to begin developing applications, I have created a presentation that gets you up and running with Spark, Scala, sbt, and ScalaTest in (almost) no time.

An Overview of File and Serialization Formats in Hadoop

Apache Hadoop is the de facto standard in Big Data platforms. It’s open source, it’s free, and its ecosystem is gargantuan. When dumping data into Hadoop, the question often arises of which container and which serialization format to use. In this post I provide a summary of the various file and serialization formats.