Data preparation: how to choose the right solution for your needs?


I regularly use 4 data preparation solutions: Alteryx, Tableau Prep, Talend, and Anatella. The G2 benchmark lists no less than 40 of them. This article explains why the G2 benchmark is not necessarily relevant and what criteria to use to compare the ETL (Extract – Transform – Load) solutions best suited to your needs.


Credits: Shutterstock

Introduction: what is an ETL?

If you are reading this article, chances are you already know what an ETL is, but here is a quick reminder anyway. An ETL (Extract – Transform – Load) is a data preparation solution. With this type of software, you can:

  • import data (Extract) from different environments
  • work on them (Transform) to improve their quality, enrich them, and so on
  • and “send” them (Load) to a new environment once they have been prepared

An ETL is not in itself a tool reserved for specialists. Most ETLs work on a no-code principle: “boxes” that each correspond to a particular function are linked together to form a “pipeline.” It’s straightforward and effective.
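For readers who are more comfortable with code, here is a minimal sketch of the same Extract – Transform – Load idea in Python with pandas; the file names and column names are purely hypothetical, and a no-code ETL would express each of these steps as a box in the pipeline.

```python
import pandas as pd

# Extract: read raw data from a source (hypothetical CSV file)
orders = pd.read_csv("raw_orders.csv")

# Transform: clean and enrich the data
orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")
orders = orders.dropna(subset=["customer_id"])                 # drop rows without a customer
orders["revenue"] = orders["quantity"] * orders["unit_price"]  # add a computed column

# Load: write the prepared data to a new environment (here, a Parquet file)
orders.to_parquet("prepared_orders.parquet", index=False)
```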


A data preparation pipeline using Tableau Prep




Data preparation in Anatella



A data preparation pipeline in Talend



Data preparation in Alteryx


Above, I give examples of these pipelines in four solutions: Talend, Anatella, Alteryx, and Tableau Prep. As you can see, they all look very similar.

So, my definition of an ETL includes the “data transformation” dimension, not just moving data from one place to another.


Credits: Shutterstock

The G2 benchmark criteria

Let’s now turn to the ETL benchmark proposed by G2. For those who don’t know G2, it is a marketplace that collects professional users’ opinions on every type of software imaginable. I think their slogan makes a lot of sense:

Why is it easier to get unbiased information about a $100 hotel room than a $100,000 piece of software?

And indeed, I think it is crucial to have quality information, as objective as possible, when buying software. The buyer is often at a loss when faced with a plethora of solutions for the same need.

G2 has a category reserved for ETLs (118 at the time of writing this article) and another for data preparation solutions (about 40). The benchmark has 2 axes:

  • market presence
  • satisfaction

The map is divided into four quadrants, which will remind you of the benchmarks proposed by Gartner (the famous “magic quadrants”):

  • Leaders
  • Contenders
  • Niche
  • High performers

 


The comparison of data preparation tools proposed by G2

Frankly, I like the principle of G2, but I wonder if this map is really useful.

First of all, there is a problem with terminology. The 118 tools listed in the ETL category are more like “EL” tools: they let you move data from one file format to another. The “real” ETLs, the ones that transform data, are in the “data preparation” category. I find that this creates a lot of confusion.

Secondly, the proposed segmentation does not make much sense for the ETL user. A map should be a decision-making tool: it should allow users to immediately identify the solution(s) most appropriate for their use case. But what is proposed here is a “marketing” view, which is not helpful for those looking for a technical solution to a particular problem.

The question, therefore, is which functionalities of an ETL are truly differentiating. This is what I will discuss in the next section.


Credits: Shutterstock

ETL: the most important selection criteria

First of all, I would like to point out that I do not claim to have the absolute truth. As I said in the introduction, there are 118 tools listed in the G2 ETL benchmark. Nobody knows them all, and a comparison is therefore necessarily subjective.

In the following paragraphs, I will develop the aspects of an ETL that I consider truly differentiating.

Criterion n°1: data processing available in the ETL

Seriously, this is THE number 1 criterion I would take into consideration if I had to buy an ETL tomorrow. The more processing functionalities (cleaning, transformation, etc.) you have available, the less you will struggle to prepare your data. If the features you need are not available natively, you will have to program them yourself. And as far as I’m concerned, I prefer to use no-code rather than bother with programming. A concrete example of a handy feature is fuzzy matching. I have compared different fuzzy matching algorithms and approaches in different ETLs, and some are better than others. Having such features available natively is a significant advantage.
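To give an idea of what “programming it yourself” means when fuzzy matching is not built in, here is a minimal Python sketch using only the standard library’s difflib; the company names are invented for illustration, and real ETLs typically offer more sophisticated algorithms.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1 for two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Match each incoming name against a reference list and keep the best candidate
reference = ["Dupont SA", "Durand SARL", "Martin & Fils"]
for name in ["DUPONT S.A.", "Martin et Fils"]:
    best = max(reference, key=lambda ref: similarity(name, ref))
    print(f"{name} -> {best} ({similarity(name, best):.2f})")
```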


The features available in Tableau Prep are pretty limited but cover the most common needs.

Criterion n°2: speed

I think the importance of processing speed is underestimated. Yet, as I have shown in this article, the differences can be substantial even on trivial operations like sorting. When handling large files, a slow ETL can quickly become a hindrance. For me, speed is essential because I often process and reconcile large files and… because I hate waiting.
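If you want to verify this for yourself during an evaluation, a very simple timing harness is enough to surface the differences. The sketch below (in Python, with an arbitrary 10-million-row table) simply times a plain sort; the same idea applies to any tool you can script or observe.

```python
import time

import numpy as np
import pandas as pd

# Build a large table (10 million random keys) and time a simple sort,
# the kind of trivial operation on which tools can differ widely.
df = pd.DataFrame({"key": np.random.randint(0, 1_000_000, size=10_000_000)})

start = time.perf_counter()
df = df.sort_values("key")
print(f"Sort took {time.perf_counter() - start:.2f} s")
```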

Criterion n°3: number of input and output connectors

More and more connectors are natively embedded in ETLs to serve the most diverse needs. I pay particular attention to the output connectors because, in most cases, I feed the data into a data visualization solution. For Tableau, exporting to a .hyper file is particularly useful, and for this reason, I particularly appreciate Anatella for complex processing and Tableau Prep for lighter processing.
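As an illustration of why native output connectors matter: outside of these tools, producing a .hyper file usually requires an extra library. The sketch below assumes the third-party pantab package (a wrapper around the Tableau Hyper API) and a hypothetical input file; a native .hyper connector saves you this step entirely.

```python
import pandas as pd
import pantab  # third-party wrapper around the Tableau Hyper API (assumed installed)

# A prepared dataset, ready to be pushed to Tableau (hypothetical file)
df = pd.read_csv("prepared_orders.csv")

# Write the DataFrame to a .hyper extract that Tableau can open directly
pantab.frame_to_hyper(df, "prepared_orders.hyper", table="Extract")
```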


Talend offers 16 ways to connect to data.


Anatella offers 38 different connectors.


Alteryx offers 32 different data connectors.

Criterion n°4: visualization features

It may seem entirely inappropriate to want to do data visualization in an ETL, but it is very convenient. When you are doing data mining, being able to visualize data is essential. And if you have to export your dataset and open Tableau or Power BI every time, good luck. In the pipeline below (created with Anatella), you can see that I have a box that allows me to call a chart in R. This is very handy because it lets me see directly what results I can expect.
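The same kind of quick visual check can be done in code. Here is a minimal Python sketch (hypothetical file and column names) that plots an intermediate result before deciding on the next steps of the pipeline; this is the role the R chart box plays in the Anatella pipeline below.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Intermediate result of the pipeline (hypothetical file)
df = pd.read_parquet("prepared_orders.parquet")

# Quick look at the distribution before continuing the pipeline
df["revenue"].hist(bins=50)
plt.title("Revenue distribution (intermediate check)")
plt.show()
```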

 


Pipeline made with Anatella in which I inserted a visualization box to check the result quickly.


Conclusion

In the end, I think that the choice of an ETL must be based on objective criteria, dictated by the needs of the end users. Each business context is different, and so are the requirements.

My advice would be to proceed as follows:

  1. Define the functionalities that will allow users to be more efficient
  2. Choose a series of tools and test them in a “laboratory” environment for a sufficiently long period
  3. Rate each solution against these criteria
  4. Make your choice

This approach seems to be much sounder than relying on a general benchmark.

Feel free to comment and indicate your criteria of choice! Thank you.

 

