2 June 2021 | 624 words, 3 min. read | Latest update: 13 June 2021

Data preparation: how to reduce the processing time by 85%

By Pierre-Nicolas Schwab, PhD in marketing, director of IntoTheMinds
In a previous article, I benchmarked 4 ETL solutions on the processing of a file of one billion lines. Today I test the effect of an SSD and of proprietary file formats on processing speed in Alteryx, Tableau Prep, Talend, and Anatella. The results are quite unexpected.

Introduction and review

In my previous analysis, I compared the processing speed of 4 data preparation solutions: Alteryx, Talend, Tableau Prep, and Anatella.

After its publication, several voices on social networks criticized the content (why test the processing speed?) and the form (why not optimize the configuration by placing the file to be processed on an SSD?).

I defended my choice to test speed by explaining my frustration with the slowness of some solutions and by reminding people that processing time is expensive:

  • in minutes spent waiting in front of the machine
  • in processing costs in the cloud

Remember that your “cloud” bill is first and foremost made up of CPU rental costs. Storage has become a very affordable commodity.

If you choose a “no-code” ETL solution, you’d better choose one that is fast, especially if you work in the cloud and use it often.


ETLs: Alteryx vs. Tableau Prep vs. Talend vs. Anatella

The choice of ETLs to compare is entirely arbitrary. These are simply the ones I have access to:

  • Talend Open Studio v7.3.1
  • Tableau Prep 2020.2.1
  • Alteryx 2020.1
  • Anatella v2.35

The first one is not a “data preparation” solution per se. The last one is ranked as a “high performer” in the G2 benchmark. I have been using it for years.

Now let’s get to the results. How long does it take to process a dataset of one billion lines stored on an SSD?


Results: Effect of an SSD on ETL processing time

First of all, let me remind you that I started with a 43.6 GB CSV file (that’s significant!) and that I performed 2 simple operations (a sort and a “group by”). I refer you to the initial article for all the diagrams of the data processing pipelines. Initially, the processing ran on a 7,200 rpm HDD.
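To make the two operations concrete, here is a minimal Python sketch of a sort followed by a “group by”. The column names (`customer`, `amount`), the sample data, and the choice of a sum as aggregation are hypothetical; the actual pipelines are the ones described in the initial article. The sketch also loads everything in memory, which would not be realistic for a file of one billion lines.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample standing in for the real CSV file
SAMPLE_CSV = """customer,amount
carol,5
alice,10
bob,7
alice,3
"""

def sort_rows(rows, key):
    """Return the rows sorted by the given column."""
    return sorted(rows, key=lambda row: row[key])

def group_sum(rows, key, value):
    """Sum the `value` column for each distinct `key` (a "group by")."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += int(row[value])
    return dict(totals)

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
sorted_rows = sort_rows(rows, "customer")
totals = group_sum(sorted_rows, "customer", "amount")
print(totals)
```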

For this new test, I moved the files to my SSD and ran each query 3 times. I kept the lowest of the 3 values. The 3 runs differed by less than 1%.
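The “run 3 times, keep the lowest value” protocol can be sketched in a few lines of Python. This is only an illustration of the measurement logic: the ETL tools were not launched from a script like this one, and `sample_workload` is a hypothetical stand-in for a real pipeline.

```python
import time

def best_of(func, runs=3):
    """Run func `runs` times; return the shortest wall-clock time and all timings."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings), timings

def spread(timings):
    """Relative gap between the slowest and the fastest run."""
    return (max(timings) - min(timings)) / min(timings)

def sample_workload():
    # Hypothetical stand-in for launching an ETL pipeline
    sorted(range(200_000), reverse=True)

best, timings = best_of(sample_workload)
print(f"best: {best:.4f}s, spread: {spread(timings):.1%}")
```

Keeping the minimum rather than the mean limits the influence of background activity on the machine, and the spread gives a quick check that the runs are consistent.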

                Without SSD (s)   With SSD (s)   Difference
Alteryx                   2,290          1,609       -30.1%
Anatella                    730            679        -6.9%
Tableau Prep              2,526          2,691        +6.5%
Talend                   13,954         14,340        +2.7%

The results are surprising. I expected the SSD to make a clear difference, but in the end the effect is small for most tools. Only Alteryx benefits substantially, with a 30.1% reduction in processing time, while Anatella gains a modest 6.9%. Processing with Talend still takes forever, and Tableau Prep is even slightly slower on the SSD.
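For readers who want to check the “difference” column, the relative change is simply (after - before) / before. Here is a minimal sketch using the times from the table. Recomputing from these rounded times can differ from the published percentages by a few tenths of a point, presumably because of rounding or truncation in the original figures (this is an assumption).

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100

# Processing times in seconds, taken from the table: (without SSD, with SSD)
hdd_vs_ssd = {
    "Alteryx": (2290, 1609),
    "Anatella": (730, 679),
    "Tableau Prep": (2526, 2691),
    "Talend": (13954, 14340),
}

for tool, (hdd, ssd) in hdd_vs_ssd.items():
    print(f"{tool}: {percent_change(hdd, ssd):+.1f}%")
```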

Fortunately, I still had a card to play… and this one will pay off.


Results: The effect of the proprietary data format on the processing time

The other aspect I wanted to investigate was the file format. Alteryx and Anatella offer proprietary file formats that are supposed to improve performance: .yxdb and .gel, respectively.
So, I replaced the CSV file with a file in each tool's proprietary format. And as you can see, the result is spectacular.

            SSD + CSV file (s)   SSD + proprietary file format (s)   Difference
Alteryx                  1,609                               1,116       -30.6%
Anatella                   679                                  96       -85.8%
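Another way to read these figures is as a speed-up factor, that is, how many times faster the processing becomes. A minimal sketch, using only the times from the table above:

```python
def speedup(before, after):
    """How many times faster the processing is after the change."""
    return before / after

# Processing times in seconds: (SSD + CSV, SSD + proprietary format)
csv_vs_proprietary = {
    "Alteryx": (1609, 1116),
    "Anatella": (679, 96),
}

for tool, (csv_time, prop_time) in csv_vs_proprietary.items():
    print(f"{tool}: {speedup(csv_time, prop_time):.2f}x faster")
```

A reduction of roughly 86% in processing time corresponds to a job that runs about 7 times faster, which is sometimes easier to grasp than a percentage.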

 


Conclusion

The first conclusion I draw is that SSDs do not necessarily bring an improvement in the processing time. It all depends on the solution used.
The SSD brings a speed-up with Alteryx and, to a lesser degree, with Anatella, but the largest gain comes from using the proprietary file format. With Anatella in particular, the processing time drops from 679 to 96 seconds, a reduction of 85.8%.

 

 



Posted in Data & IT.
