In a previous article, I ran a benchmark of 4 ETL solutions processing a file of one billion lines. Today I test the effect of an SSD and of proprietary file formats on processing speed in Alteryx, Tableau Prep, Talend, and Anatella. The results are quite unexpected.
Introduction and review
In my previous analysis, I compared the processing speed of 4 data preparation solutions: Alteryx, Talend, Tableau Prep, and Anatella.
After its publication on social networks, several voices criticized both the content (why test the processing speed?) and the form (why not optimize the configuration by placing the file to be processed on an SSD?).
I defended my choice of a speed test by explaining my frustration with the slowness of some solutions and by reminding people that processing time is expensive:
- in minutes spent waiting in front of the machine
- in processing costs in the cloud.
Remember that your “cloud” bill is first and foremost made up of CPU rental costs. Storage has become a very affordable commodity.
If you choose a “no-code” ETL solution, you’d better choose one that is fast, especially if you work in the cloud and use it often.
ETLs: Alteryx vs. Tableau Prep vs. Talend vs. Anatella
The choice of ETLs to compare is entirely arbitrary. These are simply the ones I have access to:
- Talend Open Studio v7.3.1
- Tableau Prep 2020.2.1
- Alteryx 2020.1
- Anatella v2.35
The first one is not a “data preparation” solution per se. The last one is a solution ranked in the G2 benchmark as “high performer”. I have been using it for years.
Now let’s get to the results. How long does it take to process a dataset of one billion lines stored on an SSD?
Results: Effect of an SSD on ETL processing time
First of all, let me remind you that I started with a 43.6 GB CSV file (that’s significant!) and performed 2 simple operations on it (a sort and a “group by”). I refer you to the initial article for the diagrams of all the data processing pipelines. Initially, the processing ran on a 7,200 rpm HDD.
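For readers who want a concrete picture of the two operations, here is a minimal Python sketch of a sort followed by a “group by”. The column names (`customer`, `amount`) and the sample rows are hypothetical; the article does not describe the real file's schema. This in-memory version only illustrates the logic and would not be practical at one billion lines.

```python
import csv
import io
from collections import defaultdict
from operator import itemgetter

# Hypothetical sample standing in for the real CSV file;
# the column names are assumptions made for this illustration.
SAMPLE = """customer,amount
bob,5
alice,10
bob,7
alice,3
"""


def sort_rows(rows, key):
    """Return the rows sorted by the given column."""
    return sorted(rows, key=itemgetter(key))


def group_sum(rows, key, value):
    """Sum the `value` column for each distinct `key`."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += int(row[value])
    return dict(totals)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
sorted_rows = sort_rows(rows, "customer")
totals = group_sum(rows, "customer", "amount")

print([r["customer"] for r in sorted_rows])
print(totals)
```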
For this new test, I moved the files to my SSD and ran each query 3 times. I kept the lowest of the 3 values; the runs differed by less than 1%.
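The "best of 3" measurement can be sketched as a small timing harness. This is only an illustration of the method, not the way the ETL tools were actually timed: `workload` is a hypothetical placeholder for launching a real job, and `best_of` is a name introduced here.

```python
import time


def best_of(func, runs=3):
    """Run `func` several times and return (fastest, spread).

    `spread` is the relative gap between the slowest and the fastest run.
    """
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    fastest = min(timings)
    slowest = max(timings)
    spread = (slowest - fastest) / fastest if fastest > 0 else 0.0
    return fastest, spread


def workload():
    # Hypothetical placeholder: a real benchmark would launch the ETL job here.
    sum(range(100_000))


fastest, spread = best_of(workload)
print(f"best of 3: {fastest:.4f}s, spread between runs: {spread:.1%}")
```

Keeping the lowest value rather than the average reduces the influence of background activity on the machine, which can only slow a run down.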
|without SSD||with SSD||difference|
The results are surprising. While I expected the SSD to have an effect, in the end there is almost none, except with Anatella, where the processing time decreases by 6.9%, and with Alteryx, where it drops by 30.1%. Processing with Talend still takes forever, and with Tableau Prep there is even a slight increase.
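The percentages in the "difference" column are relative changes with respect to the run without SSD. A minimal sketch of that calculation follows; the timings used are purely hypothetical and are not the measured values from this benchmark.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent.

    A negative result means the processing time went down.
    """
    if before <= 0:
        raise ValueError("the reference time must be positive")
    return (after - before) / before * 100.0


# Hypothetical timings in seconds, chosen only to illustrate the formula.
hdd_seconds = 200.0
ssd_seconds = 150.0

print(f"{percent_change(hdd_seconds, ssd_seconds):+.1f}%")
```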
Fortunately, I still had a card to play… and this one will pay off.
Results: The effect of the proprietary data format on the processing time
The other aspect I wanted to investigate was the file format. Alteryx and Anatella offer proprietary file formats that are supposed to improve performance: .yxdb and .gel, respectively.
So, I replaced the CSV file with a file in each tool's proprietary format. And as you can see, the result is spectacular.
|SSD + CSV file||SSD + proprietary file format||Difference|
The first conclusion I draw is that an SSD does not necessarily improve processing time. It all depends on the solution used.
While a speed-up is noticeable with Alteryx and Anatella, the clearest gain comes from using the proprietary file format. With Anatella in particular, the processing time is optimized to the extreme and drops to 96 seconds.