How to use an ETL solution to extract information from an Excel file and reconcile it? After explaining a method to automate the extraction from many Excel files, in today’s post, I will go into more depth and explain how I used ternary operators to extract the desired information.
Presentation of the problem
The problem I have faced is relatively standard. You have a poorly structured file that has been filled in by a user or by a machine, and you want to extract precise information from it. The file format doesn’t matter, but the important thing is to have a method that allows you to extract the data and restructure it.
The extraction problem presented itself to me when I received a package of Excel files from a client. As I explained in my previous post, manual reformatting work would have been possible but not sustainable. So, I had to look for a solution with my favorite ETL (Anatella) to achieve my goal. And it is by using a ternary operator that the magic worked.
Simply put, the ternary operator is the equivalent of the IF function in Excel.
If you are reading this article, you probably know what it is. So, I will remind you of the operator’s syntax and refer those who want to know more to external resources.
The ternary operator is written as follows: (condition)?(iftrue):(iffalse) and allows to return one or the other argument depending on the condition tested.
How to extract data from specific cells of an Excel file
It is common to encounter processes in which Excel templates are created, distributed, filled out, and then collected to analyze them. The data preparation process can quickly become laborious if there are many Excel templates or a plethora of information to extract from each template. However, there is one crucial aspect that we will use to our advantage: the information is always entered in the same cells of the Excel file.
Knowing the cell’s coordinates in the table, I use the following formula to extract the right data
_n == (row number)? column number:
This formula allows me to define the cells from which I want to extract the content and returns an empty cell in all other cases. Here is what this gives for 3 cells.
_n == 75? C5:
_n == 80? C4:
_n == 82? C4:
At the same time, I take the opportunity to define a “label” variable that allows me to associate the cell’s content to the type of information extracted.
The file’s initial structure is kept; I find myself with 2 columns (the content of the cell and the label) and a lot of empty rows. I have to transpose the information on a line thanks to the “unflatten” feature.
The last small difficulty is the date. The extraction of the “date” field is in “elapsed time” format. To use this information in a data visualization solution, you should convert everything into a date/month/year (dd/mm/yyyy) format. However, Excel uses a particular reference date.
Once this last pitfall is over, all you have to do is save the result in the format of your choice.Tags: data mining