Analyzing Power Query Performance – When Source Is Large Files

Key findings

Before explaining the method I used to measure Power Query performance when dealing with files, let me give you the bottom line:

  • Union (append) of CSVs is not a cause for slowness.
    One CSV file with 2.5M rows and 10 CSV files with 250K rows each combined using binary combine take the same time to run (see the sketch after this list).
  • Excel files are 4-5 times slower to load than CSV!
  • Combining multiple Excel files slows things down a bit. Unlike the binary combine used for CSVs, the Excel combine uses a custom function to extract the data and changes column types every time it pulls an Excel file (changing types is the most time-consuming step).
  • Unlike Power Pivot, reducing the number of columns does not make it faster.
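
As a reference for the CSV append scenario, here is a minimal sketch of the binary combine pattern over a folder of CSV files. The folder path, encoding and delimiter are placeholders, not the exact setup from my tests:

let
 Source = Folder.Files("C:\Data\CsvTests"),
 // keep only the CSV files in the folder
 #"Csv Only" = Table.SelectRows(Source, each Text.Lower([Extension]) = ".csv"),
 // append the raw file contents into one binary - this is the "binary combine" step
 #"Combined Binaries" = Binary.Combine(#"Csv Only"[Content]),
 // parse the combined binary as one big CSV
 #"Imported CSV" = Csv.Document(#"Combined Binaries", [Delimiter = ",", Encoding = 65001, QuoteStyle = QuoteStyle.None]),
 #"Promoted Headers" = Table.PromoteHeaders(#"Imported CSV", [PromoteAllScalars = true])
 // note: with more than one file, the header row of each additional file ends up as a data row and may need to be filtered out
in
 #"Promoted Headers"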

How can I measure query performance in Power Query?

Power Query performance was like a black box to me. There is no monitoring on it, no duration counters, nothing.
The only thing we have is the trace log. Some have covered the topic of connecting to the trace log files; you can find more information about it here and here.

The method

After connecting to the trace log and cleaning it a bit, I noticed there is one Action called "RemotePageReader/RunStub" which holds the actual Power Query run time in the Duration column, so I filtered out the rest.
One other thing to note: Power Query creates a new trace log file every now and then, so I connected to the trace folder and pulled all the relevant trace files. After all the cleaning, the log looks something like this:

The duration only measures the time it took the Power Query query to run. Load-to-data-model time is not included.

There is no information about which query is running (at least nothing I can understand), so I had to run one query at a time and refresh the trace log after every run to get the run duration.
I ran each query 3-4 times, clearing the cache in between, to make sure the duration I get is consistent.

For the tests I used a CSV file with 2.25 million rows and 20 columns. I then cut it in half and copied it, then cut it to 1/10 and copied it 9 more times. Then I created another 10 files with 10 columns instead of 20.
I then pulled the exact same data into Excel worksheets, but since you cannot store more than 1M rows in an Excel file, I created a 225K-row file 10 times.

Summary of the runs

 

The findings again:

  • Union (append) of CSVs is not a cause for slowness.
    One CSV file with 2.5M rows and 10 CSV files with 250K rows each combined using binary combine take the same time to run.
  • Excel files are 4-5 times slower to load than CSV!
  • Combining multiple Excel files slows things down a bit. Unlike the binary combine used for CSVs, the Excel combine uses a custom function to extract the data and changes column types every time it pulls an Excel file (changing types is the most time-consuming step); a sketch of doing the type change once, after the combine, follows this list.
  • Reducing the number of columns does not make it faster.
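
Since the per-file type change is the most time-consuming step, one variation worth testing is to keep the per-file function to pure extraction and change column types only once, after the append. A minimal sketch of that idea follows; the folder path, sheet name and column types are placeholders, not the exact query from my tests:

let
 Source = Folder.Files("C:\Data\ExcelFiles"),
 #"Xlsx Only" = Table.SelectRows(Source, each Text.Lower([Extension]) = ".xlsx"),
 // the per-file function only extracts the sheet - no type changes here
 GetSheet = (file as binary) as table =>
 let
  Workbook = Excel.Workbook(file, true)
 in
  Workbook{[Item = "Sheet1", Kind = "Sheet"]}[Data],
 #"Extracted Data" = Table.AddColumn(#"Xlsx Only", "Data", each GetSheet([Content])),
 // append all the per-file tables
 #"Combined Tables" = Table.Combine(#"Extracted Data"[Data]),
 // change column types once, after the append, instead of once per file
 #"Changed Type" = Table.TransformColumnTypes(#"Combined Tables", {{"Date", type date}, {"Amount", type number}})
in
 #"Changed Type"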

Why do I even care?

I've seen many projects where the data source was files/reports exported from other systems. Most of these systems let you choose between running an automated daily report covering one day, or loading monthly/yearly files at daily granularity which replace the previous ones.
You can also sometimes choose between a CSV export and an Excel export. So understanding which approach has better performance can help.

————————————————————————

Power Query code I used for analyzing the trace:

let
 // the trace folder; the path is specific to the user and Office version
 Source = Folder.Files("C:\Users\idan_000\AppData\Local\Microsoft\Office\16.0\PowerQuery\Traces"),
 // keep only trace files whose name does not contain "EXCEL"
 #"Filtered Rows" = Table.SelectRows(Source, each not Text.Contains([Name], "EXCEL")),
 // append all trace files into one binary and read it line by line
 #"Combined Binaries" = Binary.Combine(#"Filtered Rows"[Content]),
 #"Imported Text" = Table.FromColumns({Lines.FromBinary(#"Combined Binaries",null,null,1255)}),
 // split on the first two ":" delimiters to isolate the JSON payload of each trace line
 #"Split Column by Delimiter" = Table.SplitColumn(#"Imported Text","Column1",Splitter.SplitTextByEachDelimiter({":"}, QuoteStyle.None, false),{"Column1.1", "Column1.2"}),
 #"Split Column by Delimiter1" = Table.SplitColumn(#"Split Column by Delimiter","Column1.2",Splitter.SplitTextByEachDelimiter({":"}, QuoteStyle.None, false),{"Column1.2.1", "Column1.2.2"}),
 #"Removed Other Columns" = Table.SelectColumns(#"Split Column by Delimiter1",{"Column1.2.2"}),
 // drop non-JSON lines (an unrelated error message that showed up in my trace)
 #"Filtered Rows2" = Table.SelectRows(#"Removed Other Columns", each [Column1.2.2] <> " Unable to connect to any of the specified MySQL hosts."),
 // parse the JSON payload and expand the fields we care about
 #"Parsed JSON" = Table.TransformColumns(#"Filtered Rows2",{},Json.Document),
 #"Expanded Column1.2.2" = Table.ExpandRecordColumn(#"Parsed JSON", "Column1.2.2", {"Start", "Action", "ProductVersion", "ActivityId", "Process", "Pid", "Tid", "Duration", "evaluationID"}, {"Start", "Action", "ProductVersion", "ActivityId", "Process", "Pid", "Tid", "Duration", "evaluationID"}),
 #"Changed Type" = Table.TransformColumnTypes(#"Expanded Column1.2.2",{{"Start", type datetime}}),
 #"Parsed Time" = Table.TransformColumns(#"Changed Type",{{"Duration", each Time.From(DateTimeZone.From(_)), type time}}),
 #"Reordered Columns" = Table.ReorderColumns(#"Parsed Time",{"Start", "Action", "Duration", "ProductVersion", "ActivityId", "Process", "Pid", "Tid", "evaluationID"}),
 #"Sorted Rows" = Table.Sort(#"Reordered Columns",{{"Start", Order.Descending}}),
 // keep only the action that holds the actual query run time
 #"Filtered Rows1" = Table.SelectRows(#"Sorted Rows", each ([Action] = "RemotePageReader/RunStub"))
in
 #"Filtered Rows1"
