I would like to share with you an interesting scenario I encountered lately.
Remove Duplicates' functionality in excel keeps only the first occurrence of each unique value, and removes all of the later occurrences.
In power query, this logic seems to work the same, but it does contain an interesting "bug".
Let's look at this example:
Let's say we want to use "Remove Duplicates" in power query on this table, and expect to get back only 1 row of Group 'A'.
Scenario 1: Simple Remove Duplicates
If we import this table and run the remove duplicates, we expect to keep Group 'A' with amount = 1 , and indeed that's what happens, and Group 'A' with amount = 4 is removed
Scenario 2: Sort first and then Remove Duplicates
Now for the interesting part.
What if we want to first sort the table in descending order by Amount, and basically want to keep 1 instance of each Group, and keep the one that has the highest amount.
We would expect that after sorting it, removing the duplicates would keep the first occurrence of each Group.
But that does not work as expected.
Here is what we get after removing duplicates:
Not sure about the cause of this, but I assume that Power Query's remove duplicates does not work on the actual order of the rows, but on a hidden internal index that was given to the rows once we start the query, before the sort!
In order to work around this, I found that using Table.Buffer to store the sorted table in the cache solves the problem (I assume that this hidden-internal-row-index is remapped after doing this).
Here is the final code:
let
Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Sorted Rows" = Table.Sort(Source,{{"Amount", Order.Descending}}),
Buffered = Table.Buffer( #"Sorted Rows"),///<————– The Workaround
#"Removed Duplicates" = Table.Distinct(Buffered, {"Group"})
in
#"Removed Duplicates"
If you want to look further into these scenarios, please feel free to download this file:[wpdm_file id=4 desc="true" ]