The SSIS Lookup Transformation is a classic SSIS component where the values of common columns between an input and a reference dataset are compared to extract additional columns of interest; that is, the objective is to add data from one dataset to other. Overall, a lookup transformation requires three elements: an input dataset, a reference dataset, and an equi-join.
In order to illustrate the lookup concept, the query below shows a simple example of a lookup operation implemented in T-SQL.
select c.CurrencyId, s.Amount
from Sale s -- Input dataset
inner join Currency c -- Reference dataset
on s.Currency = c.Currency -- Equi-Join
- Input dataset: this is the base dataset that needs some extra information from the reference dataset. In the example above this would be the Sale table.
- Reference dataset: this is the dataset providing additional information not available on the input dataset. In the example above this would be the Currency table.
- Equi-join: this is the clause joining the input and the reference datasets which uses equality (=) for comparisons. See line 4 on Query 1.
However, when comparing T-SQL lookups with the SSIS Lookup Transformation there is one major catch as the SSIS transformation only returns the first match when there are multiple matches in the equi-join. Here’s the official documentation from books online:
- “If there is no matching entry in the reference dataset, no join occurs. By default, the Lookup transformation treats rows without matching entries as errors. However, you can configure the Lookup transformation to redirect such rows to a no match output…
- If there are multiple matches in the reference table, the Lookup transformation returns only the first match returned by the lookup query. If multiple matches are found, the Lookup transformation generates an error or warning only when the transformation has been configured to load all the reference dataset into the cache. In this case, the Lookup transformation generates a warning when the transformation detects multiple matches as the transformation fills the cache.”
This means that a lookup transformation cannot be seen as a standard relational join. A relational join can handle a many-to-many relationship between two datasets; whereas, a lookup operation, by definition, expects a many to one relationship.
SSIS Lookup Transformations
Creating a lookup transformation in SSIS is a relatively simple, visual process where all you need is an input dataset, a reference dataset, and an equi-join; however, creating optimised lookup transformation in SSIS may be a different kettle of fish. In this section, I won’t bother with listing a bunch of screenshots showing you how to put boxes together to do a lookup transformation and instead I will list a set of common SSIS techniques used to create optimised lookup transformations.
If you are after some basic information about creating lookup transformations in SSIS then I recommend you to watch this video from Carla Sabotta to learn visually how to setup a SSIS Lookup Transformation
Optimising SSIS Lookups
So far I have explained what a lookup transformation is and I have provided a simple T-SQL example to illustrate the concept. In this section I move ahead and provide some important optimisation tips for this transformation since lookups are particularly prone to performance issues.
1. Caching the lookup input dataset (warming cache lookups)
Since SQL Server 2008, SSIS Lookup Transformations have available two data source connection types: the OLE DB Connection Manager and the Cache Connection Manager (CCM). The former is used to specify a database as the data source for the datasets, the latter is used to specify a file or memory (cached data) as the data source for the datasets. In simple words, the OLE DB option connects to a database to fetch the data and the cache connection option connects to a local file or uses local memory to fetch the data. Note that when using the CCM connection type the cached dataset can be reused several times even outside the current package and the CCM relies on the Cache Transform Transformation (CCT) to cache the dataset (CCM consumes and CCT creates). Overall, where there is enough memory available to handle your input dataset the cache connection option provides the best performance.
2. Caching the lookup reference dataset
The key to improve the performance of a lookup transformation is to minimise the I/O time required for every lookup operation; that is, to reduce the time taken to seek every value from the reference dataset. As it happens, in the computing world I/O operations in RAM are normally several times faster when compared to hard-drive operations and consequently any lookup transformation that maximises the use of RAM will have better performance.
Luckly, the SSIS Lookup Transformation has been developed and progressively improved with this notion in mind and the Microsoft SSIS team has made available some options to cache the reference dataset in RAM and consequently easily improve the overall performance of the transformation.
Overall, the SSIS Lookup Transformation offers the following cache modes:
- Full Cache: before performing the lookup transformation, the reference dataset is loaded completely into memory. From an I/O perspective this mode offers the best performance; however, you need to consider that some extra time will be required to copy the whole dataset into memory and the complete referece dataset may not fit into the available memory.
- Partial Cache: the reference dataset is copied into memory on-demand; that is, the reference data of every lookup operation is initially fetched as per normal from the data source (e.g. through the network from a database server) and then copied into memory so next time the same lookup operation uses reference data from memory.
- No Cache: no reference data is cached hence all lookup operations retrieve the reference data directly from the data source. This mode normally does not perform very well and it is only recommended with one-time lookup transformations that are to big to be loaded into memory and perhaps there is not other better alternative like using TSQL to perform the same operation.
More information about the different cache modes can be found in this MSDN post.
3. Determining the cache size
The problem with the notion of caching input or reference datasets is that they may not fit into the available RAM or they may consume to much of the available RAM. That is, the performance gain is going to be limited by the amount of data that can be safely stored in the actual RAM (this is a quite common trade-of in the optimisation world).
Fortunately, the SSIS team has provided some simple guidelines to estimate the size of the cached data. This information is valuable to determine properly which cache mode can be used when using the lookup transformation and avoid wasting precious developing time. For example, there is no point on trying to cache a 100GB dataset on a computer with 32GB of RAM.
The required cache size can be estimated using the following formulae:
- Cache size when using Full Cache mode (bytes): RowSize+ 20 + (4 * NumberColumns)
- Cache size when using Partial Cache model (bytes): RowSize + 36 + (4 * NumberColumnsReferenceDataset)
- RowSize is the row size of the reference dataset.
- The numbers 20, 36, and 4 are constants.
- NumberColumns is the sum of the number of columns in the input and reference dataset.
- NumberColumnsReferenceDataset is, well, the number of columns in the reference dataset.
More details about the lookup cache size can be found in this MSDN post.
4. Minimising the size of the input dataset (the source table)
In the SSIS data source always select manually the columns required by the input dataset to avoid adding any unnecessary columns to the SSIS pipeline, this can be done through T-SQL or the visual column selector of the data source.
When possible cache your datasets, the cache is your friend. This is all I have about the SSIS Lookup Transformation for now. Until the next one!