Latest Posts

How to get row count in Azure Data Factory?


In Azure Data Factory (ADF), obtaining the row count from a data source involves using a combination of activities within a pipeline. The most common method is to use a Lookup activity to execute a query that counts the rows in the dataset.

Create a Dataset

  • Open Azure Data Factory Studio: Go to your Data Factory instance and open the ADF Studio.
  • Note: If we use Lookup activity on top of a data set, we’ll directly get the number of rows as an output of Lookup activity BUT lookup only supports 5000 records, that’s why we have created this example to get the count of the rows even if it is more than 5000 rows.
  • Create a Dataset: In the “Author” tab, create a dataset that points to the data source (such as a table in Azure SQL Database, Azure Blob Storage, etc.) for which you want to count the rows. Here, we are going to store excel file in Azure BLOB storage and consider this file as our data source. We are going to count the rows from this excel file.
  • In the Parameters section, define this parameter- VarStartFinal. You can specify the default value as shown. A2:A3 is the range of the excel file that means in column - A, we are going to consider from row 2 to 3 (This is just a default value so don’t worry).
  • For the same dataset, go to Connection tab and click on Add dynamic content in Range field like this.
  • The expression builder window will open. Click on Parameters >> VarStartFinal (the variable we defined earlier).

Create a Pipeline


  • Still in the “Author” tab, create a new pipeline.
  • Put all these variables in pipeline.
  • Now, put an Until action in the pipeline. For Until action, click on Settings tab and put this dynamic content in Expression.

    @equals(variables('CurrentCount'),0)
  • Basically, we are going to perform all the activities within Until action repetitively, till the condition of Until action is fulfilled.
  • We are going to add four actions within Until action.
      • Lookup: Click on the Settings tab for Lookup activity. Select Source dataset and specify the dataset property as shown.

        @concat('A',variables('StartNumber'),':','A',variables('EndNumber')
      • Set Variable: Now, click on the Set Variable action >> Settings tab. We are setting the value for the variable – CurrentCount. Paste this value in Value field.

        @activity('Lookup3').output.count
      • Set Variable: We are going to set the second variable. We will set the variable – CurrentTotalCount

        @add(activity('Lookup3').output.count,variables('TotalCount'))
      • If Condition: Now, within If condition, we have 3 more actions for set variable but first let’s look at the If condition. Put this for If condition expression.

        @greater(activity('Lookup3').output.count, 0)
      • Now let’s look at the 3 activities within If condition. All these 3 activities are in True statement of the If condition. We are keeping False (Else) part empty.
      • Set Variable: Set the first variable – TotalCount with below formula.

        @variables('CurrentTotalCount')
      • Set Variable: Set the second variable – StartNumber with below formula.
        @string(add(variables('TotalCount'),1))
      • Set Variable: Set the third variable – EndNumber with below formula.
        @string(add(variables('TotalCount'),4999))
      • Configure the Lookup Activity:
      • Link the activity to the dataset you created.
      • In the settings of the Lookup activity, you will need to provide a query that returns the row count. For example, if your dataset is a SQL database, your query will be something like SELECT COUNT(*) AS RowCount FROM YourTableName.
  • After you run the pipeline, when you check the variable – EndNumber value, you will get the total number of rows in the excel file.
  • Debug: Test the pipeline using the Debug feature to make sure it correctly retrieves the row count.
  • Publish: Once the pipeline works as expected, publish your changes.

Additional Tips

  • Error Handling: Implement error handling in the Lookup activity to manage scenarios like empty results or connectivity issues.
  • Performance Considerations: If the dataset is very large, consider the performance impact of running a full count query. In some cases, it may be beneficial to store and update a row count in a separate control table, especially for very large datasets.
  • Incremental Loads: For incremental loading scenarios, consider maintaining a row count as part of your incremental load metadata to avoid having to count the entire dataset each time.

We value your Feedback:

Page URL:

Name:

 

Email:

 
 

Suggestion:

 

© 2024 Code SharePoint