DynamoDB Scan - What It Is & Why You Should (Almost) Never Use It
Written by Rafal Wilinski
Published on May 13th, 2020
What Is DynamoDB Scan?
Scan is one of the three ways of getting data out of DynamoDB (alongside Get and Query), and it is the most comprehensive one because it reads everything. A Scan operation "scans" through the whole table, returning a collection of items and their attributes. The result of a Scan can be narrowed down using a FilterExpression.
To run a Scan operation using the AWS CLI, use the aws dynamodb scan command and pass the table name via the --table-name parameter.
Should I use DynamoDB Scans?
Generally speaking, no. Scans are expensive, slow, and against best practices. To fetch one item by key, use the Get operation; to fetch a collection of items, use Query. Learn more about the differences between Scan and Query.
When To Use DynamoDB Scan
Sometimes, though, a scan is unavoidable. In that case, use it sparingly and with full knowledge of the consequences. Here are use cases where scans might make sense:
- Getting all the items from the table because you want to remove or migrate them.
- Reading a really small table (< 10 MB), where scanning through all of it takes only a few calls.
- Scanning a secondary index, especially a sparse one. In this case, the scan returns only items that have the attribute indexed by the selected index, e.g., deleted entities.
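The sparse index case from the list above can be sketched as follows. This is a minimal illustration, not the author's code: the function name `scan_index` and the table and index names in the usage note are hypothetical, and `client` is assumed to be a low-level Boto3 DynamoDB client (`boto3.client("dynamodb")`).

```python
def scan_index(client, table_name, index_name):
    """Collect every item present in a secondary index by scanning it.

    `client` is assumed to behave like boto3.client("dynamodb").
    Only items that carry the indexed attribute exist in a sparse index,
    so this reads far less data than scanning the base table.
    """
    items = []
    kwargs = {"TableName": table_name, "IndexName": index_name}
    while True:
        page = client.scan(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            # No LastEvaluatedKey means the whole index has been read.
            return items
        kwargs["ExclusiveStartKey"] = last_key
```

A call could look like `scan_index(boto3.client("dynamodb"), "Entities", "DeletedEntitiesIndex")`, where both names are placeholders for your own table and index.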
Restricting usage of DynamoDB Scans using AWS IAM
If you don't want to rely on "good will" and want to codify the best practices instead, you can restrict the usage of Scans for all users in your organization/AWS account using the following IAM policy:
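The original policy did not survive in this copy of the article, so what follows is a reconstruction under assumptions rather than the author's exact document. It builds a policy with a single Deny statement for the `dynamodb:Scan` action and prints it as JSON. The function name `build_deny_scan_policy` and the `Sid` value are hypothetical.

```python
import json


def build_deny_scan_policy(resource_arn="*"):
    """Build an IAM policy document that denies the DynamoDB Scan action.

    The default resource "*" covers every table and index in the account.
    If you narrow it to one table ARN, scans of that table's indexes are
    addressed by a separate ARN (table/<name>/index/<index>), so list
    those ARNs too if you want them covered.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyDynamoDBScan",  # hypothetical statement ID
                "Effect": "Deny",
                "Action": "dynamodb:Scan",
                "Resource": resource_arn,
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_deny_scan_policy(), indent=2))
```

Because an explicit Deny overrides any Allow, attaching a policy like this to a user, group, or role should block Scan even when broader DynamoDB permissions are granted elsewhere.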
DynamoDB Scan Examples
After reading the above, if you feel that a Scan still makes sense for your use case, we've got you covered. Here are different methods and Scan code snippets you can copy and paste:
- Create DynamoDB Scans (and Queries) Visually
- DynamoDB Scan in Node.js
- DynamoDB Scan in Python (using Boto3)
- DynamoDB Scan using AWS CLI
DynamoDB Pagination
Similar to the Query operation, a single Scan call can return up to 1 MB of data. If the table contains more data than fits in one response, the API returns a LastEvaluatedKey value, which marks where the next Scan operation should start. Pass that value as the ExclusiveStartKey parameter of the subsequent call, and repeat until the response no longer contains a LastEvaluatedKey.
Learn more about Pagination in DynamoDB.
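The LastEvaluatedKey / ExclusiveStartKey loop described above can be sketched as a small generator. This is a minimal sketch: `scan_all_items` is a hypothetical helper name, and `client` is assumed to be a low-level Boto3 DynamoDB client.

```python
def scan_all_items(client, table_name, **scan_kwargs):
    """Yield every item of a table, following Scan pagination.

    `client` is assumed to behave like boto3.client("dynamodb").
    Extra keyword arguments are passed through to each Scan call.
    """
    kwargs = {"TableName": table_name, **scan_kwargs}
    while True:
        page = client.scan(**kwargs)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            # The final page carries no LastEvaluatedKey.
            break
        # Resume the next call right after the last item evaluated.
        kwargs["ExclusiveStartKey"] = last_key
```

Since this is a generator, items are fetched one page at a time as you iterate, so a large table is never held in memory all at once. Boto3 also ships a built-in paginator (`client.get_paginator("scan")`) that wraps the same loop.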
DynamoDB Scan Filtering
In order to minimize the amount of data returned from the database to the application, you can use a FilterExpression. Filter expressions allow you to select and return only a subset of scan results. This behavior is very similar to SQL's WHERE clause, and due to its simplicity, it might be tempting to replace complicated Query syntax with it. However, you shouldn't do that!
A FilterExpression does not affect the actual act of reading from the database! Filters are applied after the data is read from the database and before it is returned to the user. This means that the load on the database and the cost of executing the scan stay the same (because we're billed on the "scanned space"). The only things that change are the result returned to the client and the size of the payload sent between you and AWS.
This behavior can be observed in the payload returned by the Scan operation. When a FilterExpression filters out part of your data, the ScannedCount and Count values differ. ScannedCount is the number of items evaluated before the filter was applied, while Count equals the length of the Items list, i.e., the number of items remaining after the filter expression was applied.
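To see the difference between Count and ScannedCount in practice, a filtered Scan might look like the sketch below. The function name `scan_with_filter` and the `status` attribute are hypothetical, and `client` is assumed to be a low-level Boto3 DynamoDB client. Only the first page of results is inspected here.

```python
def scan_with_filter(client, table_name, status):
    """Run one filtered Scan call and return (items, count, scanned_count).

    `client` is assumed to behave like boto3.client("dynamodb").
    "status" is a DynamoDB reserved word, hence the "#s" name placeholder.
    """
    response = client.scan(
        TableName=table_name,
        FilterExpression="#s = :status",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={":status": {"S": status}},
    )
    # Count: items left after filtering. ScannedCount: items read (and billed).
    return response["Items"], response["Count"], response["ScannedCount"]
```

If the returned count is much smaller than the scanned count, most of the read capacity was spent on items that were thrown away, which is usually a sign that a Query or a dedicated index would serve the access pattern better.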
How fast is DynamoDB scan?
DynamoDB Scan is not a fast operation. Because it goes through the whole table to look for the data, it has O(n) complexity, where n is the number of items in the table. If you need to fetch data fast, use Query or Get operations instead.
What is the DynamoDB scan cost?
DynamoDB Scan cost depends on the amount of data it scans, not the amount of data it returns. Even if you narrow down the results returned by the API using a FilterExpression, you'll be billed for the amount of data it went through to find the relevant results.
The exact cost of the operation depends on the table's Capacity Mode; you can estimate it using our free pricing calculator.
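As a rough back-of-the-envelope aid, the read capacity consumed by a scan can be approximated from the number of bytes scanned. The sketch below assumes the standard DynamoDB read sizing (one read unit per 4 KB for a strongly consistent read, half of that for an eventually consistent one, which is the Scan default). The function name `estimate_scan_rcus` is hypothetical, and the result is an estimate only, since actual rounding happens per request.

```python
import math


def estimate_scan_rcus(scanned_bytes, strongly_consistent=False):
    """Approximate the read capacity units consumed by scanning `scanned_bytes`.

    Assumes 4 KB read units: 1 unit each for strongly consistent reads,
    0.5 unit each for eventually consistent reads (the Scan default).
    """
    units_4kb = math.ceil(scanned_bytes / 4096)
    return units_4kb * (1.0 if strongly_consistent else 0.5)


if __name__ == "__main__":
    one_mb = 1024 * 1024
    print(estimate_scan_rcus(one_mb))        # one full 1 MB page, eventually consistent
    print(estimate_scan_rcus(one_mb, True))  # the same page, strongly consistent
```

Under these assumptions, a full 1 MB page costs 128 read units with eventual consistency and 256 with strong consistency, regardless of how many items the filter lets through.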
Parallel Scan in DynamoDB
Scans are generally slow. To make that process faster, you can use a feature called "Parallel Scans", which divides the whole DynamoDB table into Segments. A separate thread/worker then processes each Segment, so N workers can work simultaneously to go through the whole keyspace faster.
Creating a Parallel Scan is quite easy. Each of your workers, when issuing a Scan request, should include two additional parameters:
- Segment - Zero-based index of the segment to be scanned by a particular worker.
- TotalSegments - Total number of segments the table is divided into, typically equal to the number of workers/threads.
Be careful with Parallel Scans, though: they can drain your provisioned read capacity pretty quickly, incurring high costs and degrading the performance of your table.
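A Parallel Scan along these lines could be sketched with a thread pool, one worker per segment. The names `scan_segment` and `parallel_scan` are hypothetical, `client` is assumed to be a low-level Boto3 DynamoDB client shared between threads, and the default of four segments is an arbitrary choice for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_segment(client, table_name, segment, total_segments):
    """Scan one segment to completion, following pagination within it."""
    items = []
    kwargs = {
        "TableName": table_name,
        "Segment": segment,               # zero-based index of this worker's segment
        "TotalSegments": total_segments,  # same value for every worker
    }
    while True:
        page = client.scan(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            return items
        kwargs["ExclusiveStartKey"] = last_key


def parallel_scan(client, table_name, total_segments=4):
    """Scan all segments concurrently and return the combined items."""
    with ThreadPoolExecutor(max_workers=total_segments) as pool:
        futures = [
            pool.submit(scan_segment, client, table_name, segment, total_segments)
            for segment in range(total_segments)
        ]
        return [item for future in futures for item in future.result()]
```

Each segment keeps its own LastEvaluatedKey, so pagination is handled independently per worker. Raising `total_segments` speeds up the scan only as long as the table has read capacity to spare.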
Best Practices for Using DynamoDB Scan
While scans should be used sparingly, there are some best practices to follow when you do need to use them. First, use a FilterExpression to shrink the payload returned to your application, keeping in mind that it reduces neither the data read nor the cost; to actually read less, scan a sparse secondary index instead of the whole table. Second, consider using Parallel Scans to speed up the process, but be mindful of the read capacity units consumed. Third, monitor your scan operations using AWS CloudWatch to keep an eye on performance and cost. Lastly, always evaluate whether a Query operation can achieve the same result more efficiently before opting for a scan.