DynamoDB Scan - What It Is & Why You Should (Almost) Never Use It
Written by Rafal Wilinski
Published on 2020-05-13
What Is DynamoDB Scan?
Scan is one of the three ways of getting data out of DynamoDB (alongside Get and Query), and it is the most brutal one because it grabs everything. A Scan operation reads through the whole table, returning a collection of items and their attributes. The end result of a Scan can be narrowed down using FilterExpressions.
To run a Scan operation from the AWS CLI, use the aws dynamodb scan command and pass the name of the table with the --table-name option.
Should I use DynamoDB Scans?
Generally speaking, no. Scans are expensive, slow, and against best practices. To fetch one item by key, use the Get operation; to fetch a collection of items, use Query. Learn more about the differences between Scan and Query.
When To Use DynamoDB Scan
Sometimes, though, scans are unavoidable. Just use them sparingly and with full knowledge of the consequences. Here are use cases where scans might make sense:
- Getting all the items from the table because you want to remove or migrate them
- If your table is really small (< 10 MB) and it takes only a few calls to scan through all of it
- Scanning a secondary index, especially a sparse index. In this case, the scan returns only the items that have the attribute indexed by the selected index, e.g. deleted entities.
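As a rough illustration of the sparse-index case, the sketch below scans a secondary index instead of the base table. It assumes the client argument is a Boto3 low-level DynamoDB client (for example boto3.client("dynamodb")); the helper name, table name, and index name are hypothetical.

```python
def scan_sparse_index(client, table_name, index_name):
    """Return the items from one page of a Scan over a secondary index.

    Only items that carry the index's key attribute(s) are present in the
    index, so a sparse index usually holds far fewer items than the table.
    """
    response = client.scan(TableName=table_name, IndexName=index_name)
    return response["Items"]


# Usage (assumes boto3 is installed and AWS credentials are configured;
# "Entities" and "DeletedAtIndex" are made-up names):
#   import boto3
#   deleted = scan_sparse_index(boto3.client("dynamodb"), "Entities", "DeletedAtIndex")
```

This reads a single page only; larger indexes need the same pagination handling as a table scan.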
Restricting usage of DynamoDB Scans using AWS IAM
If you don't want to rely on "good will" and want to codify the best practices instead, you can block Scans entirely for all users in your organization or AWS account with an IAM policy that denies the dynamodb:Scan action.
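A policy of that kind could look like the following sketch, built here as a Python dictionary and printed as JSON so it can be pasted into IAM. The Sid value and the wildcard Resource are assumptions made for illustration; narrow the resource to specific table ARNs if you only want to protect some tables.

```python
import json

# Deny the dynamodb:Scan action everywhere this policy is attached.
# "Resource": "*" is an assumption; replace it with table ARNs as needed.
deny_scan_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyDynamoDBScan",  # hypothetical statement ID
            "Effect": "Deny",
            "Action": "dynamodb:Scan",
            "Resource": "*",
        }
    ],
}

policy_json = json.dumps(deny_scan_policy, indent=2)
print(policy_json)
```

Because an explicit Deny overrides any Allow in IAM, attaching this policy blocks Scan even for users whose other policies grant broad DynamoDB access.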
DynamoDB Scan Examples
If, after reading the above, you feel that a Scan still makes sense for your use case, we've got you covered. Here are different methods and Scan code snippets you can copy and paste.
- Create DynamoDB Scans (and Queries) Visually
- DynamoDB Scan in Node.js
- DynamoDB Scan in Python (using Boto3)
- DynamoDB Scan using AWS CLI
Similar to the Query operation, a single Scan call can return up to 1 MB of data. If the table contains more records than a single Scan can return, the API returns a LastEvaluatedKey value, which marks where the next Scan operation should start. Pass the returned value as the ExclusiveStartKey parameter of the subsequent call.
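The LastEvaluatedKey / ExclusiveStartKey handshake can be sketched as a simple loop. This is a minimal example that assumes the client argument is a Boto3 low-level DynamoDB client; the function name is hypothetical.

```python
def scan_all_items(client, table_name):
    """Collect every item in a table by following LastEvaluatedKey."""
    items = []
    kwargs = {"TableName": table_name}
    while True:
        response = client.scan(**kwargs)
        items.extend(response.get("Items", []))
        last_key = response.get("LastEvaluatedKey")
        if last_key is None:
            # No LastEvaluatedKey means the last page has been read.
            break
        # Resume the next call right after the last evaluated item.
        kwargs["ExclusiveStartKey"] = last_key
    return items


# Usage (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   everything = scan_all_items(boto3.client("dynamodb"), "MyTable")
```

Boto3 also ships a built-in paginator, client.get_paginator("scan"), which wraps the same loop. Either way, holding a whole table in memory is only sensible for small tables.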
DynamoDB Scan Filtering
To minimize the amount of data returned from the database to the application, you can use FilterExpressions. Filter expressions let you select and return only a subset of the scan results. This behavior is very similar to SQL's WHERE clause, and because it is so easy to use, it might be tempting to replace complicated Query syntax with it. However, you shouldn't do that!
FilterExpressions do not affect the actual read from the database! Filters are applied after the data has been read and before it is returned to the user. This means that the load on the database and the cost of executing the Scan stay the same (because you are billed on the "scanned space"); the only things that change are the result returned to the client and the size of the payload sent between AWS and you.
This behavior can be observed in the payload returned by the Scan operation. With a FilterExpression filtering out a subset of your original data, you can see that ScannedCount and Count are different. While ScannedCount tells us the number of items evaluated before the filter was applied, Count equals the length of the Items list, that is, the number of items left after applying the filter expression.
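To see the difference between Count and ScannedCount in practice, a filtered Scan could be issued roughly like this. The sketch assumes a Boto3 low-level DynamoDB client and a hypothetical string attribute named status; the function name is also made up for illustration.

```python
def scan_with_filter(client, table_name, status):
    """Run one filtered Scan page and return (items, count, scanned_count)."""
    response = client.scan(
        TableName=table_name,
        FilterExpression="#s = :status",
        # A name placeholder avoids clashes with DynamoDB reserved words.
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={":status": {"S": status}},
    )
    # Count: items that passed the filter. ScannedCount: items that were read.
    return response["Items"], response["Count"], response["ScannedCount"]


# Usage (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   items, count, scanned = scan_with_filter(boto3.client("dynamodb"), "Orders", "SHIPPED")
#   print(f"returned {count} of {scanned} scanned items")
```

If scanned is much larger than count, most of the read capacity was spent on items that were thrown away, which is usually a sign that a Query against a suitable key or index would serve better.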
How fast is DynamoDB scan?
DynamoDB Scan is not a fast operation. Because it goes through the whole table to look for the data, it has O(n) computational complexity. If you need to fetch data fast, use Query or Get operations instead.
What is the DynamoDB scan cost?
DynamoDB Scan cost depends on the amount of data it scans, not the amount of data it returns. Even if you narrow down the results returned by the API using FilterExpressions, you'll be billed for the amount of data it went through to find the relevant results.
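For a back-of-the-envelope feel for what that means, the sketch below estimates the read capacity a Scan consumes from the number of bytes it reads. It assumes the standard DynamoDB accounting of one read unit per 4 KB for strongly consistent reads and half of that for eventually consistent reads (the default for Scan). The function name is hypothetical, and real consumption is rounded per request, so treat the result as an approximation.

```python
import math


def estimate_scan_read_units(scanned_bytes, strongly_consistent=False):
    """Rough estimate of read units consumed by scanning `scanned_bytes`.

    Assumes 4 KB per read unit for strongly consistent reads and half the
    cost for eventually consistent reads. Filters do not reduce this number.
    """
    units = math.ceil(scanned_bytes / 4096)
    return units if strongly_consistent else units / 2


one_mb_page = 1024 * 1024
print(estimate_scan_read_units(one_mb_page))        # 128.0
print(estimate_scan_read_units(one_mb_page, True))  # 256
```

Under these assumptions, scanning a 10 GB table with eventually consistent reads works out to roughly 1.3 million read units, regardless of how few items the filter lets through.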
Parallel Scan in DynamoDB
Scans are, generally speaking, slow. To speed the process up, you can use a feature called "Parallel Scan", which divides the whole DynamoDB table into Segments. A separate thread or worker then processes each Segment, so N workers can work simultaneously to go through the whole keyspace faster.
Creating a Parallel Scan is quite easy. Each of your workers, when issuing a Scan request, should include two additional parameters:
- Segment: the zero-based index of the segment to be scanned by a particular worker
- TotalSegments: the total number of Segments/Workers/Threads
But be careful with Parallel Scans: they can drain your provisioned read capacity pretty quickly, incurring high costs and degrading the performance of your table.
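A Parallel Scan with one thread per segment might be sketched as follows. This assumes a Boto3 low-level DynamoDB client shared between threads; the function names and the default of four segments are arbitrary choices made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_segment(client, table_name, segment, total_segments):
    """Scan a single segment to completion, following pagination."""
    items = []
    kwargs = {
        "TableName": table_name,
        "Segment": segment,                # zero-based index of this worker
        "TotalSegments": total_segments,   # same value for every worker
    }
    while True:
        response = client.scan(**kwargs)
        items.extend(response.get("Items", []))
        last_key = response.get("LastEvaluatedKey")
        if last_key is None:
            return items
        kwargs["ExclusiveStartKey"] = last_key


def parallel_scan(client, table_name, total_segments=4):
    """Scan all segments concurrently, one worker thread per segment."""
    with ThreadPoolExecutor(max_workers=total_segments) as pool:
        futures = [
            pool.submit(scan_segment, client, table_name, seg, total_segments)
            for seg in range(total_segments)
        ]
        # Results are gathered in segment order once each worker finishes.
        return [item for future in futures for item in future.result()]


# Usage (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   items = parallel_scan(boto3.client("dynamodb"), "MyTable", total_segments=8)
```

Each worker keeps its own pagination state, since LastEvaluatedKey is tracked per segment. Start with a small number of segments and watch consumed read capacity before raising it.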
© 2021 Dynobase