How to Efficiently Process Large CSV Files with Lambda and Step Functions using AWS CDK v2

Jacie

April 18, 2024

Hi, there 👋

We are currently working on a web app built on AWS serverless architecture, and a new feature request came in: we need to periodically fetch a large CSV file from an external API and store its data in our database so the front end can draw visual graphs for users. Because the maximum Lambda timeout is 15 minutes, a single Lambda function struggles to finish processing CSV files with more than 1,000,000 rows.

So we came up with another solution: using a Distributed Map state in AWS Step Functions to iterate over the CSV file in batches of 100 rows and pass each batch to a Lambda function for processing.
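To make the batching idea concrete, here is a minimal sketch of what splitting rows into batches of 100 looks like. It is illustrative only: in the real flow Step Functions does the batching for us, and the helper names `toBatches` and `batchCount` are our own, not part of any AWS API.

```typescript
// Batch size mentioned in the text above.
const BATCH_SIZE = 100;

// Split an array of rows into consecutive batches of at most `size` rows.
function toBatches<T>(rows: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new Error('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let start = 0; start < rows.length; start += size) {
    batches.push(rows.slice(start, start + size));
  }
  return batches;
}

// Number of batches (and therefore Lambda invocations) for a given row count.
function batchCount(totalRows: number, size: number): number {
  return Math.ceil(totalRows / size);
}

// A 1,000,000-row file in batches of 100 rows needs 10,000 invocations,
// each of which only has to handle 100 rows within the Lambda timeout.
console.log(batchCount(1000000, BATCH_SIZE));
```

Each invocation now handles a small, bounded amount of work, so the 15-minute limit applies per batch instead of per file.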

In this blog, we will share how to implement the whole flow using AWS CDK v2. For the overall solution, you can explore more here.

Before starting, we will list out which resources we need to create:

A Step Functions state machine.
Execution role for the state machine. This role grants the permissions that your state machine needs to access other AWS services and resources such as the Lambda function's Invoke action.
A Lambda function to get the CSV file and store it in S3.
Execution role for the CSV generator Lambda function. This role grants the function permission to access other AWS services.
An Amazon S3 bucket to store the CSV file, which then serves as the input for the state machine.
A Lambda function that processes the CSV data and stores it in our database.
Execution role for the process data Lambda function. This role grants the function permission to access other AWS services.

state-machine-design

Step 1: Create the state machine and provision resources

First, we create an S3 bucket named atwarevn-s3-assets and a Lambda function named atwarevn-lambda-get-csv-file that gets a CSV file from an external API and stores it in that bucket. This function returns an object with two attributes, bucketName and objectKey. For example:

{
  bucketName: "atwarevn-s3-assets",
  objectKey: "customers-1000000.csv"
}

Next, we create the second Lambda function, named atwarevn-lambda-process-csv-data, to process the CSV data and store it in our database. At this point we are not yet sure what input this Lambda will receive, so we initialize its handler as a simple "HelloWorld" function.

Then we create the state machine.

This state machine needs an execution role that allows it to invoke the two Lambda functions, read data from the S3 bucket, and start executions of the state machine itself, because a Distributed Map runs its batches as child workflow executions.
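As a rough sketch of what that role's policy might contain, the snippet below builds the IAM policy document as a plain object. The `states:StartExecution`, `states:DescribeExecution`, and `states:StopExecution` statements are there because a Distributed Map runs its batches as child workflow executions. The function name `buildStateMachinePolicy`, the region, and the account ID are placeholders of ours, so adapt them to your own setup rather than treating this as the exact policy we deployed.

```typescript
// Hypothetical inputs: every value passed in below is a placeholder.
interface RoleInputs {
  region: string;
  accountId: string;
  stateMachineName: string;
  bucketName: string;
  lambdaArns: string[];
}

interface PolicyStatement {
  Effect: 'Allow';
  Action: string[];
  Resource: string[];
}

interface PolicyDocument {
  Version: '2012-10-17';
  Statement: PolicyStatement[];
}

function buildStateMachinePolicy(inputs: RoleInputs): PolicyDocument {
  const statesArnPrefix = `arn:aws:states:${inputs.region}:${inputs.accountId}`;
  return {
    Version: '2012-10-17',
    Statement: [
      // Invoke the get-csv-file and process-csv-data functions.
      { Effect: 'Allow', Action: ['lambda:InvokeFunction'], Resource: inputs.lambdaArns },
      // Let the ItemReader read the CSV object from the bucket.
      { Effect: 'Allow', Action: ['s3:GetObject'], Resource: [`arn:aws:s3:::${inputs.bucketName}/*`] },
      // Let the Distributed Map start its child workflow executions.
      {
        Effect: 'Allow',
        Action: ['states:StartExecution'],
        Resource: [`${statesArnPrefix}:stateMachine:${inputs.stateMachineName}`],
      },
      // Let it monitor and stop those child executions.
      {
        Effect: 'Allow',
        Action: ['states:DescribeExecution', 'states:StopExecution'],
        Resource: [`${statesArnPrefix}:execution:${inputs.stateMachineName}:*`],
      },
    ],
  };
}

const policy = buildStateMachinePolicy({
  region: 'ap-southeast-1', // placeholder region
  accountId: '123456789012', // placeholder account ID
  stateMachineName: 'atwarevn-state-machine-process-csv',
  bucketName: 'atwarevn-s3-assets',
  lambdaArns: [
    'arn:aws:lambda:ap-southeast-1:123456789012:function:atwarevn-lambda-get-csv-file',
    'arn:aws:lambda:ap-southeast-1:123456789012:function:atwarevn-lambda-process-csv-data',
  ],
});
console.log(JSON.stringify(policy, null, 2));
```

In CDK, the same statements would typically be attached to the role passed as processCsvStateMachineRole.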

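// Imports this snippet relies on (CDK v2 module paths)
import { CustomState, DefinitionBody, StateMachine } from 'aws-cdk-lib/aws-stepfunctions';
import { LambdaInvoke } from 'aws-cdk-lib/aws-stepfunctions-tasks';

new StateMachine(this, 'process-csv-state-machine', {
  stateMachineName: 'atwarevn-state-machine-process-csv',
  role: processCsvStateMachineRole,
  // definitionBody replaces the deprecated `definition` property
  definitionBody: DefinitionBody.fromChainable(
    // First state: get the CSV file and store it in S3
    new LambdaInvoke(this, 'invoke-get-csv-file', {
      lambdaFunction: getCsvFileLambda.lambdaFunction,
      inputPath: '$',
    }).next(
      // Second state: Distributed Map over the rows of the CSV file
      new CustomState(this, 'process-csv-data', {
        stateJson: {
          Type: 'Map',
          ItemProcessor: {
            ProcessorConfig: {
              Mode: 'DISTRIBUTED',
              ExecutionType: 'STANDARD',
            },
            StartAt: 'Lambda Invoke',
            States: {
              'Lambda Invoke': {
                Type: 'Task',
                Resource: 'arn:aws:states:::lambda:invoke',
                OutputPath: '$.Payload',
                Parameters: {
                  'Payload.$': '$',
                  // stateJson is raw JSON, so pass the ARN string, not the construct
                  FunctionName: processCsvDataLambda.lambdaFunction.functionArn,
                },
                Retry: [
                  {
                    ErrorEquals: [
                      'Lambda.ServiceException',
                      'Lambda.AWSLambdaException',
                      'Lambda.SdkClientException',
                      'Lambda.TooManyRequestsException',
                    ],
                    IntervalSeconds: 1,
                    MaxAttempts: 3,
                    BackoffRate: 2,
                  },
                ],
                End: true,
                InputPath: '$',
              },
            },
          },
          End: true,
          Label: 'Map',
          // Run at most 100 child executions in parallel
          MaxConcurrency: 100,
          // Read the CSV from the location returned by the first Lambda
          ItemReader: {
            Resource: 'arn:aws:states:::s3:getObject',
            ReaderConfig: {
              InputType: 'CSV',
              CSVHeaderLocation: 'FIRST_ROW',
            },
            Parameters: {
              'Bucket.$': '$.Payload.bucketName',
              'Key.$': '$.Payload.objectKey',
            },
          },
          // Group rows into batches of up to 100 items per Lambda invocation
          ItemBatcher: {
            MaxItemsPerBatch: 100,
          },
          // The Map state fails once more than 30% of items have failed
          ToleratedFailurePercentage: 30,
        },
      }),
    ),
  ),
});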

To get this definition, we designed the state machine in the AWS Console, then copied the generated JSON definition into the stateJson object of the CustomState.
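To picture what the ItemReader settings InputType 'CSV' and CSVHeaderLocation 'FIRST_ROW' mean, the sketch below imitates the idea locally: the first row supplies the keys, and every following row becomes one item. This is our own simplified stand-in (the function `readCsvItems` is hypothetical and ignores quoted fields), not how Step Functions is implemented.

```typescript
// One item per CSV data row, keyed by the header names; all values are strings.
type CsvItem = { [column: string]: string };

// Turn CSV text into items, using the first row as the header.
// Simplification: splits on commas and newlines only, with no support for quoted fields.
function readCsvItems(csvText: string): CsvItem[] {
  const lines = csvText.split(/\r?\n/).filter((line) => line.trim() !== '');
  const headerLine = lines[0];
  if (headerLine === undefined) {
    return [];
  }
  const headers = headerLine.split(',');
  return lines.slice(1).map((line) => {
    const cells = line.split(',');
    const item: CsvItem = {};
    headers.forEach((header, index) => {
      // Missing trailing cells become empty strings in this sketch.
      item[header] = cells[index] ?? '';
    });
    return item;
  });
}

const sampleItems = readCsvItems('id,name\n1,Ann\n2,Bob\n');
console.log(sampleItems);
```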

Everything is ready now! Let's deploy and test it!

Step 2: Start execution and test

Go to the AWS Console and start an execution of your state machine.

After the execution succeeds, you can see that the input was divided into many batches, each containing up to 100 items. The input for the process CSV Lambda function therefore looks like the image below. Now we can continue to implement the logic for the Lambda function handler.

csv data input
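With ItemBatcher configured, each invocation receives an object with an Items array, where every item is one CSV row keyed by the header names. A minimal handler sketch for that shape could look like the following. The column names 'Customer Id' and 'Email' and the `saveRows` function are hypothetical stand-ins, since the real columns and database logic depend on your data.

```typescript
// One CSV row keyed by header name; values arrive as strings.
type CsvRow = { [column: string]: string };

// Payload passed to the Lambda for each batch.
interface BatchEvent {
  Items: CsvRow[];
}

// Separate rows that have all required columns filled in from rows that do not.
function partitionRows(
  items: CsvRow[],
  requiredColumns: string[],
): { valid: CsvRow[]; invalid: CsvRow[] } {
  const valid: CsvRow[] = [];
  const invalid: CsvRow[] = [];
  for (const row of items) {
    const complete = requiredColumns.every((column) => (row[column] ?? '').trim() !== '');
    (complete ? valid : invalid).push(row);
  }
  return { valid, invalid };
}

// Hypothetical persistence step: replace with real database writes.
async function saveRows(rows: CsvRow[]): Promise<number> {
  return rows.length;
}

// Lambda entry point: validate the batch, store the valid rows, report the counts.
export const handler = async (event: BatchEvent): Promise<{ saved: number; skipped: number }> => {
  // Assumed column names, for illustration only.
  const { valid, invalid } = partitionRows(event.Items, ['Customer Id', 'Email']);
  const saved = await saveRows(valid);
  return { saved, skipped: invalid.length };
};
```

Returning counts instead of throwing on a bad row keeps one malformed line from failing the whole batch. Whether that is the right trade-off depends on how strictly you want ToleratedFailurePercentage to apply.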

Conclusion

That is our solution for processing large CSV files using AWS serverless architecture, along with some AWS CDK v2 code that we hope will help in your case. Because we need to run this processing periodically, we will add an EventBridge rule to trigger the state machine. If you want to discuss further, feel free to contact us via contact@atware.asia. Thank you for reading 😊