Optimizing AWS Glue Iceberg Table v3 Spec
As of 2025.10.28, AWS Glue Iceberg tables support up to format-version=2, but do not support format-version=3.
That said, you can still store and use data as Iceberg format-version=3.
There are limitations, such as Athena being unable to query the table, but you still get the benefits of the v3 table spec.
If you manage AWS Glue tables in Iceberg v3 format and reference the data from SaaS offerings such as Databricks, you can expect a significant performance improvement.
However, the AWS Glue table optimization feature errors out when format-version=3.
Therefore, you need to perform table optimization using Spark SQL or similar in a Glue Job.
Here is an example.
Example Glue Job Python Script
from pyspark.context import SparkContext
The Glue Job Python script performs the following, which are equivalent to the Glue managed optimization feature.
- Compaction
- Snapshot retention
- Orphan file deletion
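As a rough illustration of the three operations listed above, the following sketch builds the corresponding Iceberg Spark procedure calls (rewrite_data_files, expire_snapshots, remove_orphan_files) and runs them through spark.sql. It is not the original script. The catalog name glue_catalog, the table name my_db.my_table, and the retention values are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical names for illustration only; replace with your own.
CATALOG = "glue_catalog"
TABLE = "my_db.my_table"


def _ts(dt):
    """Format an aware datetime as a Spark SQL TIMESTAMP literal in UTC.

    Assumes the Spark session time zone is UTC.
    """
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


def build_maintenance_sql(catalog, table, now,
                          retention_days=5, retain_last=1, orphan_days=3):
    """Return the three maintenance statements keyed by step name.

    The retention values are illustrative, not recommendations.
    """
    expire_before = _ts(now - timedelta(days=retention_days))
    orphan_before = _ts(now - timedelta(days=orphan_days))
    return {
        # Compaction: rewrite small data files into larger ones.
        "compaction": (
            f"CALL {catalog}.system.rewrite_data_files(table => '{table}')"
        ),
        # Snapshot retention: drop snapshots older than the cutoff.
        "snapshot_retention": (
            f"CALL {catalog}.system.expire_snapshots("
            f"table => '{table}', "
            f"older_than => TIMESTAMP '{expire_before}', "
            f"retain_last => {retain_last})"
        ),
        # Orphan file deletion: remove files no snapshot references.
        "orphan_file_deletion": (
            f"CALL {catalog}.system.remove_orphan_files("
            f"table => '{table}', "
            f"older_than => TIMESTAMP '{orphan_before}')"
        ),
    }


def run_maintenance(spark, statements):
    """Run each statement in order; return the result DataFrames by step."""
    return {step: spark.sql(sql) for step, sql in statements.items()}


def main():
    """Entry point for a Glue job (pyspark is available there)."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    statements = build_maintenance_sql(CATALOG, TABLE, datetime.now(timezone.utc))
    return run_maintenance(spark, statements)
```

In a Glue job you would call main() at the end of the script. Keeping the SQL construction separate from execution makes it easy to review the generated statements before they touch a table.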
Glue Job Configuration
To use the Iceberg table v3 spec, you need iceberg-spark version 1.10.0 or higher.
- Upload iceberg-spark-runtime-3.5_2.12-1.10.0.jar to S3.
- Specify the S3 URI of the uploaded file with --extra-jars in the Job parameters.
- Add --conf spark.jars.packages=org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.0 to the Spark Conf setting in the Job parameters.
  - If you specify the Spark Conf settings within your script, add it there instead.
- Add --user-jars-first set to true in the Job parameters.
Reference: Using a custom Iceberg version
With this, you can use the Iceberg table v3 spec in a Glue Job.
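The job parameters described above could be assembled as a dictionary like the one below, for example to pass as DefaultArguments when creating the job with boto3 or an infrastructure-as-code tool. This is a minimal sketch. The function name build_job_arguments and the S3 path in the usage note are hypothetical.

```python
ICEBERG_VERSION = "1.10.0"
RUNTIME_ARTIFACT = "iceberg-spark-runtime-3.5_2.12"


def build_job_arguments(jar_s3_uri, iceberg_version=ICEBERG_VERSION):
    """Return Glue job parameters that load a custom Iceberg runtime.

    jar_s3_uri is the S3 URI of the uploaded iceberg-spark-runtime jar.
    """
    package = f"org.apache.iceberg:{RUNTIME_ARTIFACT}:{iceberg_version}"
    return {
        # Put the uploaded jar on the job's classpath.
        "--extra-jars": jar_s3_uri,
        # Spark Conf setting from the steps above.
        "--conf": f"spark.jars.packages={package}",
        # Prefer the user-supplied jar over the version bundled with Glue.
        "--user-jars-first": "true",
    }
```

For example, build_job_arguments("s3://my-bucket/jars/iceberg-spark-runtime-3.5_2.12-1.10.0.jar") returns the three parameters with the jar location filled in. The bucket and prefix here are placeholders.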
Verifying That Optimization Was Performed
# Rewrite Data Files (compaction)
Since the return value of each optimization Spark SQL call is a DataFrame, displaying its contents shows which files were deleted, and so on.
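One way to display those results is sketched below. It assumes you have collected the DataFrames returned by spark.sql into a dictionary keyed by a step name of your choosing. The helper names summarize_result and log_results are hypothetical. Procedure results are small (typically one row of counters, or one row per orphan file), so collecting them to the driver is usually acceptable.

```python
def summarize_result(df):
    """Collect a small procedure result DataFrame into a list of dicts."""
    return [row.asDict() for row in df.collect()]


def log_results(results):
    """Print every result row per step and return the row count per step.

    results maps a step name (your own label) to the DataFrame that
    spark.sql returned for that step's CALL statement.
    """
    counts = {}
    for step, df in results.items():
        rows = summarize_result(df)
        for row in rows:
            print(f"[{step}] {row}")
        counts[step] = len(rows)
    return counts
```

Calling df.show(truncate=False) on each DataFrame works as well. Converting rows to dictionaries is mainly useful when you want the values to end up in the job's logs in a greppable form.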
Checking from the S3 Objects Where Metadata Is Stored
You can check from the table metadata stored in S3 (<table>/metadata/*.metadata.json).
Iceberg internally places JSON files in the metadata directory.
metadata/
- Iceberg creates a new metadata.json on each table operation (INSERT, DELETE, COMPACT, etc.)
- The higher the number, the more recent
Since this metadata.json records the table’s state at each point in time, you can refer to it to verify whether optimization was performed.
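To find the most recent metadata.json among the objects in the metadata directory, something like the following could be used. It is a sketch that assumes file names of the form <number>-<uuid>.metadata.json, where the leading number increases with each table operation. The keys would come from an S3 listing of the table's metadata/ prefix, and the function name latest_metadata_key is hypothetical.

```python
import re

# Matches e.g. "tbl/metadata/00010-<uuid>.metadata.json" and captures "00010".
_METADATA_RE = re.compile(r"(?:^|/)(\d+)-[^/]*\.metadata\.json$")


def latest_metadata_key(keys):
    """Return the key with the highest numeric prefix, or None if none match."""
    best_key = None
    best_version = -1
    for key in keys:
        match = _METADATA_RE.search(key)
        if match is None:
            continue  # manifests, manifest lists, and other files
        version = int(match.group(1))
        if version > best_version:
            best_version = version
            best_key = key
    return best_key
```

Comparing the prefix as an integer rather than as a string keeps the ordering correct even if the zero padding ever differs between files.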
Verifying Whether Compaction Was Performed
Below is an example of a metadata.json.
metadata.json
...
The higher the sequence-number, the more recent.
After compaction, the snapshot's operation becomes replace.
- total-data-files went from 20 → 1, so the number of files has been reduced
- total-files-size went from 29693801 → 28252807, which is nearly the same
- deleted-data-files (deleted data files) is 20
- removed-files-size (size of deleted files) is 29693801, the same as the size of the managed files at sequence-number=10
From the above, you can see that the total file size is virtually unchanged, all of the previous data files were removed, and a single new file was generated.
The compaction appears to have been performed without issues.
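The same check can be automated against a parsed metadata.json. The sketch below looks for the most recent snapshot whose summary has operation replace and compares it with the snapshot immediately before it by sequence-number. It assumes a linear snapshot history and relies on the summary values being stored as strings. The function name check_compaction is hypothetical.

```python
def check_compaction(metadata):
    """Compare the latest 'replace' snapshot with the one before it.

    metadata is the parsed content of a metadata.json file.
    Returns None when no compaction snapshot with a predecessor exists.
    """
    snapshots = sorted(
        metadata.get("snapshots", []), key=lambda s: s["sequence-number"]
    )
    # Walk backwards so the most recent compaction is reported.
    for i in range(len(snapshots) - 1, 0, -1):
        after = snapshots[i]["summary"]
        if after.get("operation") != "replace":
            continue
        before = snapshots[i - 1]["summary"]
        files_before = int(before["total-data-files"])
        return {
            "files_before": files_before,
            "files_after": int(after["total-data-files"]),
            "size_before": int(before["total-files-size"]),
            "size_after": int(after["total-files-size"]),
            # True when every previously managed data file was rewritten.
            "all_previous_files_removed": (
                int(after["deleted-data-files"]) == files_before
            ),
        }
    return None
```

With the figures from the example above, this would report 20 files before and 1 after, with all_previous_files_removed set to True. You would load the input with json.load from the metadata.json you downloaded from S3.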
- Databricks supports the DESCRIBE HISTORY query, which lets you check the operation history of a table, so you can verify it from there as well.
That’s all.
I hope you find this helpful.
https://kenzo0107.github.io/en/2025/10/28/aws-glue-iceberg-v3-optimization/
