Redshift COPY creates different compression encodings from ANALYZE

amazon-redshift amazon-s3 data-compression sql sqlbulkcopy

Question

I've noticed that AWS Redshift recommends different column compression encodings from the ones that it automatically creates when loading data (via COPY) to an empty table.

For example, I have created a table and loaded data from S3 as follows:

CREATE TABLE Client (Id varchar(511), ClientId integer, CreatedOn timestamp,
UpdatedOn timestamp, DeletedOn timestamp, LockVersion integer,
RegionId varchar(511), OfficeId varchar(511), CountryId varchar(511),
FirstContactDate timestamp, DidExistPreCirts boolean, IsActive boolean,
StatusReason integer, CreatedById varchar(511), IsLocked boolean,
LockType integer, KeyWorker varchar(511), InactiveDate timestamp,
Current_Flag varchar(511));

Table Client created
Execution time: 0.3s

copy Client from 's3://<bucket-name>/<folder>/Client.csv'
credentials 'aws_access_key_id=<access key>;aws_secret_access_key=<secret>'
csv fillrecord truncatecolumns ignoreheader 1
timeformat as 'YYYY-MM-DDTHH:MI:SS'
gzip acceptinvchars compupdate on region 'ap-southeast-2';

Warnings:
Load into table 'client' completed, 24284 record(s) loaded successfully.
Load into table 'client' completed, 6 record(s) were loaded with replacements made for ACCEPTINVCHARS. Check 'stl_replacements' system table for details.

0 rows affected
COPY executed successfully

Execution time: 3.39s

Having done this I can look at the column compression encodings that have been applied by COPY:

select "column", type, encoding, distkey, sortkey, "notnull" 
from pg_table_def where tablename = 'client';

Giving:

╔══════════════════╦═════════════════════════════╦══════════╦═════════╦═════════╦═════════╗
║ column           ║ type                        ║ encoding ║ distkey ║ sortkey ║ notnull ║
╠══════════════════╬═════════════════════════════╬══════════╬═════════╬═════════╬═════════╣
║ id               ║ character varying(511)      ║ lzo      ║ false   ║ 0       ║ false   ║
║ clientid         ║ integer                     ║ delta    ║ false   ║ 0       ║ false   ║
║ createdon        ║ timestamp without time zone ║ lzo      ║ false   ║ 0       ║ false   ║
║ updatedon        ║ timestamp without time zone ║ lzo      ║ false   ║ 0       ║ false   ║
║ deletedon        ║ timestamp without time zone ║ none     ║ false   ║ 0       ║ false   ║
║ lockversion      ║ integer                     ║ delta    ║ false   ║ 0       ║ false   ║
║ regionid         ║ character varying(511)      ║ lzo      ║ false   ║ 0       ║ false   ║
║ officeid         ║ character varying(511)      ║ lzo      ║ false   ║ 0       ║ false   ║
║ countryid        ║ character varying(511)      ║ lzo      ║ false   ║ 0       ║ false   ║
║ firstcontactdate ║ timestamp without time zone ║ lzo      ║ false   ║ 0       ║ false   ║
║ didexistprecirts ║ boolean                     ║ none     ║ false   ║ 0       ║ false   ║
║ isactive         ║ boolean                     ║ none     ║ false   ║ 0       ║ false   ║
║ statusreason     ║ integer                     ║ none     ║ false   ║ 0       ║ false   ║
║ createdbyid      ║ character varying(511)      ║ lzo      ║ false   ║ 0       ║ false   ║
║ islocked         ║ boolean                     ║ none     ║ false   ║ 0       ║ false   ║
║ locktype         ║ integer                     ║ lzo      ║ false   ║ 0       ║ false   ║
║ keyworker        ║ character varying(511)      ║ lzo      ║ false   ║ 0       ║ false   ║
║ inactivedate     ║ timestamp without time zone ║ lzo      ║ false   ║ 0       ║ false   ║
║ current_flag     ║ character varying(511)      ║ lzo      ║ false   ║ 0       ║ false   ║
╚══════════════════╩═════════════════════════════╩══════════╩═════════╩═════════╩═════════╝

I can then do:

analyze compression client;

Giving:

╔════════╦══════════════════╦══════════╦═══════════════════╗
║ table  ║ column           ║ encoding ║ est_reduction_pct ║
╠════════╬══════════════════╬══════════╬═══════════════════╣
║ client ║ id               ║ zstd     ║ 40.59             ║
║ client ║ clientid         ║ delta    ║ 0.00              ║
║ client ║ createdon        ║ zstd     ║ 19.85             ║
║ client ║ updatedon        ║ zstd     ║ 12.59             ║
║ client ║ deletedon        ║ raw      ║ 0.00              ║
║ client ║ lockversion      ║ zstd     ║ 39.12             ║
║ client ║ regionid         ║ zstd     ║ 54.47             ║
║ client ║ officeid         ║ zstd     ║ 88.84             ║
║ client ║ countryid        ║ zstd     ║ 79.13             ║
║ client ║ firstcontactdate ║ zstd     ║ 22.31             ║
║ client ║ didexistprecirts ║ raw      ║ 0.00              ║
║ client ║ isactive         ║ raw      ║ 0.00              ║
║ client ║ statusreason     ║ raw      ║ 0.00              ║
║ client ║ createdbyid      ║ zstd     ║ 52.43             ║
║ client ║ islocked         ║ raw      ║ 0.00              ║
║ client ║ locktype         ║ zstd     ║ 63.01             ║
║ client ║ keyworker        ║ zstd     ║ 38.79             ║
║ client ║ inactivedate     ║ zstd     ║ 25.40             ║
║ client ║ current_flag     ║ zstd     ║ 90.51             ║
╚════════╩══════════════════╩══════════╩═══════════════════╝

i.e. quite different results.

I'm keen to know why this might be. I understand that ~24K records is fewer than the 100K rows AWS specifies as the minimum for a meaningful compression analysis sample, but it still seems strange that COPY and ANALYZE COMPRESSION produce different results for the same 24K-row table.

Accepted Answer

COPY doesn't currently recommend ZSTD, which is why the recommended compression settings differ.

If you're applying compression to permanent tables and want to maximize compression (use the least amount of space), setting ZSTD across the board will give you close-to-optimal compression.
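For example, a minimal sketch of what that looks like in the DDL (column list abbreviated from the CREATE TABLE in the question; applying ZSTD everywhere is a suggestion, not something COPY produced):

```sql
-- Sketch: set ZSTD explicitly on each column at creation time.
CREATE TABLE Client (
    Id        varchar(511) ENCODE zstd,
    ClientId  integer      ENCODE zstd,
    CreatedOn timestamp    ENCODE zstd
    -- ... remaining columns, each with ENCODE zstd
);
```

If you define encodings explicitly like this, load with COMPUPDATE OFF so that COPY's automatic compression analysis does not replace your chosen encodings.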

The reason RAW comes back on some columns is that in this case there is no advantage to applying compression (the column occupies the same number of blocks with and without it). If you know the table will grow, it makes sense to apply compression to those columns as well.
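If the table is already loaded, encodings historically could not be changed in place; the usual pattern is a deep copy into a new table with the desired encodings (newer Redshift releases also support `ALTER TABLE ... ALTER COLUMN ... ENCODE`). A sketch, with the column list abbreviated from the question:

```sql
-- Sketch: deep copy to apply new encodings to an existing table.
CREATE TABLE client_new (
    id       varchar(511) ENCODE zstd,
    clientid integer      ENCODE zstd
    -- ... remaining columns with the desired encodings
);

INSERT INTO client_new SELECT * FROM client;
DROP TABLE client;
ALTER TABLE client_new RENAME TO client;
```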



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow