Redshift COPY creates different compression encodings from ANALYZE COMPRESSION

amazon-redshift amazon-s3 data-compression sql sqlbulkcopy

Question

I've discovered that when importing data (through COPY) into an empty table, the column compression encodings that AWS Redshift recommends (via ANALYZE COMPRESSION) differ from the ones that COPY itself applies automatically.

As an example, I created a table and then loaded data into it from S3 as follows:

CREATE TABLE Client (Id varchar(511) , ClientId integer , CreatedOn timestamp, 
UpdatedOn timestamp ,  DeletedOn timestamp , LockVersion integer , RegionId 
varchar(511) , OfficeId varchar(511) , CountryId varchar(511) ,  
FirstContactDate timestamp , DidExistPre boolean , IsActive boolean , 
StatusReason integer ,  CreatedById varchar(511) , IsLocked boolean , 
LockType integer , KeyWorker varchar(511) ,  InactiveDate timestamp , 
Current_Flag varchar(511) );

Table Client created Execution time: 0.3s

copy Client from 's3://<bucket-name>/<folder>/Client.csv'
credentials 'aws_access_key_id=<access key>; aws_secret_access_key=<secret>'
csv fillrecord truncatecolumns ignoreheader 1
timeformat as 'YYYY-MM-DDTHH:MI:SS'
gzip acceptinvchars compupdate on region 'ap-southeast-2';

Warnings: Load into table 'client' completed, 24284 record(s) loaded successfully. Load into table 'client' completed, 6 record(s) were loaded with replacements made for ACCEPTINVCHARS. Check 'stl_replacements' system table for details.

0 rows affected COPY executed successfully

Execution time: 3.39s

After doing this, I can examine the column compression encodings that COPY has used:

select "column", type, encoding, distkey, sortkey, "notnull" 
from pg_table_def where tablename = 'client';

Giving:

╔══════════════════╦═════════════════════════════╦══════════╦═════════╦═════════╦═════════╗
║ column           ║ type                        ║ encoding ║ distkey ║ sortkey ║ notnull ║
╠══════════════════╬═════════════════════════════╬══════════╬═════════╬═════════╬═════════╣
║ id               ║ character varying(511)      ║ lzo      ║ false   ║ 0       ║ false   ║
║ clientid         ║ integer                     ║ delta    ║ false   ║ 0       ║ false   ║
║ createdon        ║ timestamp without time zone ║ lzo      ║ false   ║ 0       ║ false   ║
║ updatedon        ║ timestamp without time zone ║ lzo      ║ false   ║ 0       ║ false   ║
║ deletedon        ║ timestamp without time zone ║ none     ║ false   ║ 0       ║ false   ║
║ lockversion      ║ integer                     ║ delta    ║ false   ║ 0       ║ false   ║
║ regionid         ║ character varying(511)      ║ lzo      ║ false   ║ 0       ║ false   ║
║ officeid         ║ character varying(511)      ║ lzo      ║ false   ║ 0       ║ false   ║
║ countryid        ║ character varying(511)      ║ lzo      ║ false   ║ 0       ║ false   ║
║ firstcontactdate ║ timestamp without time zone ║ lzo      ║ false   ║ 0       ║ false   ║
║ didexistprecirts ║ boolean                     ║ none     ║ false   ║ 0       ║ false   ║
║ isactive         ║ boolean                     ║ none     ║ false   ║ 0       ║ false   ║
║ statusreason     ║ integer                     ║ none     ║ false   ║ 0       ║ false   ║
║ createdbyid      ║ character varying(511)      ║ lzo      ║ false   ║ 0       ║ false   ║
║ islocked         ║ boolean                     ║ none     ║ false   ║ 0       ║ false   ║
║ locktype         ║ integer                     ║ lzo      ║ false   ║ 0       ║ false   ║
║ keyworker        ║ character varying(511)      ║ lzo      ║ false   ║ 0       ║ false   ║
║ inactivedate     ║ timestamp without time zone ║ lzo      ║ false   ║ 0       ║ false   ║
║ current_flag     ║ character varying(511)      ║ lzo      ║ false   ║ 0       ║ false   ║
╚══════════════════╩═════════════════════════════╩══════════╩═════════╩═════════╩═════════╝

Then, I can:

analyze compression client;

Giving:

╔════════╦══════════════════╦══════════╦═══════════════════╗
║ table  ║ column           ║ encoding ║ est_reduction_pct ║
╠════════╬══════════════════╬══════════╬═══════════════════╣
║ client ║ id               ║ zstd     ║ 40.59             ║
║ client ║ clientid         ║ delta    ║ 0.00              ║
║ client ║ createdon        ║ zstd     ║ 19.85             ║
║ client ║ updatedon        ║ zstd     ║ 12.59             ║
║ client ║ deletedon        ║ raw      ║ 0.00              ║
║ client ║ lockversion      ║ zstd     ║ 39.12             ║
║ client ║ regionid         ║ zstd     ║ 54.47             ║
║ client ║ officeid         ║ zstd     ║ 88.84             ║
║ client ║ countryid        ║ zstd     ║ 79.13             ║
║ client ║ firstcontactdate ║ zstd     ║ 22.31             ║
║ client ║ didexistprecirts ║ raw      ║ 0.00              ║
║ client ║ isactive         ║ raw      ║ 0.00              ║
║ client ║ statusreason     ║ raw      ║ 0.00              ║
║ client ║ createdbyid      ║ zstd     ║ 52.43             ║
║ client ║ islocked         ║ raw      ║ 0.00              ║
║ client ║ locktype         ║ zstd     ║ 63.01             ║
║ client ║ keyworker        ║ zstd     ║ 38.79             ║
║ client ║ inactivedate     ║ zstd     ║ 25.40             ║
║ client ║ current_flag     ║ zstd     ║ 90.51             ║
╚════════╩══════════════════╩══════════╩═══════════════════╝

i.e., very different outcomes.

I'm curious as to why this is. I understand that 24K rows is fewer than the 100K-row sample AWS recommends for a meaningful compression analysis, but it still seems odd that COPY and ANALYZE COMPRESSION reach such different conclusions on the same data.
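
(As a side note, both COPY and ANALYZE COMPRESSION accept a COMPROWS option that controls the sample size used for the compression analysis; the sketch below assumes the documented syntax, and with only 24K rows it mainly matters for larger loads.)

copy Client from 's3://<bucket-name>/<folder>/Client.csv'
credentials 'aws_access_key_id=<access key>; aws_secret_access_key=<secret>'
csv gzip compupdate on comprows 1000000 region 'ap-southeast-2';

analyze compression client comprows 1000000;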


Accepted Answer

ZSTD is not currently among the encodings that COPY recommends, which is why its automatic analysis assigns other encodings.

If you want to optimise compression on permanent tables (i.e. use the least amount of space), setting ZSTD across the board will give you compression that is close to optimal.
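
For example, here is a minimal sketch of how the table from the question could be declared with ZSTD throughout (column list abbreviated; column encodings are attached with the ENCODE keyword):

CREATE TABLE Client (
    Id           varchar(511) ENCODE zstd,
    ClientId     integer      ENCODE zstd,
    CreatedOn    timestamp    ENCODE zstd,
    -- ... remaining columns, each with ENCODE zstd ...
    Current_Flag varchar(511) ENCODE zstd
);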

RAW is coming back for some columns because compressing them offers no benefit in this situation (the data occupies the same number of blocks with and without compression). If you know the table will grow, it can still make sense to apply compression to those columns.
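
If you define the encodings yourself, you will usually want COPY to load with those definitions rather than re-run its own analysis; here is a sketch of the same load with automatic compression turned off (COMPUPDATE OFF is the documented way to do this):

copy Client from 's3://<bucket-name>/<folder>/Client.csv'
credentials 'aws_access_key_id=<access key>; aws_secret_access_key=<secret>'
csv fillrecord truncatecolumns ignoreheader 1
timeformat as 'YYYY-MM-DDTHH:MI:SS'
gzip acceptinvchars compupdate off region 'ap-southeast-2';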



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow