PTDQ is a video system for AGA Amigas that provides a chunky-to-planar method which is faster than the traditional ones. It is the higher-quality brother of PTDS (formerly PED81C), another system based on the same core principle.
SIMPLIFIED COMPARISON CHART
----------------+------------+----------------+--------------+---------+------
                | horizontal | maximum number | color choice | visual  |
system          | resolution | of colors      | freedom      | quality | speed
----------------+------------+----------------+--------------+---------+------
PTDQ            | full       | 256            | **           | **      | **
PTDS            | half       | 81             | *            | *       | ***
traditional C2P | full       | 256            | ***          | ***     | *
[The video quality of the real machine output is heavily affected by the fact that the scandoubler did not support SHRES (so a real-time software trick was used to somehow produce the colors, although it is only a visual illusion and causes a sort of rasterline effect), the monitor did not support progressive PAL and the video was captured with an ancient phone at just 24.917 Hz. YouTube's compression degraded the video quality.]
Full details are provided in documentation included in the archive that can be downloaded from https://retream.itch.io/ptdq.
RETREAM - retro dreams for Amiga, Commodore 64 and PC
Hypothetically, could this be used in an operating system app, especially a web browser that has an off-screen buffer where the HTML/CSS has been rendered into a chunky bitmap, and that then needs to blit sections of that buffer into the real screen buffer to show the rendering in a window?
It looks really great and fast :) But what games profit from it? And how does it work? Can I just load this, install it and start something like Wolf3D to get more fps?
This video shows the Amiga AGA chipset combining 3 full screen 8-bit layers (or playfields, if you prefer) using various 8-bit alpha values.
LAYERS
Background:
* PTDQ system
* 320x200 dots
* max 256 colors
Middleground:
* PTDS system
* 160x200 logical dots, 319x200 physical dots
* max 16 non-transparent colors
* each base color can have an arbitrary 8-bit alpha (actually used: 0 for complete transparency, 128 for dark colors, 255 for bright colors)
* "native" chunky dots (i.e. each byte in the layer buffer corresponds to a dot)
* triple buffer
Foreground:
* PTDQ system
* 320x200 dots
* max 81 non-transparent colors
* each base color can have an arbitrary 8-bit alpha (actually used: 0 for complete transparency, 192 for see-through graphics, 255 for solid graphics)
NOTES
* The color model is RGBW for all the layers, but each layer could use a color model of its own without making any difference performance-wise.
* If the middleground had used PTDQ, its size would have been 320x200 dots and its maximum number of non-transparent colors would have been 81. However, that would have required the PTDQ C2P conversion.
* If the middleground did not use 100% transparent dots, its maximum number of colors would have been 81.
* If the foreground did not use 100% transparent dots, its maximum number of colors would have been 256.
* The display size is actually 319x200 dots to hide the leftmost column of dots, as PTDS requires a 1-dot shift to the right for the even bitplanes.
* The ball is rendered by scaling and flipping a 128x128 chunky bitmap in real time.
* The ball is wiped by means of both CPU and Blitter. The logic that handles the geometry still needs to be refined in order to provide a massive speedup.
* For convenience, the video has been recorded with WinUAE.
* On a stock Amiga 1200, the demo runs at 50 fps except when the ball covers most of the screen (in that case, the frame rate drops proportionally to the size of the ball); the slowdowns will be greatly reduced once the wiping is optimized.
* On an accelerated Amiga, the demo runs at a steady 50 fps.
* YouTube's encoding reduced the saturation of colors.
Time for a new demo. This one also shows the AGA chipset combining 3 full screen 8-bit layers (or playfields, if you prefer) using various 8-bit alpha values. It isn't available for download yet as I still have to finish writing some more technical details.
LAYERS
Common:
* 320x256 dots
* PTDQ system
* RGBWa color model
Background:
* maximum 256 colors
* horizontal scrolling
Middleground:
* maximum 256 colors
* colors use 8-bit alpha values for the fade-in and cross-fading effects
* vertical scrolling
Foreground:
* maximum 81 non-transparent colors
* triple buffer
NOTES
* All the layers reside in CHIP RAM.
* If the foreground did not use 100% transparent dots, its maximum number of colors would have been 256.
* The horizontal scrolling requires the background layer to be at least twice as wide as the screen plus 32 dots (i.e. 16 bytes per line, which in total waste 16*256 = 4096 bytes of CHIP RAM). Making the layer an additional D dots wider (where D is divisible by 64) would waste a further D/4 bytes of CHIP RAM per line.
* The background layer can also be made to scroll vertically without problems.
* The Thalion logo is scaled in real time on a chunky raster which gets written to the foreground layer by means of the PTDQ C2P conversion routine PTDQ_DoC2P_R().
* The 24-bit palettes for the fade-in and cross-fading effects are pre-calculated at startup; the effects are obtained by writing a whole palette to the COLORxx registers each frame with the CPU during the vertical blanking.
* The music is a tracker module played by means of P6112.
* On a stock Amiga 1200, the demo runs at 50 fps.
* YouTube's encoding degraded the quality.
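The CHIP RAM figures in the scrolling note can be reproduced with a small Python sketch. The function name and its parameters are made up for illustration and are not part of PTDQ; only the arithmetic (16 extra bytes per line, plus D/4 bytes per line for D extra dots) comes from the note above.

```python
def scroll_waste_bytes(lines=256, extra_dots=0):
    """CHIP RAM wasted by the horizontally scrolling background layer.

    Hypothetical helper: it only mirrors the arithmetic given in the notes.
    """
    if extra_dots % 64 != 0:
        raise ValueError("extra_dots must be divisible by 64")
    base_per_line = 16                 # the 32 extra dots, i.e. 16 bytes
    extra_per_line = extra_dots // 4   # D/4 bytes per line for D extra dots
    return (base_per_line + extra_per_line) * lines


print(scroll_waste_bytes())                # 4096
print(scroll_waste_bytes(extra_dots=64))   # 8192
```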
Edited by saimo on 2025/9/21 21:13:54 Edited by saimo on 2025/9/21 21:40:09
I have uploaded the demo now. You can get it from https://retream.itch.io/ptdq. The delay was due to the fact that I wanted to add the following section to the documentation.
PERFORMANCE
The following calculations evaluate the performance on an Amiga without FAST RAM.
Given that the screen is 1280 SHRES pixels wide, 256 lines tall and 8 bitplanes
deep, the amount of bytes fetched each frame by the bitplanes DMA is
1280/8*256*8 = 327680. However, thanks to the AGA 64-bit data fetch, the number
of reads per frame is 327680/8 = 40960.
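As a quick check, the bitplane DMA figures above can be recomputed with a few
lines of Python. This is only a sketch of the arithmetic: the function and
parameter names are illustrative and do not come from the PTDQ sources.

```python
def bitplane_dma_reads(width_px=1280, lines=256, depth=8, fetch_bytes=8):
    """Return (bytes fetched per frame, DMA reads per frame)."""
    bytes_per_frame = width_px // 8 * lines * depth   # 8 pixels per byte per plane
    reads_per_frame = bytes_per_frame // fetch_bytes  # 64-bit fetch = 8 bytes/read
    return bytes_per_frame, reads_per_frame


total_bytes, reads = bitplane_dma_reads()
print(total_bytes, reads)   # 327680 40960
```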
The pulse effect applied to the Thalion logo redraws every frame an area of
64x46 dots, which amounts to 64*46 = 2944 bytes (as each dot corresponds to a
byte).
Given that the scaling reads and writes the dots by byte (from many tests, it
turned out to be the fastest solution on a stock Amiga 1200), it needs 2944*2 =
5888 memory accesses. Such accesses are performed one after another, without
other instructions causing delays. However, the CPU is never granted access to
the CHIP bus twice in a row, so the accesses actually take 5888*2 = 11776 color
clocks.
After the scaling, the data needs to be read, C2P-converted and then written to
the foreground bitplanes, for a total of 2944*2 = 5888 bytes. To convert 32
bytes, PTDQ_DoC2P_R() performs 9 memory accesses in parallel with other
operations and 7 accesses right after another access, thus needing 9+7*2 = 23
color clocks. Therefore, ignoring for simplicity a marginal overhead, it needs
(5888/32)*23 = 4232 color clocks.
In all, the pulse effect needs 11776+4232 = 16008 color clocks.
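The cost of the pulse effect can be sketched in Python as follows. The names
are hypothetical; the constants (2 color clocks per consecutive CPU access, 9
parallel and 7 consecutive accesses per 32 bytes) are the ones stated above,
and the marginal overhead is ignored here too.

```python
CPU_ACCESS_CC = 2   # consecutive CPU accesses to CHIP RAM cost 2 color clocks each


def scaling_cc(width=64, height=46):
    """Color clocks needed by the byte-wise scaling of the redrawn area."""
    dots = width * height       # one byte per dot
    accesses = dots * 2         # one byte read plus one byte write per dot
    return accesses * CPU_ACCESS_CC


def c2p_cc(width=64, height=46, parallel=9, consecutive=7, block=32):
    """Color clocks needed by the C2P pass, following the text's estimate."""
    data = width * height * 2                              # bytes read and written
    cc_per_block = parallel + consecutive * CPU_ACCESS_CC  # 9 + 7*2 = 23
    return data // block * cc_per_block


print(scaling_cc())               # 11776
print(c2p_cc())                   # 4232
print(scaling_cc() + c2p_cc())    # 16008
```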
When a fade-in or cross-fading effect takes place, the CPU also writes 24-bit
values to all the COLORxx registers. So it needs to
* read 256*4 = 1024 bytes from CHIP RAM,
* write 256*4 = 1024 bytes to COLORxx and
* perform 8*2 = 16 writes to BPLCON3.
Both reads and writes are made by longword (thus setting 2 COLORxx registers at
a time), however the writes internally execute as two word writes, for a total
of 3 accesses per 2 COLORxx registers. Given that all these accesses are
consecutive, they take 3*2 = 6 color clocks. The fact that the high and the low
order bits of the colors have to be written separately means that for each
color 2 accesses are needed, thus cancelling the advantage of processing the
colors by longword. As a result 256*6 = 1536 color clocks are needed.
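To illustrate why each color needs two writes: on AGA a COLORxx write carries
only 4 bits per component, and the LOCT bit in BPLCON3 selects whether those
bits are the high or the low nibbles. The Python sketch below shows one way to
split a 24-bit color into the two 12-bit values; the function name is
hypothetical and the demo's actual palette format is not documented here.

```python
def split_rgb24(rgb):
    """Split a 0xRRGGBB color into (high nibbles, low nibbles) as 12-bit words."""
    r, g, b = (rgb >> 16) & 0xFF, (rgb >> 8) & 0xFF, rgb & 0xFF
    high = ((r >> 4) << 8) | ((g >> 4) << 4) | (b >> 4)      # written with LOCT clear
    low = ((r & 0xF) << 8) | ((g & 0xF) << 4) | (b & 0xF)    # written with LOCT set
    return high, low


high, low = split_rgb24(0x123456)
print(f"{high:03X} {low:03X}")   # 135 246
```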
The writes to BPLCON3 are spaced a little bit by means of a few instructions,
so, for simplicity, it can be assumed that each of them needs 1 color clock.
In all, a fade-in or cross-fading effect needs 1536+16 = 1552 color clocks.
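The palette update cost can be recomputed with the sketch below. The names are
illustrative. Reading the 8*2 BPLCON3 writes as "8 banks of 32 colors, once for
the high and once for the low order bits" is an interpretation, not something
stated above.

```python
def fade_cc(colors=256, banks=8):
    """Color clocks needed to write a full 24-bit palette with the CPU."""
    # Per pass over 2 colors: 1 longword read + 2 word writes = 3 accesses,
    # each taking 2 color clocks because they are consecutive.
    cc_per_pair_per_pass = 3 * 2
    passes = 2                                   # high and low order bits
    color_cc = colors // 2 * cc_per_pair_per_pass * passes
    bplcon3_cc = banks * 2 * 1                   # spaced writes, 1 color clock each
    return color_cc, bplcon3_cc


color_cc, bplcon3_cc = fade_cc()
print(color_cc, bplcon3_cc, color_cc + bplcon3_cc)   # 1536 16 1552
```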
When the staggered lines option is on, the Copper alters BPLCON1 once per line
by means of a WAIT and a MOVE instruction, which need 4 memory accesses, for a
total of 256*4 = 1024 additional accesses. Moreover, due to the horizontal
scrolling, the CPU needs to update the values of the MOVE instructions in the
Copperlist, performing 256 consecutive writes which require 256*2 = 512 color
clocks.
In all, the staggered lines need 1024+512 = 1536 color clocks.
(Note: using a Copper loop to have just two MOVE instructions that can be
quickly modified by the CPU does not provide any speed benefit as each jump,
executed by a MOVE to COPJMP2, would require 2 more reads, thus cancelling the
gain on the CPU side.)
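The staggered lines figure can be checked in the same way. This is a sketch
with made-up names: it assumes, as above, one WAIT and one MOVE per line (each
Copper instruction being 2 words) and one consecutive CPU write per line.

```python
def staggered_lines_cc(lines=256):
    """Color clocks taken by the staggered lines option in one frame."""
    copper_accesses = lines * 2 * 2   # WAIT + MOVE, 2 words each
    cpu_cc = lines * 2                # consecutive CPU writes, 2 color clocks each
    return copper_accesses + cpu_cc


print(staggered_lines_cc())   # 1536
```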
The demo also plays music, which requires some color clocks for the audio data
DMA. Due to how music playback works, there is no simple way to calculate its
load on the CHIP bus. However, it is possible to calculate the load in a very
heavy case (and certainly worse than the actual one) by assuming that all the 4
channels play continuously at almost the maximum frequency allowed, i.e., for
simplicity, 28500 Hz. That means that 28500*4 = 114000 bytes need to be read per
second, i.e. about 114000/50 = 2280 bytes per frame. Given that the data is
fetched by words, those amount to 2280/2 = 1140 memory accesses per frame.
(Note: if the quality were half of the one considered - which is closer to
reality - the accesses would be 1140/2 = 570, but such value would not make much
difference anyway.)
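The worst-case audio estimate can be reproduced with this short Python sketch
(hypothetical function name; same simplifying assumptions as above, i.e. all
channels playing continuously at the same rate and data fetched by words).

```python
def audio_dma_accesses(rate_hz=28500, channels=4, fps=50):
    """Audio DMA word fetches per frame."""
    bytes_per_frame = rate_hz * channels // fps   # one byte per sample
    return bytes_per_frame // 2                   # data is fetched by words


print(audio_dma_accesses())                # 1140
print(audio_dma_accesses(rate_hz=14250))   # 570
```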
Putting all the figures together, the total number of color clocks needed is
40960+16008+1552+1140 = 59660 if the staggered lines are off and 59660+1536 =
61196 if the staggered lines are on. Considering that a PAL Amiga has 313*227 =
71051 color clocks per (long) frame, those figures represent respectively
100*59660/71051 = 84.0% and 100*61196/71051 = 86.1% of the color clocks
available in a frame.
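For reference, the whole budget can be summed up with the Python sketch below.
The constant names are illustrative; the values are the per-frame figures
derived in the previous paragraphs.

```python
BITPLANES = 40960       # bitplane DMA reads
PULSE = 16008           # scaling + C2P of the logo
FADE = 1552             # palette update
AUDIO = 1140            # worst-case audio DMA
STAGGERED = 1536        # Copper + CPU work for the staggered lines
FRAME_CC = 313 * 227    # color clocks in a (long) PAL frame


def chip_bus_load(staggered):
    """Return (color clocks used, percentage of the frame)."""
    used = BITPLANES + PULSE + FADE + AUDIO + (STAGGERED if staggered else 0)
    return used, 100 * used / FRAME_CC


for staggered in (False, True):
    used, pct = chip_bus_load(staggered)
    print(f"staggered={staggered}: {used} of {FRAME_CC} color clocks ({pct:.1f}%)")
```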
It might seem that there is some room to perform other operations, but the above
calculations do not take into account the reads performed by the CPU to fetch
the instructions (which do occur often as the 68020 has only a tiny instruction
cache of 256 bytes); moreover, the CPU is not just reading and writing data, but
also performing other operations (which only partially overlap with writes).
Therefore, unfortunately, there is not really much more that can be done in a
frame time.