Current group: comp.arch

RISC vs. CISC design principles

From:Paul A. Clayton
Subject:RISC vs. CISC design principles
Date:12 Jan 2005 17:15:23 GMT
In the CISC vs. RISC debate, it seems that the design principles
behind each are generally not considered in their historical
context.

ISTM that a major concern for CISC was memory capacity. This
concern is expressed in efforts at code density (variable length
instructions, complex instructions [often targeting the
'common case' of higher level programming, meshing with semantic
gap theory], implicit arguments [leading to special-purpose
registers], and fewer registers [fewer bits to encode]), hardware
support of unaligned loads (to improve data density), and
finer-grained memory protection (segment-based rather than
page-based). In earlier hardware the cost of unaligned loads may
have been smaller due to the smaller width of memory interfaces.
The cost of ROM relative to RAM (especially fast RAM) may have
tended to encourage the use of static micro-code even beyond the
general code density advantage. Earlier systems may also have
used fewer segments per application and less dynamic resizing,
possibly making segmentation more efficient.

Also earlier semantic gap theory may have seemed more reasonable,
making compilers (and assembly-level programming) simpler when
programming effort was perhaps a greater consideration (and as
mentioned above it meshes well with targeting code density).

OTOH, the main RISC design principles are pipelinability (leading
to fixed-sized instructions, few instruction formats, simpler
memory addressing modes, etc.), compiler optimization ('reduced'
operations can be independently scheduled, a relatively large
number of [fast] registers allows software to cache data [e.g.,
Common Subexpression Elimination] and use faster/less expensive
procedure interfaces), and simpler/faster hardware (in addition
to pipelinability aids, aligned memory accesses [which can
simplify a usually time-critical path] and 'reduced'
instructions [particularly separation of memory accesses from
other operations]).

The design principles of RISC place more burden on the compiler,
which may allow system developers to take advantage of
late-binding. It would certainly seem to allow system developers
to leverage a greater volume of software developers relative to
hardware developers.

At current hardware budgets, the aligned memory access
requirement is probably the least useful of the RISC mechanisms.
The largish number of general purpose registers may become more
burdensome if microthreading becomes more common, but the
benefits of this mechanism still seem to outweigh the
disadvantages significantly. With the exception of some embedded
systems (for which two-sized instructions are common),
fixed-sized instructions seem to provide more benefit than cost.
The emphasis on scalar decode may be considered a weakness of
RISC in the world of superscalar processing, though the simple
generally explicit encoding of RISC does help somewhat even
there.

CISC design principles may be said to depend too much on
expensive memory capacity to remain practical in most modern
circumstances.


Paul A. Clayton
(a 'Dysthymicdolt' reachable at aol.com)
From:MrTibbs
Subject:Re: Unaligned accesses (was Re: RISC vs. CISC design principles)
Date:15 Jan 2005 22:05:14 -0800
> Putting only a few extra gates on a chip to allow unaligned accesses,

I'm not so sure it's just a few gates. What if the unaligned access
crosses a cache line boundary, and one line is in the cache and one
isn't? What if it crosses a page boundary, and blah blah...

There's the MOESI/whatever protocol for multiprocessors as well.
Although few programs may do unaligned accesses on shared memory, it
has to work right if it is advertised.

It may or may not be a few gates, but I think the hardware folks, with
unaligned accesses, now have to deal with a whole bunch of corner cases
that they wouldn't have to consider otherwise.

jim
From:John Savard
Subject:Re: Unaligned accesses (was Re: RISC vs. CISC design principles)
Date:Mon, 17 Jan 2005 00:25:25 GMT
On 15 Jan 2005 22:05:14 -0800, "MrTibbs" wrote,
in part:

>> Putting only a few extra gates on a chip to allow unaligned accesses,
>
>I'm not so sure it's just a few gates. What if the unaligned access
>crosses a cache line boundary, and one line is in the cache and one
>isn't? What if it crosses a page boundary, and blah blah...

You make it into a few gates by turning an unaligned access into
multiple accesses of smaller things. If you want a smaller performance
penalty, *then* it's more gates.

John Savard
http://home.ecn.ab.ca/~jsavard/index.html
From:Seongbae Park
Subject:Re: Unaligned accesses (was Re: RISC vs. CISC design principles)
Date:Tue, 18 Jan 2005 16:58:46 +0000 (UTC)
John Savard wrote:
> On 15 Jan 2005 22:05:14 -0800, "MrTibbs" wrote,
> in part:
>
>>> Putting only a few extra gates on a chip to allow unaligned accesses,
>>
>>I'm not so sure it's just a few gates. What if the unaligned access
>>crosses a cache line boundary, and one line is in the cache and one
>>isn't? What if it crosses a page boundary, and blah blah...
>
> You make it into a few gates by turning an unaligned access into
> multiple accesses of smaller things.

You can't simply turn it into multiple smaller accesses
without locking multiple cache lines (or potentially even TLB entries
if it crosses page boundary)
if the ISA defines the memory operations to be atomic (most ISAs do).
Locking multiple anything will cost more than "just a few gates"
if otherwise you don't need to do so.
--
#pragma ident "Seongbae Park, compiler, http://blogs.sun.com/seongbae/"
From:Terje Mathisen
Subject:Re: Unaligned accesses (was Re: RISC vs. CISC design principles)
Date:Tue, 18 Jan 2005 21:07:12 +0100
Seongbae Park wrote:

> John Savard wrote:
>
>>You make it into a few gates by turning an unaligned access into
>>multiple accesses of smaller things.
>
> You can't simply turn it into multiple smaller accesses
> without locking multiple cache lines (or potentially even TLB entries
> if it crosses page boundary)
> if the ISA defines the memory operations to be atomic (most ISAs do).
> Locking multiple anything will cost more than "just a few gates"
> if otherwise you don't need to do so.

In the cases we've been discussing allowing mis-aligned accesses to be
not atomic wouldn't cost anything at all:

After all, this is what the alternative sequence has to do anyway, right?

I.e. I'd be perfectly happy with a "best effort" alignment handler in hw:

Load a single item (quickly) if aligned, otherwise load two items into
the barrel shifter, shift to align, and return the result.

This would be at least comparable to an explicit sw sequence to do the
same task, and it would simplify programming quite a bit.
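
As an illustration only (not part of the original post), the "best effort" load described above can be sketched in C. The helper name load32_any, the word-array interface, and the little-endian byte numbering are assumptions made for the example; a hardware implementation would do the same thing with a funnel shifter rather than in software.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: emulate a 32-bit load at an arbitrary byte offset
 * using only aligned 32-bit loads. Byte 0 of each word is assumed to be
 * its least significant byte (little-endian numbering). */
uint32_t load32_any(const uint32_t *words, size_t nwords, size_t byte_off)
{
    size_t idx = byte_off / 4;                      /* aligned word holding the first byte */
    unsigned shift = (unsigned)(byte_off % 4) * 8;  /* misalignment in bits: 0, 8, 16, 24 */

    assert(idx < nwords);
    if (shift == 0)
        return words[idx];                          /* fast path: one aligned load */

    assert(idx + 1 < nwords);
    /* Slow path: two aligned loads, then shift and merge. */
    return (words[idx] >> shift) | (words[idx + 1] << (32 - shift));
}
```

Note that the slow path is two separate loads, so it is not atomic, which matches the "non-atomic is acceptable" position taken above.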

(I.e. aligned writes and misaligned reads are nearly the same speed as
having both aligned on most x86 implementations!)

Using a LOCK prefix should trap in such a case.

Terje

--
-
"almost all programming can be viewed as an exercise in caching"
From:Eric P.
Subject:Re: Unaligned accesses (was Re: RISC vs. CISC design principles)
Date:Tue, 18 Jan 2005 16:01:22 -0500
Terje Mathisen wrote:
>
> Seongbae Park wrote:
>
> > John Savard wrote:
> >
> >>You make it into a few gates by turning an unaligned access into
> >>multiple accesses of smaller things.
> >
> > You can't simply turn it into multiple smaller accesses
> > without locking multiple cache lines (or potentially even TLB entries
> > if it crosses page boundary)
> > if the ISA defines the memory operations to be atomic (most ISAs do).
> > Locking multiple anything will cost more than "just a few gates"
> > if otherwise you don't need to do so.
>
> In the cases we've been discussing allowing mis-aligned accesses to be
> not atomic wouldn't cost anything at all:

Note that the Intel x86 does NOT guarantee atomic access to
nonaligned values that straddle 32 byte cache lines.
(Vol 3, Sys Prog Guide, section 7.1.1)

> After all this is what the alternative sequence have to do anyway, right?
>
> I.e. I'd be perfectly happy with a "best effort" alignment handler in hw:
>
> Load a single item (quickly) if aligned, otherwise load two items into
> the barrel shifter, shift to align, and return the result.

Most of this hw support would likely already be present in the L1 data
cache as it is required for byte and aligned word/dword/qword access.
Nonaligned access should require only minor extensions.

> This would be at least comparable to an explicit sw sequence to do the
> same task, and it would simplify programming quite a bit.

The sw trap incurs a pipeline flush that a hw sequencer does not.

> (I.e. aligned writes and misaligned reads are nearly the same speed as
> having both aligned on most x86 implementations!)
>
> Using a LOCK prefix should trap in such a case.

Hmmm... what else might be affected?

- Load-Store queue must do more complex overlap checks before
allowing read or write reordering

- On store operations that straddle pages, MMU must probe TLB for
both pages before starting so they do not fault half way through.
If both are valid then emit physical addresses to L1.

- Write combine buffer must do more complex check for straddles.
Also must try not to evict one needed part when loading another.

Anything else?

Eric
From:Terje Mathisen
Subject:Re: Unaligned accesses (was Re: RISC vs. CISC design principles)
Date:Wed, 19 Jan 2005 08:35:22 +0100
Eric P. wrote:

> Terje Mathisen wrote:
> Hmmm... what else might be affected?
>
> - Load-Store queue must do more complex overlap checks before
> allowing read or write reordering

Not too much though: Currently it must take into consideration both base
and length of each operation; this extension could conservatively
extend this to be the aligned base, and the extended length.
>
> - On store operations that straddle pages, MMU must probe TLB for
> both pages before starting so they do not fault half way through.
> If both are valid then emit physical addresses to L1.
>
> - Write combine buffer must do more complex check for straddles.
> Also must try not to evict one needed part when loading another.

None of these would seem to apply if the store that crosses a cache line
boundary is turned into multiple micro-ops, with traps allowed between
them. I.e. in case of a store that traps halfway, the first half could
get written either once or twice, with no guarantee of what would
actually happen, except that both halves would eventually make it to the
destination.

Terje

--
-
"almost all programming can be viewed as an exercise in caching"
From:Eric P.
Subject:Re: Unaligned accesses (was Re: RISC vs. CISC design principles)
Date:Wed, 19 Jan 2005 15:05:52 -0500
Terje Mathisen wrote:
>
> Eric P. wrote:
>
> > Terje Mathisen wrote:
> > Hmmm... what else might be affected?
> >
> > - Load-Store queue must do more complex overlap checks before
> > allowing read or write reordering
>
> Not too much though: Currently it must take into consideration both base
> and length of each operation; this extension could conservatively
> extend this to be the aligned base, and the extended length.

Oops yes, I should have seen that. I kept thinking this required
arithmetic. If the largest operand is 8 bytes then round down to
a 16 byte boundary and check for overlap on the 16 byte blocks.
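
A minimal sketch of that conservative check, offered as an illustration rather than as anything a real load-store queue does: the function name may_overlap and the BLOCK_SHIFT constant are made up for the example, and the 8-byte maximum operand size is taken from the text above. Because an unaligned operand can spill into the next 16-byte block, the sketch compares the first and last block touched by each access, which is one reading of "the aligned base, and the extended length".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SHIFT 4   /* assumed granule: 16-byte blocks */

/* Returns true if two accesses touch a common 16-byte block.
 * Conservative: it may report an overlap when the byte ranges are
 * actually disjoint, but it never misses a real overlap. */
bool may_overlap(uint64_t addr_a, unsigned len_a,
                 uint64_t addr_b, unsigned len_b)
{
    assert(len_a >= 1 && len_a <= 8);
    assert(len_b >= 1 && len_b <= 8);

    uint64_t first_a = addr_a >> BLOCK_SHIFT;
    uint64_t last_a  = (addr_a + len_a - 1) >> BLOCK_SHIFT;
    uint64_t first_b = addr_b >> BLOCK_SHIFT;
    uint64_t last_b  = (addr_b + len_b - 1) >> BLOCK_SHIFT;

    /* Only shifts and compares on block numbers; no byte-exact range test. */
    return first_a <= last_b && first_b <= last_a;
}
```

The price of the simplification is false positives: two disjoint 8-byte accesses inside the same 16-byte block are reported as overlapping, which only costs some lost reordering.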


Eric
From:MitchAlsup at aol.com
Subject:Re: RISC vs. CISC design principles
Date:17 Jan 2005 09:03:46 -0800
"Why aren't Intel/AMD doing this already?"
Rumor has it that Dothan does this, at least for integer moves.

Mitch
From:Terje Mathisen
Subject:Re: RISC vs. CISC design principles
Date:Mon, 17 Jan 2005 21:48:29 +0100
MitchAlsup@aol.com wrote:

> "Why aren't Intel/AMD doing this already?"
> Rumor has it that Dothan does this, at least for integer moves.



Terje

--
-
"almost all programming can be viewed as an exercise in caching"
From:already5chosen at yahoo.com
Subject:Re: RISC vs. CISC design principles
Date:19 Jan 2005 12:04:38 -0800
What is your definition for "all x86 CPUs"?
I would imagine it doesn't include outsiders like SiS, Transmeta and
Geode GX.
How about VIA?
Does the definition include P6 that currently has near-zero market
share but still dominates installed base?
From:MitchAlsup at aol.com
Subject:Re: RISC vs. CISC design principles
Date:12 Jan 2005 11:56:35 -0800
"At current hardware budgets, the aligned memory access
requirement is probably the least useful of the RISC mechanisms."

I, respectfully, disagree.

At current hardware budgets, the least useful RISC mechanism is the
fixed length instruction format. Both Intel and AMD have shown that
they/we can decode just as many instructions per unit time as the RISC
guys.

Consider a x86 machine like an Athlon (or P3 or P4). How much
performance is sacrificed by having to decode multibyte instructions?
Answer; with modern branch prediction, the added pipe stages extract a
penalty around only 1%-2% compared to fixed length machines! Yet these
decoded multibyte instructions contain more semantic units of work than
the equivalent 4-ish wide RISC decoders. But, in neither category of
machines is the basic throughput significantly dependent upon the
performance of the decoder(s)!

I could address the rest of the statements piecemeal, however, the
general premiss is wrong. The evolution of x86 is proceeding faster
than the evolution of other CPUs because of the amount of cubic dollars
that can be thrown at teams of designers to solve yesterday's problems
and develop tomorrow's monster machines. Cubic dollars beats
architectural cleanliness every time.

Mitch
From:Paul Rubin
Subject:Re: RISC vs. CISC design principles
Date:12 Jan 2005 12:00:18 -0800
MitchAlsup@aol.com writes:
> Consider a x86 machine like an Athlon (or P3 or P4). How much
> performance is sacrificed by having to decode multibyte instructions?
> Answer; with modern branch prediction, the added pipe stages extract a
> penalty around only 1%-2% compared to fixed length machines!

What do you mean by that? Don't those added pipe stages and decoder
logic burn a lot of silicon area that could be used for more
functional units or caches or something? What about the x86's
register starvation, couldn't code run faster with more registers?
The x86-64 supports more registers (16 instead of 8), but 16 still
isn't an awful lot, and it makes the instructions longer.
From:prep at prep.synonet.com
Subject:Re: Unaligned accesses
Date:Tue, 18 Jan 2005 04:47:22 +0800
jsavard@excxn.aNOSPAMb.cdn.invalid (John Savard) writes:

> Putting only a few extra gates on a chip to allow unaligned
> accesses, and then warning programmers that these accesses will have
> a performance penalty, so they should not be used unless really
> needed, is usually the best tradeoff, though. It eliminates a
> potential source of confusion and error at the lowest cost.

Because you are paying the gate delay penalty for EVERY access
that now has to go through them.

See the Alpha papers for the messy details, and more.

--
Paul Repacholi 1 Crescent Rd.,
+61 (08) 9257-1001 Kalamunda.
West Australia 6076
comp.os.vms,- The Older, Grumpier Slashdot
Raw, Cooked or Well-done, it's all half baked.
EPIC, The Architecture of the future, always has been, always will be.
From:Terje Mathisen
Subject:Re: Unaligned accesses
Date:Tue, 18 Jan 2005 08:41:05 +0100
prep@prep.synonet.com wrote:

> jsavard@excxn.aNOSPAMb.cdn.invalid (John Savard) writes:
>
>
>>Putting only a few extra gates on a chip to allow unaligned
>>accesses, and then warning programmers that these accesses will have
>>a performance penalty, so they should not be used unless really
>>needed, is usually the best tradeoff, though. It eliminates a
>>potential source of confusion and error at the lowest cost.
>
>
> Because you are paying the gate delay penalty for EVERY access
> that now has to go through them.

Is that really true?

_Either_ you pay the gate delay penalty of being able to detect
misaligned accesses, and convert those to a trap,

_or_ you pay the gate delay penalty of being able to detect misaligned
accesses, and convert those into slower/microcoded sequences.
:-)

I'll accept that generating a trap is probably easier, since you need
that for other problem cases (i.e. out-of-bounds) anyway, but the HW
that allows the cpu to do a realtime decision of the path to follow
should be very similar.

It is only if/when the trap is async that this really becomes worrisome,
since at this point the cpu must revert to the last checkpoint and
singlestep forward to the point of the trap.

If the same mechanism is used to handle misaligned accesses, then they
will be so slow as to make the alternate (aligned only) code sequence
faster except when misalignment is very rare.

OK, I guess I'm sorta/reluctantly agreeing with you. :-(

Terje

--
-
"almost all programming can be viewed as an exercise in caching"
From:Wilco Dijkstra
Subject:Re: Unaligned accesses
Date:Tue, 18 Jan 2005 22:41:30 GMT

"Terje Mathisen" wrote in message
news:csieii$uf6$1@osl016lin.hda.hydro.com...
> prep@prep.synonet.com wrote:
>
> > jsavard@excxn.aNOSPAMb.cdn.invalid (John Savard) writes:
> >
> >
> >>Putting only a few extra gates on a chip to allow unaligned
> >>accesses, and then warning programmers that these accesses will have
> >>a performance penalty, so they should not be used unless really
> >>needed, is usually the best tradeoff, though. It eliminates a
> >>potential source of confusion and error at the lowest cost.
> >
> >
> > Because you are paying the gate delay penalty for EVERY access
> > that now has to go through them.
>
> Is that really true?

No, it's not. Those extra gates are already needed to select words,
halfwords and bytes, endian swap them (perhaps dynamically) and
zero or signextend them as necessary. If you look at it you're close
to a full crossbar switch already, so it isn't much more work to
support unaligned accesses. Initial Alphas didn't support any of
this as they didn't have those gates indeed, but they did pay for this
as code using chars and shorts ran slow.

ARM is perhaps the only RISC that added support for unaligned
access due to customer demand. It speeds up code that occasionally
does do unaligned accesses as the cost on ARMs is high (>10x
slower than an aligned access as ARM has no funnel shifter).
It's essential for SIMD as unaligned accesses often outnumber
aligned ones (eg. SAD in motion estimation).

As Stephen Fuld guessed the hardware people didn't like it initially
but then again ARM already has instructions that can straddle up to 4
cache lines...

> _Either_ you pay the gate delay penalty of being able to detect
> misaligned accesses, and convert those to a trap,
>
> _or_ you pay the gate delay penalty of being able to detect misaligned
> accesses, and convert those into slower/microcoded sequences.
> :-)
>
> I'll accept that generating a trap is probably easier, since you need
> that for other problem cases (i.e. out-of-bounds) anyway, but the HW
> that allows the cpu to do a realtime decision of the path to follow
> should be very similar.

Indeed you have a lot more time for a trap as you only have to generate
it just before the cache returns the hit signal. However generating an
unaligned signal is so easy it can be done during effective address
generation at virtually no cost. This can then be used to stall the load
store unit for an extra cycle to access the other cacheline (the ARM11
does this). If the execution units are statically scheduled you'll have
to replay the load, but since cachelines are large nowadays this doesn't
matter much (see below).

> It is only if/when the trap is async that this really becomes worrisome,
> since at this point the cpu must revert to the last checkpoint and
> singlestep forward to the point of the trap.
>
> If the same mechanism is used to handle misaligned accesses, then they
> will be so slow as to make the alternate (aligned only) code sequence
> faster except when misalignment is very rare.

Assuming a 10-cycle cost for an unaligned word access crossing a
64-byte cacheline it would take 192 cycles for the replay mechanism
to be worse! So in principle it would be possible to add unaligned
access to a CPU that doesn't support it by taking a trap, inserting the
instructions for an unaligned access using a micro code engine and still
get a (small) speedup :-)
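
The post does not spell out how 192 is derived. One reading that reproduces the figure, shown here purely as an assumption: the software sequence costs 10 cycles on every potentially unaligned access, a hardware access that stays within a line costs 1 cycle, and a 4-byte word at a uniformly random byte offset crosses a 64-byte line in 3 of 64 cases. The function name and parameters below are hypothetical.

```c
#include <assert.h>

/* Break-even replay penalty, in cycles, under the assumptions above:
 * every access saves (sw_cycles - hw_cycles) when done in hardware,
 * and only the fraction of accesses that cross a line pays the replay. */
double breakeven_replay_cycles(unsigned line_bytes, unsigned access_bytes,
                               double sw_cycles, double hw_cycles)
{
    assert(access_bytes >= 2 && access_bytes <= line_bytes);

    /* An access of N bytes crosses a line for N-1 of the line_bytes offsets. */
    double crossing_fraction = (double)(access_bytes - 1) / line_bytes;

    return (sw_cycles - hw_cycles) / crossing_fraction;
}
```

With line_bytes = 64, access_bytes = 4, sw_cycles = 10 and hw_cycles = 1 this gives 9 / (3/64) = 192, matching the number quoted above; other assumptions about the offset distribution would give a different figure.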

Wilco
From:Paul A. Clayton
Subject:Re: RISC vs. CISC design principles
Date:13 Jan 2005 15:58:19 GMT
In article <1105559795.307333.292340@c13g2000cwb.googlegroups.com>,
MitchAlsup@aol.com wrote:

>"At current hardware budgets, the aligned memory access
>requirement is probably the least useful of the RISC mechanisms."
>
>I, respectfully, disagree.
>
>At current hardware budgets, the least useful RISC mechanism is the
>fixed length instruction format. Both Intel and AMD have shown that
>they/we can decode just as many instructions per unit time as the RISC
>guys.
>
>Consider a x86 machine like an Athlon (or P3 or P4). How much
>performance is sacrificed by having to decode multibyte instructions?
>Answer; with modern branch prediction, the added pipe stages extract a
>penalty around only 1%-2% compared to fixed length machines! Yet these
[snip]

I assume a 1% performance penalty for complex instructions is
greater than the performance penalty (if even positive) of supporting
unaligned memory operations in hardware. Is this incorrect?

>I could address the rest of the statements piecemeal, however, the
>general premiss is wrong. The evolution of x86 is proceeding faster
>than the evolution of other CPUs because of the amount of cubic dollars
>that can be thrown at teams of designers to solve yesterday's problems
>and develop tomorrow's monster machines. Cubic dollars beats
>architectural cleanliness every time.

I agree.

I should not have used practicality. My concern was with comparing
design principles given a clean slate for modern tradeoffs and trying
to understand the reasoning behind the design choices that generated
CISCs and RISCs in their historical context.


Paul A. Clayton
just a technophile, not a computer professional
From:Stephen Fuld
Subject:Re: RISC vs. CISC design principles
Date:Thu, 13 Jan 2005 17:34:04 GMT

"Paul A. Clayton" wrote in message
news:20050113105819.01270.00000034@mb-m23.aol.com...

snip

> I should not have used practicality. My concern was with comparing
> design principles given a clean slate for modern tradeoffs and trying
> to understand the reasoning behind the design choices that generated
> CISCs and RISCs in their historical context.

With regard to both variable length instructions and unaligned storage
operations, you have to go back in history to the original RISC era. The
idea was that you gained so much by eliminating the off chip connection
delay in favor of everything within one chip that the single chip
requirement pretty much dominated everything else. Now look at the number
of transistors one could get on a single chip at that time. That dictated
eliminating a lot of features that might otherwise be desirable. So all
instructions being the same length saved a lot of transistors (and speeded
decoding) and that was a much bigger issue then than it is now. That is why
you see the multi-length instruction sets added in some RISC chips, and the
minimal cost of pretty much full generality of X-86 being quite fast.

With regard to unaligned memops, I think it is useful to divide them into
two cases. The first is where the entire operation is contained within one
cache line/page. These will be much more frequent and probably are easier
to make fast. The other is where a cache line or even a page boundary is
crossed, which are much less frequent, and of course have to be done
correctly, but are less important to be fast. Note that as cache lines get
larger, the first case becomes more frequent. And since (to go back to your
question), the first RISC chips had no on-chip cache, all cases had to go
directly to memory, and unaligned accesses were much costlier (both in
terms of time and the then all important transistor count).
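
To make the two cases concrete, here is a small C sketch (an illustration, not something from the original post) that classifies an access by whether it is aligned, unaligned but contained in one cache line, or straddling a line or page boundary. The 64-byte line and 4096-byte page sizes, the enum, and the function name are all assumptions chosen for the example.

```c
#include <assert.h>
#include <stdint.h>

enum access_kind {
    ACCESS_ALIGNED,        /* naturally aligned */
    ACCESS_WITHIN_LINE,    /* unaligned, but inside one cache line: the common case */
    ACCESS_CROSSES_LINE,   /* straddles two cache lines in the same page */
    ACCESS_CROSSES_PAGE    /* straddles two pages: rare, must still be correct */
};

enum access_kind classify_access(uint64_t addr, unsigned size)
{
    const uint64_t line = 64;     /* assumed cache line size in bytes */
    const uint64_t page = 4096;   /* assumed page size in bytes */

    /* size must be a power of two no larger than a line */
    assert(size >= 1 && size <= line && (size & (size - 1)) == 0);

    uint64_t last = addr + size - 1;   /* address of the last byte touched */

    if (addr / page != last / page)
        return ACCESS_CROSSES_PAGE;
    if (addr / line != last / line)
        return ACCESS_CROSSES_LINE;
    if (addr % size != 0)
        return ACCESS_WITHIN_LINE;
    return ACCESS_ALIGNED;
}
```

Doubling the assumed line size halves the number of offsets that land in ACCESS_CROSSES_LINE, which is the "larger lines make the first case more frequent" observation above.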

I suspect if some architect demanded unaligned access support in a
hypothetical new chip or new version of an existing chip that doesn't now
have it, the hardware guys would grumble a lot and then do a good job of
making it fast in the common cases and correct in all cases. But I would
appreciate comments from people who know more about that aspect of things
than I do.

--
- Stephen Fuld
e-mail address disguised to prevent spam
From:Kai Harrekilde-Petersen
Subject:Re: RISC vs. CISC design principles
Date:Sun, 16 Jan 2005 23:50:44 +0100
nmm1@cus.cam.ac.uk (Nick Maclaren) writes:

> In article ,
> Andi Kleen wrote:
>>nmm1@cus.cam.ac.uk (Nick Maclaren) writes:
>>>
>>> In my career, I have never seen a significant use of it except to
>>> cover up misdesigned interfaces - in particular, ones that have
>>> failed to take the decision whether they are based on semi-abstract
>>> types like integers and floating-point or on precisely specified
>>> bit patterns.
>>
>>It's useful to process IPv4 packets. On an aligned Ethernet packet the
>>TCP header ends up being unaligned. Same is true for other protocols.
>
> That is precisely what I am describing as a misdesigned protocol.

Are you poking at the 14 byte Ethernet header or the IPv4 header here?
- I thought the IPv4 header was quite well-laid out, with everything
aligned to natural boundaries.

Regards,

Kai
--
Kai Harrekilde-Petersen
From:Nick Maclaren
Subject:Re: RISC vs. CISC design principles
Date:12 Jan 2005 20:32:50 GMT
In article <1105559795.307333.292340@c13g2000cwb.googlegroups.com>,
wrote:
>"At current hardware budgets, the aligned memory access
>requirement is probably the least useful of the RISC mechanisms."
>
>I, respectfully, disagree.
>
>At current hardware budgets, the least useful RISC mechanism is the
>fixed length instruction format. Both Intel and AMD have shown that
>they/we can decode just as many instructions per unit time as the RISC
>guys.

Yes. But let's ignore that and go back to the alignment issue.
Speaking as a software engineer from way back:

"Allowing unaligned memory access is probably the least useful
of common CISC features."

In my career, I have never seen a significant use of it except to
cover up misdesigned interfaces - in particular, ones that have
failed to take the decision whether they are based on semi-abstract
types like integers and floating-point or on precisely specified
bit patterns.

The point is that the former have no trouble with padding being
inserted to create alignment, and the latter are uniformly better
done by the use of packing and unpacking primitives because there
are almost certainly other things to fix up than alignment (e.g.
endianness).
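
As an illustration of the kind of packing and unpacking primitive meant
here, a minimal C sketch follows. The names load_be32 and store_be32 are
hypothetical, and big-endian wire order is an assumption for the
example. Because the functions touch only single bytes, they work at any
alignment and on either host endianness.

```c
#include <stdint.h>

/* Read a 32-bit big-endian field starting at p; p may have any alignment. */
static uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) |
           ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] <<  8) |
            (uint32_t)p[3];
}

/* Write v as a 32-bit big-endian field starting at p. */
static void store_be32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >>  8);
    p[3] = (unsigned char)v;
}
```

The same pair handles the endianness fix-up and the alignment problem in
one place, which is the argument for preferring it over a direct
(possibly unaligned) integer load.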


Regards,
Nick Maclaren.
From:Terje Mathisen
Subject:Re: RISC vs. CISC design principles
Date:Thu, 13 Jan 2005 08:29:41 +0100
Nick Maclaren wrote:

> In article <1105559795.307333.292340@c13g2000cwb.googlegroups.com>,
> wrote:
>>At current hardware budgets, the least useful RISC mechanism is the
>>fixed length instruction format. Both Intel and AMD have shown that
>>they/we can decode just as many instructions per unit time as the RISC
>>guys.
>
>
> Yes. But let's ignore that and go back to the alignment issue.
> Speaking as a software engineer from way back:
>
> "Allowing unaligned memory access is probably the least useful
> of common CISC features."
>
> In my career, I have never seen a significant use of it except to
> cover up misdesigned interfaces - in particular, ones that have
> failed to take the decision whether they are based on semi-abstract
> types like integers and floating-point or on precisely specified
> bit patterns.

A recent post in c.l.a.x86 made me go back to C string.h functions, as
well as the *BSD-inspired strl*() replacements.

Efficient handling of C strings pretty much requires you to process a
full register's worth of data (usually 4 or 8 chars), while you cannot
depend on either the source or destination to be properly aligned, right?

Besides alignment, another key problem is that the terminating zero
byte in the source string could well be the last byte in a memory block,
meaning that any access past this point will cause a trap.

Handling both of these at the same time pretty much requires either
unaligned load and/or store operations, together with the capability to
do non-trapping (speculative) load operations past the end of allocated
memory, or you need to re-invent the Alpha:

I.e. load operations that disregard the bottommost (alignment) bits,
together with fast shift/mask/merge operations based on those same bits,
so that you can synthesize unaligned operations this way.
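
A rough C sketch of that shift/merge idea, loosely modelled on the Alpha
approach of aligned loads followed by extract-and-merge steps. The
function name is hypothetical, and little-endian byte numbering (byte 0
is the least significant byte of word 0) is an assumption of the example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the 8 bytes that start at byte offset `byte_off` within an array
 * of aligned 64-bit words, using only aligned word accesses.
 */
static uint64_t load64_from_aligned(const uint64_t *words, size_t byte_off)
{
    size_t   idx   = byte_off / 8;                /* ignore the low 3 bits */
    unsigned shift = (unsigned)(byte_off % 8) * 8;
    uint64_t lo    = words[idx];

    if (shift == 0)
        return lo;                                /* already aligned */

    /* The second word is touched only when the access really straddles. */
    uint64_t hi = words[idx + 1];
    return (lo >> shift) | (hi << (64 - shift));
}
```

Note that the aligned case never reads the following word, so it cannot
trap on a word the unaligned access would not have needed anyway.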

> The point is that the former have no trouble with padding being
> inserted to create alignment, and the latter are uniformly better
> done by the use of packing and unpacking primitives because there
> are almost certainly other things to fix up than alignment (e.g.
> endianness).

Even these are much better off if you can specify them in such a way as
to allow the compiler to generate optimal code, i.e. not just a set of
byte load/shift/merge operations.

Terje

--
-
"almost all programming can be viewed as an exercise in caching"
From:Nick Maclaren
Subject:Re: RISC vs. CISC design principles
Date:13 Jan 2005 11:02:46 GMT

In article ,
Terje Mathisen writes:
|>
|> A recent post in c.l.a.x86 made me go back to C string.h functions, as
|> well as the *BSD-inspired strl*() replacements.
|>
|> Efficient handling of C strings pretty much requires you to process a
|> full register's worth of data (usually 4 or 8 chars), while you cannot
|> depend on either the source or destination to be properly aligned, right?

Well, there are other approaches, but I agree that is one of the
few that works on most current hardware.

|> Besides alignment, another key problem is that the terminating zero
|> byte in the source string could well be the last byte in a memory block,
|> meaning that any access past this point will cause a trap.
|>
|> Handling both of these at the same time pretty much requires either
|> unaligned load and/or store operations, together with the capability to
|> do non-trapping (speculative) load operations past the end of allocated
|> memory, or you need to re-invent the Alpha:

Well, no, it doesn't. Look at BSD libraries for examples of how
you can (semi-portably) use only aligned loads and stores.

Also, you are asking for MUCH more than just unaligned operations.
For example, you are relying on three features that are generally
not the case:

1) EFFICIENT unaligned loads and stores, whether in terms of
cycle counts, cache use or TLB use.

2) The non-trapping aspects you mentioned, which can be very
important indeed.

3) No error detection in such operations. This is less obvious,
but I have almost never seen code that operates that way (including
relying on non-trapping aspects) AND correctly traps when using a
genuinely invalid location.


Regards,
Nick Maclaren.
From:Terje Mathisen
Subject:Re: RISC vs. CISC design principles
Date:Thu, 13 Jan 2005 15:40:04 +0100
Nick Maclaren wrote:

> In article ,
> Terje Mathisen writes:
> |>
> |> A recent post in c.l.a.x86 made me go back to C string.h functions, as
> |> well as the *BSD-inspired strl*() replacements.
> |>
> |> Efficient handling of C strings pretty much requires you to process a
> |> full register's worth of data (usually 4 or 8 chars), while you cannot
> |> depend on either the source or destination to be properly aligned, right?
>
> Well, there are other approaches, but I agree that is one of the
> few that works on most current hardware.

My statement was re. requirements, not implementations. :-)

The proper way to handle this on a CISC is to have REP SCASB/REP MOVSB
opcodes that actually do the right thing with the hw, in all versions of
the architecture and in all combinations of memory types.

However, according to many notes from Andy Glew, this isn't very likely
to happen. :-(

> |> Besides alignment, another key problem is that the terminating zero
> |> byte in the source string could well be the last byte in a memory block,
> |> meaning that any access past this point will cause a trap.
> |>
> |> Handling both of these at the same time pretty much requires either
> |> unaligned load and/or store operations, together with the capability to
> |> do non-trapping (speculative) load operations past the end of allocated
> |> memory, or you need to re-invent the Alpha:
>
> Well, no, it doesn't. Look at BSD libraries for examples of how
> you can (semi-portably) use only aligned loads and stores.

You can obviously do it quite portably, even endian-independently, by
just being able to assume that a register will hold a power of two
number of 8-bit characters. Doing the same with 36 or 60-bit register
sizes is slightly harder, unless you're allowed some init code to detect
the current environment. :-)

strlen() is easy: Load single bytes until alignment is reached, then
process (safely, since a memory block cannot end in the middle of an
aligned word!) full words until one is found that contains at least one
zero byte.

At this point you switch back to reloading and checking each character,
or if you could set up an array of masks at startup, just check the
current word against those masks. (Due to cache-misses, the first option
might well be the faster one.)
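
The strlen() recipe above might look roughly like this in C. This is a
sketch under stated assumptions, not the BSD code: the names are
hypothetical, 8-bit chars and a 64-bit word are assumed, and reading the
rest of an aligned word beyond the terminator relies on the hardware
property described above rather than on anything ISO C guarantees.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL

/* Non-zero iff at least one of the eight bytes in w is zero. */
static int has_zero_byte(uint64_t w)
{
    return ((w - ONES) & ~w & HIGHS) != 0;
}

static size_t word_strlen(const char *s)
{
    const char *p = s;

    /* 1. Single bytes until p is word-aligned. */
    while ((uintptr_t)p % sizeof(uint64_t) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* 2. Aligned words until one contains a zero byte.  An aligned word
     *    cannot straddle a page, so this load does not trap even if the
     *    string ends inside the word. */
    for (;;) {
        uint64_t w;
        memcpy(&w, p, sizeof w);    /* aligned load; memcpy avoids aliasing issues */
        if (has_zero_byte(w))
            break;
        p += sizeof w;
    }

    /* 3. Recheck that last word byte by byte to find the terminator. */
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

Step 3 is the "reload and check each character" option; the mask-table
variant would replace that final loop.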

The copying operations (strcpy, strlcpy, strlcat, strncpy etc) are
harder because you want to use aligned accesses for both source and
destination, which means that you _must_ do some form of
shift/mask/merge to convert from source to destination alignment, and
this cannot be done both portably and efficiently without introducing
some level of endian-dependent coding.
>
> Also, you are asking for MUCH more than just unaligned operations.
> For example, you are relying on three features that are generally
> not the case:
>
> 1) EFFICIENT unaligned loads and stores, whether in terms of
> cycle counts, cache use or TLB use.

Unaligned load operations have always been very efficient on x86, as
long as the load didn't straddle a cache line boundary. I.e. the
effective overhead of reading a stream this way is _much_ lower than the
cost of shifting a set of aligned loads!

> 2) The non-trapping aspects you mentioned, which can be very
> important indeed.

They help a lot by allowing the last load to straddle the end of the
buffer, yeah.
>
> 3) No error detection in such operations. This is less obvious,
> but I have almost never seen code that operates that way (including
> relying on non-trapping aspects) AND correctly traps when using a
> genuinely invalid location.

I've seen it done, by having special non-trapping load operations.

This will work as long as the input was valid, i.e. a terminating zero
was actually found.

The faster solution is to be able to have a user-level trap of such a
load, and turn it into a load of zeroes. That way you can safely load a
few words past the end of the input (for unrolling), while still never
writing beyond the terminating zero of the output.

Terje

--
-
"almost all programming can be viewed as an exercise in caching"
From:Nick Maclaren
Subject:Re: RISC vs. CISC design principles
Date:13 Jan 2005 15:17:55 GMT

In article ,
Terje Mathisen writes:
|>
|> strlen() is easy: Load single bytes until alignment is reached, then
|> process (safely, since a memory block cannot end in the middle of an
|> aligned word!) full words until one is found that contains at least one
|> zero byte.
|>
|> At this point you switch back to reloading and checking each character,
|> or if you could set up an array of masks at startup, just check the
|> current word against those masks. (Due to cache-misses, the first option
|> might well be the faster one.)

That's what the BSD code did when I looked at it.

|> > 3) No error detection in such operations. This is less obvious,
|> > but I have almost never seen code that operates that way (including
|> > relying on non-trapping aspects) AND correctly traps when using a
|> > genuinely invalid location.
|>
|> I've seen it done, by having special non-trapping load operations.
|>
|> This will work as long as the input was valid, i.e. a terminating zero
|> was actually found.

But did those correctly diagnose the error if the input was NOT
valid? That is what I meant.


Regards,
Nick Maclaren.
From:Terje Mathisen
Subject:Re: RISC vs. CISC design principles
Date:Thu, 13 Jan 2005 21:51:39 +0100
Nick Maclaren wrote:

> In article ,
> Terje Mathisen writes:
> That's what the BSD code did when I looked at it.

Not too surprising, it is the obvious solution. :-)

> |> This will work as long as the input was valid, i.e. a terminating zero
> |> was actually found.
>
> But did those correctly diagnose the error if the input was NOT
> valid? That is what I meant.

To get correct behaviour, you'll have to reload the last (aligned!)
word, using regular (trapping) operations.

This is actually similar to the way you can rewrite Java programs to
only require tests at buffer ends, split the code path, and then add a
known-to-trap load in the case where that should have happened. :-)

Terje

--
-
"almost all programming can be viewed as an exercise in caching"
From:Stephen Fuld
Subject:Re: RISC vs. CISC design principles
Date:Wed, 12 Jan 2005 22:01:33 GMT

"Nick Maclaren" wrote in message
news:cs41hi$62t$1@gemini.csx.cam.ac.uk...

snip

> Yes. But let's ignore that and go back to the alignment issue.
> Speaking as a software engineer from way back:
>
> "Allowing unaligned memory access is probably the least useful
> of common CISC features."
>
> In my career, I have never seen a significant use of it except to
> cover up misdesigned interfaces - in particular, ones that have
> failed to take the decision whether they are based on semi-abstract
> types like integers and floating-point or on precisely specified
> bit patterns.

While I don't doubt that is true, it is perhaps so due to your specializing
in HPC and not say business data processing. Think COBOL, reports where the
aesthetics of the output are more important than alignment considerations,
dealing with arbitrary input files, etc.

--
- Stephen Fuld
e-mail address disguised to prevent spam
From:Nick Maclaren
Subject:Re: RISC vs. CISC design principles
Date:12 Jan 2005 23:25:45 GMT
In article <17hFd.8944$7N1.38@bgtnsc04-news.ops.worldnet.att.net>,
Stephen Fuld wrote:
>
>> Yes. But let's ignore that and go back to the alignment issue.
>> Speaking as a software engineer from way back:
>>
>> "Allowing unaligned memory access is probably the least useful
>> of common CISC features."
>>
>> In my career, I have never seen a significant use of it except to
>> cover up misdesigned interfaces - in particular, ones that have
>> failed to take the decision whether they are based on semi-abstract
>> types like integers and floating-point or on precisely specified
>> bit patterns.
>
>While I don't doubt that is true, it is perhaps so due to your specializing
>in HPC and not say business data processing. Think COBOL, reports where the
>aesthetics of the output are more important than alignment considerations,
>dealing with arbitrary input files, etc.

Where did you get the idea from that I specialised in HPC for most
of my career? I can assure you that is not so.

Firstly, formatted I/O is irrelevant, as that is always treated as
characters on modern machines.

Secondly, the paragraph that you snipped explains why all portable
programs (and most correct ones) use packing and unpacking primitives
when dealing with arbitrary (binary) input files.

Please note that I have written serious code to convert MVS SL tapes
using most BSAM/QSAM formats to Unix tar files and MS-DOS and MacOS
ZIP files. And vice versa. Plus a good many related tasks, including
reading M-bit data on N-bit systems. That is about as 'commercial'
an application as you get :-)

And I have always been very much into producing aesthetic output,
not least because properly aligned tables are a damn sight easier to
check than unaligned ones, and I spent more years in the statistical
area than the HPC one.

No, sorry. I was posting more with a 'commercial' hat on than an
HPC one.


Regards,
Nick Maclaren.
From:Stephen Fuld
Subject:Re: RISC vs. CISC design principles
Date:Thu, 13 Jan 2005 17:03:46 GMT

"Nick Maclaren" wrote in message
news:cs4blp$pvo$1@gemini.csx.cam.ac.uk...
> In article <17hFd.8944$7N1.38@bgtnsc04-news.ops.worldnet.att.net>,
> Stephen Fuld wrote:
>>
>>> Yes. But let's ignore that and go back to the alignment issue.
>>> Speaking as a software engineer from way back:
>>>
>>> "Allowing unaligned memory access is probably the least useful
>>> of common CISC features."
>>>
>>> In my career, I have never seen a significant use of it except to
>>> cover up misdesigned interfaces - in particular, ones that have
>>> failed to take the decision whether they are based on semi-abstract
>>> types like integers and floating-point or on precisely specified
>>> bit patterns.
>>
>>While I don't doubt that is true, it is perhaps so due to your
>>specializing
>>in HPC and not say business data processing. Think COBOL, reports where
>>the
>>aesthetics of the output are more important than alignment
>>considerations,
>>dealing with arbitrary input files, etc.
>
> Where did you get the idea from that I specialised in HPC for most
> of my career? I can assure you that is not so.

I'm sorry for the mistake. Obviously, I only know you from your posts here
and I inferred (apparently incorrectly) that most of your experience was
with HPC.

>
> Firstly, formatted I/O is irrelevant, as that is always treated as
> characters on modern machines.

Yes, but business data processing seems to do more of it than say HPC.

> Secondly, the paragraph that you snipped explains why all portable
> programs (and most correct ones) use packing and unpacking primitives
> when dealing with arbitrary (binary) input files.

But don't these primitives benefit from being able to handle unaligned data
efficiently?

--
- Stephen Fuld
e-mail address disguised to prevent spam
From:Nick Maclaren
Subject:Re: RISC vs. CISC design principles
Date:13 Jan 2005 17:49:49 GMT
In article ,
Stephen Fuld wrote:
>
>> Secondly, the paragraph that you snipped explains why all portable
>> programs (and most correct ones) use packing and unpacking primitives
>> when dealing with arbitrary (binary) input files.
>
>But don't these primitives benefit from being able to handle unaligned data
>efficiently?

Yes and no. Because of the endian and other problems I mentioned,
there is little point in accessing the data DIRECTLY - macros or
functions are always a better solution. And the difference in
efficiency between using (say) unaligned integer loads and loading
a character at a time is usually small.


Regards,
Nick Maclaren.
From:Maynard Handley
Subject:Unaligned accesses (was Re: RISC vs. CISC design principles)
Date:Thu, 13 Jan 2005 01:46:34 GMT
In article ,
nmm1@cus.cam.ac.uk (Nick Maclaren) wrote:

> In article <1105559795.307333.292340@c13g2000cwb.googlegroups.com>,
> wrote:
> >"At current hardware budgets, the aligned memory access
> >requirement is probably the least useful of the RISC mechanisms."
> >
> >I, respectfully, disagree.
> >
> >At current hardware budgets, the least useful RISC mechanism is the
> >fixed length instruction format. Both Intel and AMD have shown that
> >they/we can decode just as many instructions per unit time as the RISC
> >guys.
>
> Yes. But let's ignore that and go back to the alignment issue.
> Speaking as a software engineer from way back:
>
> "Allowing unaligned memory access is probably the least useful
> of common CISC features."
>
> In my career, I have never seen a significant use of it except to
> cover up misdesigned interfaces - in particular, ones that have
> failed to take the decision whether they are based on semi-abstract
> types like integers and floating-point or on precisely specified
> bit patterns.
>
> The point is that the former have no trouble with padding being
> inserted to create alignment, and the latter are uniformly better
> done by the use of packing and unpacking primitives because there
> are almost certainly other things to fix up than alignment (e.g.
> endianness).
>

You obviously have never programmed AltiVec, have you, Nick?

While I understand why AltiVec does not allow for unaligned accesses,
and accept that it may well have been and continue to be the correct
tradeoff, the fact is that it is a pain to deal with. And, Nick, please
don't give me any BS about how properly designed code would not require
this. If you've no experience with either AltiVec programming or modern
day audio and video compression algorithms, you're not in a position to
make this claim.

On the other hand, regarding unaligned instructions; is density of
instructions (either inability to load them fast enough, or capacity of
I1$ or I TLB) both really a big deal AND only about a factor of 1.5 off
from ideal, meaning that unaligned instructions are worthwhile? The
window for codes that meet both these requirements strikes me as pretty
small, and I'd have to see some real evidence that the costs of I1$
misses (high but infrequent) are larger than the costs of an extra few
cycles on branch misses (fewer cycles but frequent), not to mention the
extra power and associated issues.

Maynard
From:Andrew Reilly
Subject:Re: Unaligned accesses (was Re: RISC vs. CISC design principles)
Date:Thu, 13 Jan 2005 14:00:05 +1100
On Thu, 13 Jan 2005 02:46:34 +0000, Maynard Handley wrote:

> In article ,
> nmm1@cus.cam.ac.uk (Nick Maclaren) wrote:
>
>> In article <1105559795.307333.292340@c13g2000cwb.googlegroups.com>,
>> wrote:
>> >"At current hardware budgets, the aligned memory access
>> >requirement is probably the least useful of the RISC mechanisms."
>> >
>> >I, respectfully, disagree.
>> >
>> >At current hardware budgets, the least useful RISC mechanism is the
>> >fixed length instruction format. Both Intel and AMD have shown that
>> >they/we can decode just as many instructions per unit time as the RISC
>> >guys.
>>
>> Yes. But let's ignore that and go back to the alignment issue.
>> Speaking as a software engineer from way back:
>>
>> "Allowing unaligned memory access is probably the least useful
>> of common CISC features."
>>
>> In my career, I have never seen a significant use of it except to
>> cover up misdesigned interfaces - in particular, ones that have
>> failed to take the decision whether they are based on semi-abstract
>> types like integers and floating-point or on precisely specified
>> bit patterns.
>>
>> The point is that the former have no trouble with padding being
>> inserted to create alignment, and the latter are uniformly better
>> done by the use of packing and unpacking primitives because there
>> are almost certainly other things to fix up than alignment (e.g.
>> endianness).
>>
>
> You obviously have never programmed AltiVec, have you, Nick?

What's that got to do with anything? (I haven't programmed AltiVec,
per-se, myself. If a compiler has done it on my behalf, good on it. If a
compiler hasn't been able to do it on my behalf, then perhaps that says
something about the architecture of AltiVec.)

> While I understand why AltiVec does not allow for unaligned accesses,
> and accept that it may well have been and continue to be the correct
> tradeoff, the fact is that it is a pain to deal with. And, Nick, please
> don't give me any BS about how properly designed code would not require
> this. If you've no experience with either AltiVec programming or modern
> day audio and video compression algorithms, you're not in a position to
> make this claim.

I would say that modern-day audio and video compression standards are a
good example of file (and communication) formats done *well*, by Nick's
standards, as they are universally (in my experience) defined in terms of
packed bit-strings, rather than fwrite(c-struct) /* and-hope-it-ports-ok,
later */, which was what Nick was complaining about (I believe).

At an audio *algorithm* level, rather than file format level, I've never
encountered anything that would enforce or encourage unaligned floating
point accesses, which is just as well, since most of the DSPs I code for
are still word-addressed.

--
Andrew
From:Maynard Handley
Subject:Re: Unaligned accesses (was Re: RISC vs. CISC design principles)
Date:Fri, 14 Jan 2005 02:45:16 GMT
In article ,
Andrew Reilly wrote:

> On Thu, 13 Jan 2005 02:46:34 +0000, Maynard Handley wrote:
>
> > In article ,
> > nmm1@cus.cam.ac.uk (Nick Maclaren) wrote:
> >
> >> In article <1105559795.307333.292340@c13g2000cwb.googlegroups.com>,
> >> wrote:
> >> >"At current hardware budgets, the aligned memory access
> >> >requirement is probably the least useful of the RISC mechanisms."

> >> In my career, I have never seen a significant use of it except to
> >> cover up misdesigned interfaces - in particular, ones that have
> >> failed to take the decision whether they are based on semi-abstract
> >> types like integers and floating-point or on precisely specified
> >> bit patterns.
> >>
> >
> > You obviously have never programmed AltiVec, have you, Nick?
>
> What's that got to do with anything? (I haven't programmed AltiVec,
> per-se, myself. If a compiler has done it on my behalf, good on it. If a
> compiler hasn't been able to do it on my behalf, then perhaps that says
> something about the architecture of AltiVec.)

So let's review.
Nick says "unaligned memory access is not very useful".
I say that it sodding well is useful, it's a shame (though very
understandable) that AltiVec does not provide it, and that ten years of
experience working with modern codecs has shown me many many situations
where material is NOT "naturally" aligned.

Your response to that, parroted by Nick, is
"I've never actually used AltiVec (but you're wrong anyway), and by the
way modern codecs do a fine job of describing how the bit stream is
packed".
Excuse me for barfing at the sheer pointlessness of this reply, since
the packedness of the material in the bitstream has ZERO to do with the
issue of how well it is adapted to naturally aligned packing. Heck, even
the most clueless undergrad should know that the first stage in decoding
data (or the last stage in encoding data) consists of bit-parsing and
twiddling to handle the entropy coding, usually followed by a table
lookup. It's only at that point that you handle modelling (transforms,
motion comp and so on) which is where something like AltiVec is useful.

My whole point was that the specific nature of these codecs (for example
the way that H264 breaks the image up into variable sized blocks which
can be as small as 4x4) means that however you slice and dice the
problem (and you have complete control over the memory structures ---
these are all internal) you're going to spend a lot of your time wanting
to load vectors that are not aligned to a multiple of 16.

If Nick wants to say that unaligned memory access is not useful for his
little corner of the world, a corner that does not deal with
multi-media, that's fine. But Nick, as is his way, is very fond of
making grandiose claims for the entire freaking computer universe.

(More about audio algorithms below.)

> > While I understand why AltiVec does not allow for unaligned accesses,
> > and accept that it may well have been and continue to be the correct
> > tradeoff, the fact is that it is a pain to deal with. And, Nick, please
> > don't give me any BS about how properly designed code would not require
> > this. If you've no experience with either AltiVec programming or modern
> > day audio and video compression algorithms, you're not in a position to
> > make this claim.
>
> I would say that modern-day audio and video compression standards are a
> good example of file (and communication) formats done *well*, by Nick's
> standards, as they are universally (in my experience) defined in terms of
> packed bit-strings, rather than fwrite(c-struct) /* and-hope-it-ports-ok,
> later */, which was what Nick was complaining about (I believe).
>
> At an audio *algorithm* level, rather than file format level, I've never
> encountered anything that would enforce or encourage unaligned floating
> point accesses, which is just as well, since most of the DSPs I code for
> are still word-addressed.

So, for example, if one is dealing with, say, MPEG audio, one is faced
with the problem of computing the convolution at pretty much the last
stage of the algorithm, using an index that increments by one each
iteration --- meaning that 3 times out of 4 the data one wants to load
is not naturally aligned with AltiVec 16-byte wide (ie 4 fp wide)
registers.
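
The "3 times out of 4" figure can be checked with a few lines of C. This
is only an arithmetic sketch: the function names are hypothetical, the
array base is assumed to be 16-byte aligned, and float is assumed to be
4 bytes.

```c
#include <stddef.h>

#define VEC_BYTES 16u   /* AltiVec register width, as stated above */

/* True if a vector load starting at float index i of a 16-byte-aligned
 * array begins on a 16-byte boundary (assumes 4-byte float). */
static int start_is_aligned(size_t i)
{
    return (i * sizeof(float)) % VEC_BYTES == 0;
}

/* Count the misaligned window starts among `count` consecutive indices. */
static size_t unaligned_starts(size_t first, size_t count)
{
    size_t n = 0;
    for (size_t i = first; i < first + count; i++)
        if (!start_is_aligned(i))
            n++;
    return n;
}
```

With the index stepping by one, only every fourth start lands on a
16-byte boundary, so 6 of any 8 consecutive starts are misaligned.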

Maynard
From:Andrew Reilly
Subject:Re: Unaligned accesses (was Re: RISC vs. CISC design principles)
Date:Fri, 14 Jan 2005 17:00:23 +1100
On Fri, 14 Jan 2005 03:45:16 +0000, Maynard Handley wrote:

> So let's review.
> Nick says "unaligned memory access is not very useful".

And the context was "RISC vs CISC", and (to me) the unalignedness was in
terms of individual words of whatever sort. Natural alignment of data
types. That's not a really big restriction, and I'll stick by saying that
it's no biggie.

> I say that it sodding well is useful, it's a shame (though very
> understandable) that AltiVec does not provide it, and that ten years of
> experience working with modern codecs has shown me many many situations
> where material is NOT "naturally" aligned.

And it seems, now, that you've leapt in and said that because AltiVec
requires alignment not just on floating point boundaries but on entire
fixed-length vectors of them, that "unaligned access" (without further
restriction) is necessary. Of course. However understandable (your
words), that does seem to be a pretty crippling deficiency of AltiVec,
particularly for nearly all of the audio signal processing algorithms that
I can think of. How "RISC" is AltiVec if compilers can't use it to help
speed up existing algorithms and existing code? Is it RISC just because
it has a monumental alignment restriction?

All I can say to that argument is that it's a pretty daft extension to the
notion of "natural alignment", particularly if the object of the exercise
is to be able to compute existing numeric algorithms efficiently, rather
than just being able to claim the best peak flops numbers.

> Your response to that, parroted by Nick, is
> "I've never actually used AltiVec (but you're wrong anyway), and by the
> way modern codecs do a fine job of describing how the bit stream is
> packed".
> Excuse me for barfing at the sheer pointlessness of this reply, since
> the packedness of the material in the bitstream has ZERO to do with the
> issue of how well it is adapted to naturally aligned packing.

Try reading the thread again, after your barf. The issue being
responded-to was "unaligned" values occurring in popular (but perhaps
poorly or unfortunately specced) file and wire formats. The sentence
above makes no sense at all in the context of the discussion.

> Heck, even
> the most clueless undergrad should know that the first stage in decoding
> data (or the last stage in encoding data) consists of bit-parsing and
> twiddling to handle the entropy coding, usually followed by a table
> lookup.

Yup. Nicely defined, and access not susceptible to endianness or
alignment issues. Not like many disk file and network protocols, which
are pretty much defined as fwrite(some_C_struct, sizeof(*some_C_struct),
1, desc), on some specific computer system, to the eventual
annoyance of anyone using a system with different alignment/ endianness/
compiler struct padding / compiler switches/ etc.

> It's only at that point that you handle modelling (transforms, motion
> comp and so on) which is where something like AltiVec is useful.

You brought AltiVec up. Hadn't been mentioned before in the thread. We
*had* been discussing file and wire formats and alignment issues, though.

> My whole point was that the specific nature of these codecs (for example
> the way that H264 breaks the image up into variable sized blocks which
> can be as small as 4x4) means that however you slice and dice the
> problem (and you have complete control over the memory structures ---
> these are all internal) you're going to spend a lot of your time wanting
> to load vectors that are not aligned to a multiple of 16.

Yup. That's how maths works. You don't, however, ever need to read any
of those individual floating point numbers from non-aligned addresses.

> If Nick wants to say that unaligned memory access is not useful for his
> little corner of the world, a corner that does not deal with
> multi-media, that's fine. But Nick, as is his way, is very fond of
> making grandiose claims for the entire freaking computer universe.
>
> (More about audio algorithms below.)
>
>> > While I understand why AltiVec does not allow for unaligned accesses,
>> > and accept that it may well have been and continue to be the correct
>> > tradeoff, the fact is that it is a pain to deal with. And, Nick,
>> > please don't give me any BS about how properly designed code would
>> > not require this. If you've no experience with either AltiVec
>> > programming or modern day audio and video compression algorithms,
>> > you're not in a position to make this claim.
>>
>> I would say that modern-day audio and video compression standards are a
>> good example of file (and communication) formats done *well*, by Nick's
>> standards, as they are universally (in my experience) defined in terms
>> of packed bit-strings, rather than fwrite(c-struct) /*
>> and-hope-it-ports-ok, later */, which was what Nick was complaining
>> about (I believe).
>>
>> At an audio *algorithm* level, rather than file format level, I've
>> never encountered anything that would enforce or encourage unaligned
>> floating point accesses, which is just as well, since most of the DSPs
>> I code for are still word-addressed.
>
> So, for example, if one is dealing with, say, MPEG audio, one is faced
> with the problem of computing the convolution at pretty much the last
> stage of the algorithm, using an index that increments by one each
> iteration --- meaning that 3 times out of 4 the data one wants to load
> is not naturally aligned with AltiVec 16-byte wide (ie 4 fp wide)
> registers.

Well, that sucks. Doesn't AltiVec have permutation operations to at least
help with that sort of thing?

Is there no scope for doing the loop-order inversion trick, so that the
words in your altivec vectors are successive bins, and the shifting-order
index is over blocks of bins? That tends to need more memory bandwidth
than the in-register accumulator approach, but maybe machines with AltiVec
have such bandwidth (in cache, anyway)?

I'd just note that AltiVec and its restrictions don't by any means define
the universe of multimedia and audio implementation strategies. Lots of
that still takes place on DSPs and other embedded processors that work
just fine one word at a time.

--
Andrew
From:Christian Bau
Subject:Re: Unaligned accesses (was Re: RISC vs. CISC design principles)
Date:Fri, 14 Jan 2005 08:51:17 +0000
In article ,
Andrew Reilly wrote:

> Well, that sucks. Doesn't AltiVec have permutation operations to at least
> help with that sort of thing?

Obviously "aligned" vs. "unaligned" is always in terms of what you are
trying to process. If you try to process a single floating point number,
then four byte alignment = aligned, anything else = unaligned. If you
try to process vectors of four floating point numbers, then sixteen
bytes = aligned, anything else = unaligned. Especially floating-point
aligned != vector aligned.
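
The distinction can be illustrated with a small sketch. The helper name is_aligned is hypothetical, and the sketch assumes 4-byte floats and 16-byte vectors as described in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns nonzero if address p is a multiple of n bytes (n must be > 0). */
int is_aligned(const void *p, size_t n)
{
    assert(n > 0);
    return ((uintptr_t)p % n) == 0;
}
```

Given a float array whose first element sits on a 16-byte boundary, every element passes is_aligned(p, 4), but only every fourth element passes is_aligned(p, 16). So a loop that advances one float per iteration starts three out of four of its vector loads at an address that is float-aligned yet not vector-aligned.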

And obviously Altivec has permutation operations, there are all kinds of
tricks that you can use to make things faster (let's just say it smokes
anything that is on any Intel processor). That doesn't change the fact
that without alignment restrictions, some of these tricks wouldn't be
needed.
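
As a rough illustration of the kind of trick meant here, the following portable C sketch emulates fetching four consecutive floats from a non-vector-aligned index by doing two aligned block loads and then selecting lanes from the pair. On real AltiVec the same idea is usually expressed with the vec_ld, vec_lvsl and vec_perm intrinsics; the vec4f type and the function names below are stand-ins invented for this sketch, not AltiVec API.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a 16-byte vector register holding four floats. */
typedef struct { float lane[4]; } vec4f;

/* Emulates an aligned vector load of block number `block` (4 floats each). */
static vec4f load_aligned_block(const float *base, size_t block)
{
    vec4f v;
    for (size_t i = 0; i < 4; i++)
        v.lane[i] = base[block * 4 + i];
    return v;
}

/* Emulates the permute step: picks 4 consecutive lanes, starting at
   `shift`, out of the 8 lanes formed by lo followed by hi. */
static vec4f select_lanes(vec4f lo, vec4f hi, size_t shift)
{
    vec4f out;
    assert(shift < 4);
    for (size_t i = 0; i < 4; i++) {
        size_t src = shift + i;
        out.lane[i] = (src < 4) ? lo.lane[src] : hi.lane[src - 4];
    }
    return out;
}

/* Returns base[index] .. base[index + 3]. The caller must guarantee that
   both aligned blocks touched here are readable, i.e. that base holds at
   least (index / 4 + 2) * 4 elements. */
vec4f load_unaligned(const float *base, size_t index)
{
    assert(base != NULL);
    vec4f lo = load_aligned_block(base, index / 4);
    vec4f hi = load_aligned_block(base, index / 4 + 1);
    return select_lanes(lo, hi, index % 4);
}
```

Two loads and a lane selection replace what would be a single load on hardware that accepts unaligned vector addresses, which is the extra work being discussed.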

> I'd just note that AltiVec and its restrictions don't by any means define
> the universe of multimedia and audio implementation strategies. Lots of
> that still takes place on DSPs and other embedded processors that work
> just fine one word at a time.

The discussion was not how much one particular processor is used; the
discussion was about the importance of unaligned accesses. Vector
processors need unaligned access much more than non-vector processors.
From:Nick Maclaren
Subject:Re: Unaligned accesses (was Re: RISC vs. CISC design principles)
Date:14 Jan 2005 08:56:57 GMT
In article ,
Christian Bau wrote:
>In article ,
> Andrew Reilly wrote:
>
>> Well, that sucks. Doesn't AltiVec have permutation operations to at least
>> help with that sort of thing?
>
>Obviously "aligned" vs. "unaligned" is always in terms of what you are
>trying to process. If you try to process a single floating point number,
>then four byte alignment = aligned, anything else = unaligned. If you
>try to process vectors of four floating point numbers, then sixteen
>bytes = aligned, anything else = unaligned. Especially floating-point
>aligned != vector aligned.

That's bonkers. What alignment does it require for vectors of length
three, or doesn't it allow them?


Regards,
Nick Maclaren.
From:Nick Maclaren
Subject:Re: Unaligned accesses (was Re: RISC vs. CISC design principles)
Date:13 Jan 2005 10:54:05 GMT

In article ,
Andrew Reilly writes:
|>
|> I would say that modern-day audio and video compression standards are a
|> good example of file (and communication) formats done *well*, by Nick's
|> standards, as they are universally (in my experience) defined in terms of
|> packed bit-strings, rather than fwrite(c-struct) /* and-hope-it-ports-ok,
|> later */, which was what Nick was complaining about (I believe).

Precisely.


Regards,
Nick Maclaren.
From:John Savard
Subject:Re: Unaligned accesses (was Re: RISC vs. CISC design principles)
Date:Sat, 15 Jan 2005 17:45:08 GMT
On Thu, 13 Jan 2005 01:46:34 GMT, Maynard Handley
wrote, in part:

>You obviously have never programmed AltiVec, have you, Nick?
>
>While I understand why AltiVec does not allow for unaligned accesses,
>and accept that it may well have been and continue to be the correct
>tradeoff, the fact is that it is a pain to deal with.

AltiVec is a feature similar to MMX.

It works with small vectors which contain several items of a given data
type.

It certainly is true that forcing these vectors to be aligned on a
128-bit boundary will impact many perfectly legitimate programming
operations.

But that doesn't change the fact that it is very seldom necessary to
allow a 64-bit floating-point number to start on an odd 32-bit boundary,
and so on. If one has a compressed record format that includes 32-bit
integer fields starting at odd bytes, one just uses byte instructions to
construct the records.
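
A minimal sketch of that approach, assuming a little-endian field (the byte order and the function names are choices made for this example only): the 32-bit value is stored and fetched one byte at a time, so the field may begin at any byte offset without the processor ever issuing an unaligned word access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stores v least significant byte first at p; p may be any byte address. */
void store_u32_le(uint8_t *p, uint32_t v)
{
    assert(p != NULL);
    p[0] = (uint8_t)(v & 0xffu);
    p[1] = (uint8_t)((v >> 8) & 0xffu);
    p[2] = (uint8_t)((v >> 16) & 0xffu);
    p[3] = (uint8_t)(v >> 24);
}

/* Reassembles the 32-bit value from four individual byte loads. */
uint32_t load_u32_le(const uint8_t *p)
{
    assert(p != NULL);
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

For example, store_u32_le(rec + 1, value) places the field at byte offset 1 of a record buffer rec, and load_u32_le(rec + 1) reads it back.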

Putting only a few extra gates on a chip to allow unaligned accesses,
and then warning programmers that these accesses will have a performance
penalty, so they should not be used unless really needed, is usually the
best tradeoff, though. It eliminates a potential source of confusion and
error at the lowest cost.

Pipelined arithmetic units allow for vector operations which allow
overlapped, rather than simultaneous, operation on successive vector
elements. The Cray and its predecessors are examples of this. While
there's nothing wrong with having a parallel vector unit as well, it can
be pipelined too, and vectorized as well: that is, given vector
instructions that act on vectors whose length is a multiple of the
length of the vectors on which it operates as elementary units.

Thus, when the fast wide arithmetic unit won't do, just use a vector
instruction on the slow narrow arithmetic unit. Since they're two
different arithmetic units, they could even be running at the same time,
so that rather than having fewer FLOPS by using the slower arithmetic
unit occasionally, one ends up with more FLOPS!

John Savard
http://home.ecn.ab.ca/~jsavard/index.html
From:Bernd Paysan
Subject:Re: RISC vs. CISC design principles
Date:Thu, 13 Jan 2005 12:30:19 +0100