A brief(ish) description of BCPL

Everyone and their pet cats know about C, but relatively few people realize that C was indirectly derived from a language called BCPL, which stands for "Basic CPL"; CPL in turn stands for "Combined Programming Language".

I first encountered BCPL when I was at Cambridge University, the home of the language and its designer, Martin Richards. BCPL was the systems programming language of choice at Cambridge, because it generated code better than the IBM Fortran compilers while not requiring the agony of Assembler. In turn, Computer Science students were taught it as their first "serious" language (interestingly enough, not by Martin Richards). For many (myself not included) it was their first introduction to anything more advanced than Fortran 66.

After graduation, I went to work for a local manufacturer of Z80-based systems that did almost all of its systems programming in BCPL. Only after the move to Unix did we in turn move to C. As a result, I find myself with several years of practical BCPL programming experience.

When the topic came up once again on the net, I wrote this summary of the language for some readers who already know C. It is in no sense a comparison or a historical survey but simply a high-speed description. Feedback on this description is welcome; I have no doubt that there are parts which are still unduly terse and would benefit from expansion, but I'd like to know which !

BCPL implementations are available from Martin Richards, or from the old BCPL distribution.

Clive D.W. Feather

Types

BCPL is an "operator-typed" language, rather than a "declaration-typed" language. That is, each object in the language can be viewed as having any type, and can change type at any time. It is the operators used to manipulate the object that determine the type it has at that moment, and nothing else.

The memory of a BCPL program is divided into cells. All cells have the same size. A cell stores a single value which can be treated as any of:

an integer - binary is not assumed by the language definition
a bit pattern
an address
a procedure designator
a floating point number
a selector
a jump target
a jump closure
a stream designator.

Except where stated, the conversion between these types is implementation defined.

Every cell has an address. Under certain circumstances, two or more cells are required to be adjacent in memory. In this case, their addresses, if treated as integers, are consecutive.

In addition, a group of adjacent cells may be treated as a character buffer. There are at least BYTESPERWORD characters for every cell (though possibly more).

Cells have a storage duration, which can be static, dynamic (associated with a scope), or heap.

Identifiers

Identifiers are comprised of letters, digits, and dots, and begin with a letter. There is no limit on the length of names. Case sensitivity is implementation-defined.

Constants

Integer constants are decimal; prefix # means octal, and #x hexadecimal.

TRUE and FALSE are reserved words.

? is a constant with an indeterminate value (which may be different each time it is evaluated); any constant expression containing an evaluated ? is itself a ? .

Character constants are of the form 'a'; there are no multi-character constants.

String constants are of the form "abc"; they contain between 0 and 255 characters.

Escape sequences in both string and character constants are *", *', ** (representing the second character) and *N (newline), *T (tab), *S (space), *B (backspace), and *P (pagethrow). In addition, a string can be extended over more than one line; the sequence *<white space>* is ignored within a string.

A string constant is replaced by the address of a large enough number of contiguous cells initialized so that:

    "abc" % 0 = 3
    "abc" % 1 = 'a'
    "abc" % 2 = 'b'
    "abc" % 3 = 'c'

The cells have static storage duration.

A constant expression may use any operator, any constants other than string constants, and any identifier declared as a manifest.

Floating-point constants are of the form 1.2, 1.2e-3, or 1e+2.

Operators

Parentheses may group expressions to control precedence. Function call has higher precedence than all operators. The operators are:

   ! (dyadic) % ::           highest precedence
   @ ! (monadic)
   * / REM #* #/
   + - #+ #- (dyadic and monadic)
   = ~= < <= > >= #= #~= #< #<= #> #>=
   << >>
   ~
   &
   |
   EQV NEQV
   ->
   TABLE
   VALOF                     lowest precedence
   SLCT                      precedence not described in R&W-S

Operators associate left-to-right except for -> and TABLE.

Except where stated, an identifier on the left of an assignment means that the result of the right hand side is to be stored in the cell named by that identifier, while an identifier in any other location represents the content of the cell it names. Statements like "a is an address" or "the integer a" mean "the value of the expression a, now taken as an address" or "the value of the expression a, now taken as an integer".

a!b

is equivalent to !(a+b).

!a

is the content of the cell whose address is given by a; it can appear on the left of an assignment.

a%b

a is the address of 1 or more contiguous cells; these cells are treated as a character buffer. On the left of an assignment, the bth character in the buffer is set to the least significant BITSPERBYTE bits of the result of the right side. In all other contexts, the value is that of the bth character padded to the left with zeroes (that is, characters are unsigned).

@a

if a is an identifier not declared as a manifest constant, then the address of the cell that that identifier names (@ may only be used in this way or in the following two special ways).

@!a

is equivalent to a.

@a!b

is equivalent to (a+b).

a*b and a/b, a REM b, a+b, +a, a-b, -a

are integer arithmetic operators. Arithmetic overflow is undefined.

a#*b and a#/b, a#+b, #+a, a#-b, #-a

are their floating-point equivalents.

a=b etc.

are integer comparisons, and return TRUE or FALSE.

a<b<=c etc.

is equivalent to a<b & b<=c.

a#=b etc.

are floating-point comparisons with the same rules.

a<<b anda>>b

shift the bit-pattern a left or right b (integer) bits. Zeroes are shifted in, and b must be non-negative.

~a

logical not - see below for the meaning of "logical".

a&b

logical and.

a|b

logical or

a NEQV b

logical xor.

a EQV b

is equivalent to ~(a NEQV b).

a->b,c

if a is true, evaluate b, else evaluate c.

SLCT a and SLCT a:b and SLCT a:b:c,

where a, b, and c are all non-negative constant expressions, and b and c default to 0, is a selector.

a OF b

(SLCT x:y) OF b selects a bitfield from within !b:

on the left side of an assignment, sets x bits of !b to the bottom x bits of the right hand side. The bits set are bits y to y+x-1, where bit 0 is the rightmost bit; if x is 0, they are all bits except the rightmost y.
in all other contexts, yields the bottom x bits of (if x is zero, then is equal to) (!b >> y).

(SLCT x:y:z) OF b is equivalent to (SLCT x:y) OF (b+z).

TABLE k1,k2,...kn, where all the ki are constant expressions,

is the address of n contiguous cells initialized to ki in order. The cells have static storage duration.

VALOF c

where c is a command, causes c to be executed. If, during the execution of c, a RESULTIS command is reached, that controls the value of the expression. Otherwise the value is unspecified.

Logical operations are evaluated according to one of two rules. If the result of the operator is in "truth context", then "truth rules" are used. Otherwise the operator takes bit-patterns as argument(s) and operates on each bit separately. Truth context is:

left side of ->;
outermost operator in the controlling expression of a conditional command;
a direct operand of an operator evaluated using truth rules.

Under truth rules, all operands must be either TRUE or FALSE. For dyadic operators, the left operand is evaluated. If and only if this does not determine the result, then the right operand is evaluated.

TRUE and FALSE have values such that logical operations not using truth rules evaluate correctly to TRUE or FALSE.

Procedure calls

The syntax is:

    e1 ()
    e1 (e2, e3, ..., ex)

and associates from left to right. e1 to ex are all evaluated, and e1 must result in a procedure designator; the procedure is called with the appropriate arguments. If the procedure is a function, the value of the call is that of the evaluated function body; if it is a routine, the value is ?. The number of arguments does not have to match the number of parameters of the procedure. Arguments are passed strictly by value.

Blocks

A block is zero or more declarations followed by one or more commands, all within section brackets. These are separated (not terminated) by semicolons.

Sections

A section is any construction within section brackets. These are $( and $) optionally followed by a tag - which is a sequence of letters, digits, and dots (not necessarily beginning with a letter) - without an intervening space. If the tag of the closing bracket does not match that of the opening one, it also closes any intermediate open section. For example, the following is valid BCPL:

    $(1
        some code
        $(1.1
            some code
        $)1.1
        $(1.2
            some code
            $(
                some code
    $)1 // Also closes 1.2 and the untagged block

Commands

A command is a call, an assignment, a block, or any of the following (ei are expressions and ci are commands):

    c REPEAT
    c REPEATWHILE e
    c REPEATUNTIL e
    IF e THEN c
    UNLESS e DO c
    TEST e THEN c1 ELSE c2
    WHILE e DO c
    UNTIL e DO c
    FOR i = e1 TO e2 BY e3 DO c
    RESULTIS e
    SWITCHON e INTO c
    GOTO e
    FINISH
    RETURN
    BREAK
    LOOP
    ENDCASE

Assignments have the form:

    a1, a2, ..., an := e1, e2, ..., en

where the lists on the two sides must have the same length, and ai must be valid on the left side. The order of the assignments is unspecified; they may be interleaved. Most recent implementations allow the assignment symbol to be prefixed with any dyadic operator, as in:

    a1, a2, a3 <<:= e1, e2, e3

This is equivalent to:

    a1, a2, a3 := a1 << e1, a2 << e2, a3 << e3

except that the expressions a1, a2, a3 are evaluated only once, and the << has lower precedence than any operators in the six expressions.

A call has the same form as in an expression, except that any result is ignored.

The contained command of a REPEAT command is the smallest possible; in particular, in:

    I := VALOF F () REPEATWHILE J

the repeated command is the call to F. The unconditional REPEAT command repeats forever. REPEAT commands always execute the contained command at least once.

In the FOR statement, the BY e3 part is optional (equivalent to BY 1); e3 must be a constant expression. The identifier i is declared as a new dynamic cell whose scope is the contained command. The command is executed zero or more times with the cell taking values e1, e1+e3, e1+2*e3, ... so long as the value is less than (greater if e3 is negative) e2. The expressions are evaluated before the first iteration, and not re-evaluated each time.

In the SWITCHON command, the contained command must be a block which cannot contain declarations, but can contain case labels. These have the form

    CASE e:
    DEFAULT:

where e is a constant expression. Note that case labels can only be in that block, and not in inner blocks (except for nested SWITCHONs).

FINISH terminates the program; RETURN terminates the current procedure (with an undefined result if it is a function); BREAK terminates, and LOOP ends the current iteration of, the textually surrounding WHILE, UNTIL, REPEAT, or FOR. ENDCASE similarly exits the textually surrounding SWITCHON. RESULTIS cannot terminate the current procedure.

Declarations

A program is a sequence of declarations, separated by semicolons. Each declaration declares one or more identifiers. There are two basic kinds of declarations: section declarations and simultaneous declarations. Identifiers declared in section declarations have a scope that starts at the end of that identifier's declaration; identifiers declared in simultaneous declarations have a scope that starts at the first of the set of simultaneous declarations (hence the name). In each case, the scope ends at the end of the block the declaration occurs in, or the end of the file if not in a block.

Each declaration that creates a cell with dynamic storage duration causes a new cell to be created each time the declaration is executed; the cell is (notionally) destroyed when the block containing the declaration terminates. The order that the cells declared in the same simultaneous declaration are initialized is undefined; they may be interleaved.

There are three types of section declarations. In each case, vi stands for an identifier not declared at the current level, and ki for a constant expression.

MANIFEST$(v1 = k1;v2 = k2;v3 = k3; ... ;vn = kn $): Declares vi as a constant whose value is ki.
STATIC $(v1 = k1;v2 = k2;v3 = k3; ... ;vn = kn $): Creates one cell with static storage duration for each identifier; the cell is initialized to the expression before the program starts executing.
GLOBAL$(v1 : k1;v2 : k2;v3 : k3; ... ;vn : kn $): The ki must be non-negative. A block of consecutive cells exists available to all modules of the program, called the global vector. vi designates the ki'th cell of the global vector; enough cells are allocated to cope with all declarations in all modules.

Simultaneous declarations take the form:

    LET d1 AND d2 AND ... AND dn

There are four types of declaration that can be used; they are shown here with LET, but of course can also occur with AND. The types may be freely mixed within one simultaneous declaration.

LET v1, v2, v3, ..., vn = e1, e2, e3, ..., en

One cell with dynamic storage duration is created for each identifier, and initialized to the corresponding expression, which need not be constant.

LET v = VEC k

One cell with dynamic storage duration is created for the identifier. Another k+1 contiguous cells (the original cell is not necessarily contiguous with them) with the same duration are also created, and the first cell is initialized with the address of the first of the other cells.

LET v (v1,v2, ...,vn) BE command LET v (v1,v2, ...,vn) = e

These create a procedure body; the vi are the formal parameters. If this lies within the scope of a GLOBAL declaration for v, then it continues to designate the cell of the global vector. Otherwise it designates a cell with static storage duration. Before the program starts, the cell is initialized to a procedure designator that invokes the appropriate body.

The first form generates a routine (which has no result), and the second a function (which does). In effect, the following are equivalent:

: LETv (v1, v2, ..., vn) BE command
and: LETv (v1,v2, ...,vn) =
VALOF $( command; RESULTIS? $)

The first two forms may only occur within blocks. This isn't a formal constraint written in the BCPL book by Richards and Whitby-Strevens, but I've never seen file-scope LET declarations of variables. In particular, if I wanted a static or global array, I would always declare the array within START, and then assign it to the static or global variable.

The parameters of a function or routine designate contiguous cells, with adjacent parameters being adjacent. It is implementation defined whether the first parameter has the lowest or highest address. It is implementation-defined whether the values of arguments in excess of the number of parameters are stored in further contiguous cells or are discarded.

Functions and routines may be declared inside blocks. A function or routine may not refer to an identifier which designates a cell with dynamic storage duration and is not declared in that function or routine.

Labels

Labels have the form of an identifier followed by a colon; any number of labels may precede a command. Each identifier designates a cell with static storage duration (as with functions and routines, if a GLOBAL declaration of the identifier is in scope, that cell is used) which is initialized, before the program starts, to a jump target referring to the location of the label. A label is in scope within the smallest of:

the commands (but not declarations) of the textually surrounding block
the body of the textually surrounding routine (if any)
the body of the textually surrounding VALOF (if any)
the body of the textually surrounding FOR (if any)

A jump closure is a value which refers to a particular invocation of a function or routine. The closure of the current invocation can be determined by calling the library routine LEVEL(); this value is no longer valid when the invocation terminates.

There are two ways to jump to a label: GOTO t, and calling the library routine LONGJUMP(c,t); t is an expression which evaluates to a jump target, and c is an expression evaluating to a jump closure. The jump must be made either from the scope of the target label, or from a function or routine called (possibly indirectly) from within that scope. In the latter case, c must evaluate to the closure of an invocation of the function containing the label and which is still active.

System Library

This has been built up by consensus. All routines and functions are accessed via the global vector via GLOBAL declarations in the header LIBHDR. The allocation of routines and functions to particular cells is implementation defined, but only cells less than FIRSTFREEGLOBAL are used.

Note that input and output goes via the designators held in the global vector cells designated by RDCH and WRCH. Altering those cells will alter the behaviour of all I/O operations.

LIBHDR contains the manifests:

    BITSPERBYTE
    BYTESPERWORD
    ENDSTREAMCH
    FIRSTFREEGLOBAL

START (implementation-defined parameters)

This is not a library routine, but an identifier of a global vector cell which the program must initialize; the routine designator in that cell is called by the system to execute the program.

character = RDCH ()

Returns the next character from the selected input stream. Returns ENDSTREAMCH at end of file.

WRCH (character)

Writes the character to the selected output stream.

SELECTINPUT (stream.designator)

Selects the designated stream as the new input stream. At program start, the standard input stream is selected.

SELECTOUTPUT (stream.designator)

Selects the designated stream as the new output stream. At program start, the standard output stream is selected.

stream.designator = FINDINPUT (string)

Returns a stream designator for the indicated file or device (read-only).

stream.designator = FINDOUTPUT (string)

Returns a stream designator for the indicated file or device (write-only).

stream.designator = FINDFILE (string)

Returns a stream designator for the indicated file (read-write and seekable).

stream.designator = CURRENTINPUT () stream.designator = CURRENTOUTPUT ()

Return the designator of the selected stream. [I think these are routines, but the designators might just be stored in the global vector; I can't remember.]

[I forget the routines for seeking on a stream.]

integer = READN ()

Reads an integer via RDCH and returns it.

WRITES (string) WRITEN (integer) WRITED (integer, minimum.width) WRITEOCT (integer, minimum.width) WRITEHEX (integer, minimum.width)

Writes the string or integer via WRCH.Padding is done with spaces on the left.

WRITEF (format, args ...)

Does a formatted write via WRCH.The format is a string which can contain:

    %%  call WRCH('%')
    %S  call WRITES   with the next argument
    %C  call WRCH     with the next argument
    %O1 call WRITEOCT with the next argument
    %X1 call WRITEHEX with the next argument
    %I1 call WRITED   with the next argument
    %N  call WRITEN   with the next argument

where 1 can be any single base-36 digit (i.e. A means 10, B means 11, and so on), and is passed as the second argument of the call.

Other formats are implementation-defined. Note that these calls go via the global vector (normally twice, since each then calls WRCH).

character = GETBYTE (s, i)

Equivalent to s%i; used when % is not supported.

PUTBYTE (s, i, c)

Equivalent to s%i:=c; used when % is not supported.

integer = PACKSTRING (v, s)

Equivalent to: FOR i = 0 TO v!0 DO s%i = v!i except that it returns the offset from v of the cell holding s%(v!0).

UNPACKSTRING (s, v)

Equivalent to: FOR i = 0 TO s%0 DO v!i = s%i.

integer = MULDIV (x, y, z)

Returns (x*y/z) evaluated without overflow if the final result fits in a cell.

integer = RANDOM (seed)

Returns a random number generated from the seed. It is common to use the result as the seed for the next call.

closure = LEVEL () LONGJUMP (closure, target)

Described above.

value = APTOVEC (function, n)

Allocates n+1 cells with dynamic storage duration, and then calls the function with two arguments: the address of the first cell, and n. Returns the result of the function.

address = GETBLK (n)

Allocates n+1 contiguous cells with heap storage duration, and returns the address of the first.

FREEBLK (address)

Frees a block of heap cells.

Syntactic sugar

Comments extend either from // or || or \\ to end of line, or from /* or |* or \* to the corresponding close symbol (the two characters reversed); comments do not nest, and all comment symbols other than the required close symbol are ignored in a comment.
If a semicolon is syntactically sensible at the end of a line, and one is not already there, it will be inserted. As a corollary, a line cannot begin with a dyadic operator.
The keywords THEN and DO are interchangeable in all circumstances, as are OR and ELSE. THEN/DO may be omitted if followed by another keyword or a block.
The keyword GET followed by a string constant causes the file indicated by the string to be textually included at this point. The keyword and string (possibly followed by a comment) should form a source line of their own.

Extensions

The following parts of the language are not supported by all compilers:

floating-point
the % operator (but string constants are part of the core language)
the SLCT and :: operators
styles of comment other than //
operator-assignments such as +:=