Date

Recently, I have been working on an open-source tool for producing bitstreams for Xilinx Coolrunner-II CPLDs that I have named xc2par. It is finally at a point where I no longer feel embarrassed to have other people try and use it.

Quick start

The first tool you will need is yosys. You need a version that contains commit 14e49fb05737cfb02217b2aaf14d9f5d9e1859da, but it is recommended to use the latest master branch. Follow the instructions in the yosys README to install it (normally I would point people to my pre-built binaries, but those are currently broken).

Next, you will need to install the compiler for the Rust programming language.

After installing Rust, clone the "openfpga" repository. Change directories into the src/xc2par directory and run cargo build --release. When this completes, the final command-line tool will be in target/release/xc2par.

At this point you are ready to try compiling some HDL! Unfortunately, xc2par does not currently accept exactly the same code as the proprietary Xilinx tools, and there is also currently no documentation on what is or is not accepted. You will have to either guess or read the source code to figure out what currently works. Most notably, .ucf constraint files are not supported (you must use LOC attributes, and they must use the function block and macrocell number rather than a package pin), and automatic insertion of BUFGs is not supported. However, to get started, the following is some example Verilog for an LED blinker:

module top(led0, led1, led2, led3, clk_);

// NOTE: Must use LOC attributes rather than a .ucf file
(* LOC = "FB1_9" *)
output led0;
(* LOC = "FB1_10" *)
output led1;
(* LOC = "FB1_11" *)
output led2;
(* LOC = "FB1_12" *)
output led3;
(* LOC = "FB2_5" *)
input clk_;

// NOTE: Must manually instantiate BUFG
wire clk;
BUFG bufg0 (
    .I(clk_),
    .O(clk),
);

reg [23:0] counter;

assign {led3, led2, led1, led0} = counter[23:20];

always @(posedge clk)
    counter <= counter + 1;

endmodule

To compile this code, use the following commands:

yosys -p "synth_coolrunner2 -json blinky.json" blinky.v
./target/release/xc2par -p xc2c32a-4-vq44 blinky.json

At this point, if everything worked correctly, there will be a blinky.jed file in the same directory as the blinky.json file. It should now be possible to program this file into a CPLD to test it out. However, embarrassingly, there currently isn't any code in the xc2par project that can help with that. You will either need to use

  • ISE/iMPACT (either to generate a .svf file or to program parts directly)
  • Andrew Zonenberg's jtaghal (only supports the 32A and 64A parts)
  • Some other JTAG programmer that can read Xilinx .jed files

The Origins of xc2par

In the beginning, there was only Andrew Zonenberg. Back in 2014, Andrew took some CPLDs and exposed them to a slew of nasty chemicals and some high-energy electrons. This culminated in a talk presented at REcon 2015. At this point in time, the PLA (AND and OR gates) and interconnect were understood but the macrocell configuration was not. Although Andrew had always wanted to write an open-source compiler for these chips, yosys did not have support for sum-of-products architectures at this point in time. This combined with Life™ led to the project being temporarily shelved.

Robert Has Joined the Party

Being a "hacker" with extremely varied interests, I had been lurking Andrew's work for years (since at least 2012). I had an interest in both programmable logic devices and reverse engineering. However, since I was pretty busy with Life™ of my own, I had never considered ever contacting Andrew or collaborating. However, one of Andrew's projects combined with perfect timing changed this.

Quick Detour: Introducing GreenPak

I first heard about these interesting tiny programmable logic devices from an article on Hackaday. An interesting feature of these parts is that the bitstream format is completely documented in the datasheet. After reading the article, I purchased a GreenPak development kit with the idea of making "some kind of domain-specific-language" for programming these parts (I was not ambitious enough to consider supporting a traditional HDL, and I was not yet aware that yosys existed at the time). However, since I was still working on Life™ (being a university student), this project also got shelved.

However, in the meantime, Andrew was working on his open-source Verilog flow for GreenPak. This tool just so happened to be released almost exactly around the time that I finished my undergrad degree. Since I had planned to take a year off to remain "funemployed," I decided to introduce myself to Andrew and joined the ##openfpga IRC community on Freenode somewhere around this time. Andrew eventually suggested that I could work on a place-and-route for Coolrunner-II parts (yosys support for sum-of-products had also gotten added around this time).

Finishing the Reverse Engineering

When Andrew had originally set aside the Coolrunner-II toolchain project back in 2015, the macrocell configuration was not yet understood. Andrew suggested that I ignore the macrocell-related parts and write tooling to handle the other parts of the device first. Macrocells would be configured by copying the entire block of bits from an existing ISE-generated design. Dissatisfied with this answer, I decided to schedule an IRL hackathon with Andrew to figure out all of the remaining bits. This resulted in the following:

Whiteboard drawing of Coolrunner-II macrocell

Note that this diagram is not accurate and has errors. A more accurate diagram needs to be drawn eventually.

Now with all the bits understood, we could begin writing tooling that did not require any strange hacks.

xc2bit and xc2par are Born

At the end of May 2017, the first commit of the "xc2bit" code is created. This is a Rust library for performing low-level manipulations of Coolrunner-II bitstreams. This is analogous to Andrew's never-finished libcrowbar library. The first commit of xc2par is created a few days later.

First Attempt: xbpar

The first version of xc2par was supposed to use xbpar, Andrew's generic C++ framework for writing place-and-route tools for "crossbar interconnect" architectures. The internals of xbpar are described in Andrew's blog post about the GreenPak Verilog tooling, but the general idea is that the algorithm takes as input two graphs, one modeling the target device and one modeling the user's design, and attempts to check if the user's design graph is a subgraph of the device structure graph. It does this by attempting to "pair up" nodes of the two graphs and then checking if the edges between the nodes of the design graph exist in the device graph. If any edges do not, it attempts to find something better using simulated annealing.

Trying to use this library for xc2par had a number of issues. The first issue was "ideological." I wanted to use Rust for xc2par because I did not like C++ but wanted more powerful libraries and abstractions than are available in C. Unfortunately, xbpar is a C++ library that isn't very friendly to being interfaced with a different programming language. Among other things, device-specific functionality is implemented by deriving from an xbpar base class and overriding some methods. However, I persevered and attempted to write glue code between the two languages. Unfortunately, all of this glue code ended up being more lines of code than xbpar itself.

A second issue was that the specific mechanism I had used for describing graphs for xbpar caused extremely poor behavior. I had described every "enable-able or disable-able" feature of the macrocell as an individual node in xbpar (so the XOR gate of a macrocell would be a node, the register would be a different node, and the IO pad would be yet another different node). However, this would cause the following behavior:

  1. The greedy initial placement algorithm packs together "the wrong" set of "macrocell pieces." For example, it would place the IO pad for pin a and the XOR gate for pin b both at function block 1 macrocell 1. Meanwhile, the XOR gate for pin a and the IO pad for pin b both get placed at function block 1 macrocell 2.
  2. The algorithm would notice that there are not proper edges between the nodes. In the above example, there is an edge in the device graph between the XOR and IO pad of FB1_1, but there are no edges between the IO pad of FB1_1 with the XOR gate of FB1_2 or between the XOR gate of FB1_1 with the IO pad of FB1_2. Therefore, all components of pins a and b will get added to the list of nodes contributing to the bad scores.
  3. The simulated annealing step picks one of the "bad" nodes and tries to move it. It ends up moving e.g. the XOR gate of one pin to a completely different random spot. This does not improve the situation since the XOR gate needs to be in the same location as all of the other "macrocell bits" for the same pin (but it has just been moved somewhere random and unrelated).
  4. The cycle repeats without ever making progress.

One possible solution for this problem is "why not just model the entire macrocell as one giant node with a bunch of attributes?" This solution might work, but it has difficulties with optimal packing of buried nodes. The Coolrunner-II architecture has the following quirk:

  • A macrocell that needs to output to a pin cannot share a macrocell site with anything else
  • A buried combinatorial feedback node can share a macrocell site with both a direct pin input as well as a pin input that goes through the register
  • A buried registered feedback node can share a macrocell site with a direct input pin only, and it can only share this site if the registered feedback node is not also simultaneously a combinatorial feedback node.

It was not clear to me how to express this quirk in a way that was actually useful at fixing the above behavior. Since this also required me to write a lot of device-specific code, I was wondering if I could somehow avoid having to do that.

Second Attempt: SAT/SMT

While I was encountering the above problems, I discussed them with one of my housemates. They suggested that I should try feeding the place-and-route problem into a SAT/SMT solver. The idea behind this was that:

  • Modern SAT/SMT solvers have quite a few advanced algorithms in them
  • The problem size seemed "small"

With this suggestion, I quickly hacked together a naive lowering of the place-and-route problem into a SMT problem. Since I added this hack into xbpar, it was possible to test it for both GreenPak designs and Coolrunner-II designs.

Unfortunately, although I did not keep very detailed measurements, this approach did not work very well. A major problem was that this was my first attempt at using SMT solvers. Firstly, I selected the QF_LIA theory because I was expecting that I might use it to eventually implement timing-driven place-and-route. Unfortunately, this restricted me to using the SMT solvers that tended to be relatively slower. Secondly, the particular lowering that I chose was relatively naive and probably did not help the SMT solver to try to optimize the problem. Overall, the results were that Coolrunner-II designs never completed in a reasonable amount of time. GreenPak designs would complete relatively quickly if they did indeed fit in the device (solver returns SAT) but would take forever if they did not actually fit in the device (solver returns UNSAT).

Second-and-a-Half Attempt: Naive backtracking search

Since using a SMT solver did not appear to be working, I decided to try other "brute-force" techniques. As part of this, I wrote a naive backtracking search solver to assign design graph nodes to device graph nodes. I implemented only the "min-remaining-values" optimization. This actually completed incredibly quickly for GreenPak designs that fit. However, GreenPak designs that did not fit were still somewhat slow, and Coolrunner-II designs still did not work at all.

Since this solver was a completely standalone program and written in Python, I could quickly add hacks to it and observe how it behaved. This ended up giving me the following insights:

  • There are symmetries or "shortcuts" that naive solvers cannot easily take advantage of (even with methods like forward checking or arc consistency). For example, every product term in a PLA is accessible to every OR gate, and so it does not "really matter" where these product terms are placed. However, backtracking search cannot easily discover this. The same goes for the ZIA - every ZIA row mux output is accessible to every product term, and it does not matter which specific rows are chosen.
  • Following on from the above, the only placements that "really matter" are the placements of the macrocells. Everything else can be "mostly" done with greedy assignments. (However, note the scare quotes. Read below a less simplified explanation)
  • Backtracking search cannot be used even for just placing macrocells. An approximation must be used here. This is because, at worst, backtracking search can take exponential time and explore every single possibility. If the algorithm happens to be working on the XC2C512 and is trying to place every macrocell into every possible site, this is \(512!\approx2^{3875}\) possibilities. Even if we try to get a much lower bound by pretending that all macrocells in the user design are identical and can be freely swapped, we still have too many worst-case possibilities. To count how many, we would like to know the "number of ways to put identical (unlabeled) objects into labeled bins, where all of the bins have a maximum size." Unfortunately my combinatorics is a bit rusty, but @eggleroy managed to help me out with the following formula:
    $$ f(N, X, S) = \begin{cases} 0 & \mbox{if } N < 0\\ 1 & \mbox{if } N = 0\\ 0 & \mbox{if } N > 0, X = 0\\ \sum_{k=0}^S f(N-k, X-1, S) & \mbox{otherwise} \end{cases} $$
    where \(N\) is the number of objects (macrocells), \(X\) is the number of bins (function blocks), and \(S\) is the maximum size of each bin (macrocells per function block). \(N\) depends on the size of the user design while \(X\) depends on the specific CPLD device chosen. \(S\) is fixed for all devices in the family. Somewhat unintuitively (at least to me initially), this function is maximized (for our purposes) when \(N=256, X=32, S=16\). Computing this yields \(33926222943232064651934548544483314385 \approx 2^{125}\).

If anybody who is more familiar with SAT/SMT solvers would ever like to continue to play with these ideas, there is some code here.

Detour: xc2jed2json

While working on reverse engineering efforts, I wrote a tool that can "decompile" Coolrunner-II bitstreams back into a .json netlist that can be loaded into yosys. After some (not yet ready for production) steps, this can be turned into something vaguely understandable. Andrew has given a talk about this. More on this in the future.

Final Attempt: Explicitly Modeling Shortcuts

At this point I was rather frustrated with working with xbpar. Given the insights I obtained from the backtracking search implementation, I decided to write a completely new place-and-route tool. This tool explicitly models all of the "shortcuts" that can be taken within an individual function block but still uses heuristics for placing macrocells. The algorithm works mostly as follows:

  1. There is a subroutine that takes as input an assignment of macrocells to macrocell sites and outputs a placement of product terms and a map of ZIA usage. For each function block, this subroutine does:
    1. Finds all of the product terms referenced by macrocells in this function block. The algorithm knows about two categories of product terms: those that have some kind of constraint on their placement (because they go into the dedicated product terms (either the per-macrocell PTA/B/C or the per-function-block CTx control terms) or because they have an explicit LOC constraint), and those that have no such constraints and just go into the OR array. The "shortcut" being taken here is that only those product terms that have a constraint need to be specifically handled. Everything else can be greedily assigned.
    2. For the first type of product terms (those that have restrictions on their placement), store all the candidate locations for each of them. Possibly filter by the explicit LOC constraints.
    3. Use backtracking search to assign the locations of these product terms. If it does not succeed, return an error. This backtracking search is much smaller because macrocells will almost always use the per-macrocell PTA/B/C terms. The only terms that can be contended are the per-function-block CTx control terms. There are only 4 of them, and at worst the algorithm can try assigning one product term from each macrocell into each of the control terms. This is \(16^4=2^{16}\) which can complete almost instantly.
    4. Take the remaining product terms (that have no restrictions on their placement) and assign each one to the first open site. If there are no open sites left, return an error.
    5. Independently of the product term placement logic, gather up all inputs into the product terms. The ZIA muxes need to be programmed to provide these as inputs into the function block. Because we are starting with a complete macrocell placement, we know which specific choices we want to search for in the ZIA.
    6. Search the ZIA data table for all of the rows that can possibly give each output. Then use backtracking search to pick a feasible assignment. I do not currently have a complexity estimate for this step, but the ZIA was originally designed such that "almost all" combinations should be able to be routed. The "shortcut" here (that was also the insight originally used by Xilinx to design the ZIA) is that it does not matter which row of the ZIA is used to obtain each specific input. It only matters that the entire set of inputs is somehow present (because AND is commutative).
  2. The actual PAR tool first reads a yosys .json file and parses it.
  3. We create a data structure consisting of "nodes" and "nets" connecting them. The nodes are of a type such as "AND gate," "OR gate," "register," or "XOR gate," but they are not distinguished from each other (they are all a generic "node" data type). Net objects store their source and sink nodes, but nets can refer to any nodes whatsoever. Essentially, this step takes the yosys data model and filters/parses the individual cell types without any checking of connectivity between them. It turned out that this data structure was too difficult to work with in subsequent steps, so there is also a second data structure.
  4. The second data structure no longer models nets. Instead, there are multiple distinguished types of nodes that directly refer to each other. For example, a node corresponding to an OR gate will directly refer to the AND gates that feed it. All of the functionality related to macrocells is also packed into a large "macrocell" node with subcomponents for "the register part," "the IO part," and "the combinatorial (XOR) part." This step does most of the checking for proper connectivity.
  5. Macrocells are initially placed with a greedy placement algorithm. One observation from earlier experiments was that most simple designs can be placed with just greedy placement.
  6. Run the subroutine described in 1. If it succeeds, the place-and-route is done!
  7. If it did not succeed, something needs to be improved. Instead of using simulated annealing, xc2par uses "min-conflicts." When the subroutine described in part 1 fails, it also tries to calculate which macrocells in the placement contribute most to the failing. To do so, it simply brute-force deassigns each macrocell in the design and checks how many conflicts are left. The min-conflicts algorithm then chooses a "bad" macrocell to move weighted by the number of conflicts each macrocell causes.
  8. Once a macrocell to move is chosen, the algorithm tries every location in the device and tries swapping the macrocell into that location (ignoring those that have LOC constraints on them). The algorithm is looking for a site that improves the number of conflicts the most.
  9. If such a site is found, commit to the given swap. If not, randomly perform a swap. If an iteration limit is exceeded, give up.

After this is complete, xc2par then outputs a programming file.

Future work

Since this is an early prototype, there is always room for improvement. Hop into ##openfpga on Freenode to discuss and report bugs.

Notable areas needing improvement are:

  • More tests
  • Support for IO standards (currently hardcoded)
  • Directly generating .svf files (or other means of programming chips)
  • Ability to use the XOR-with-PTC functionality more effectively
  • Extra passes to handle Xilinx-isms (e.g. things described in the "Xilinx CPLD Libraries Guide")
  • Devices other than the XC2C32/A