Unverified commit 5b4cf5df by Pasquale Cocchini, committed by GitHub

[VTA][Chisel,de10nano] Chisel fixes and de10nano support (#4986)

* [VTA][de10nano] Enable user defined target frequency.

Issue:
The VTA target frequency on the DE10-Nano is hard-coded to 50 MHz,
unnecessarily limiting performance.

Solution:
Add a PLL to the FPGA sub-system along with support for the
selection of a user specified frequency at build time. The board
successfully builds and runs at 100MHz.

* Added a PLL in the soc_system.tcl platform designer generator
  script.

* Modified the Makefile to automatically set the target frequency
  from that specified in the pkg_config.py file.

* Modified the Makefile to generate a bitstream with an RBF
  format that enables programming of the FPGA directly from
  the on-board processor. Specifically, the RBF is generated in
  FastParallel32 mode with compression, which corresponds to the
  default MSEL switch setting on the board, i.e. 01010.

* Added a false path override to file set_clocks.sdc to turn off
  unconstrained path warnings on the VTA pulse LED.

* [VTA][TSIM] Add more debug and tracing options.

* Modified Makefile to change the default config to DefaultDe10Config.

* Added option in Makefile to produce more detailed tracing
  for extra observability in debugging complex scenarios.

* Added option in Makefile to produce traces in FST format, which
  is around two orders of magnitude smaller than VCD, although much
  slower to generate.

* Added option in Makefile to build the simulator with GCC address
  sanitizer.

* Modified the Makefile to not lint the Scala code by default,
  avoiding unintended re-indentation. Linting is better performed
  manually on an as-needed basis.

* [VTA][de10nano] Enable remote programming of FPGA.

Issue:
The Cyclone V FPGA on board the DE10-Nano can only be programmed
through the JTAG port, which is limiting for users.

Solution:
Add support for the remote programming of the FPGA implementing
the FPGA programming manager protocol published in the Cyclone V
user manual.

* Added file de10nano_mgr.h implementing an FPGA manager class
  that supports handling of control and status registers as well
  as a push-button option to program the FPGA. The class can be
  easily extended to include more registers if needed.

* Used an instance of the FPGA manager to implement function
  VTAProgram also warning users when incompatible bitstream
  files are used.

* Registered VTAProgram as a global function and modified
  the program_bitstream python class to use it.

* [VTA][de10nano] Enhance de10nano runtime support.

Issue:
The de10nano target has incomplete, non-working support
for runtime reconfiguration and bitstream programming, and
lacks usage examples.

Solution:
Complete runtime support for the de10nano target.

* Modified VTA.cmake to comment out the default override that set
  VTA_MAX_XFER to 21 bits wide.

* Modified VTA.cmake to add needed de10nano include dirs.

* Modified relevant files to support the de10nano the same way as
  other targets for VTA runtime reconfiguration and FPGA
  programming.

* Added test_program_rpc.py as a runtime FPGA programming
  example. Note that, unlike with the pynq target, no bitstream
  is downloaded or programmed when the bitstream argument is set
  to None.

* Cosmetic changes to vta config files.

* [VTA][Chisel] LoadUop FSM bug fix.

Issue:
The LoadUop FSM incorrectly advances the address of the next
uop to read from DRAM when the DRAM data valid bit is deasserted
and then reasserted at the end of a read. This is caused by a
mismatch between the logic of the state and output portions of the
FSM. This is one of two issues that were gating the correct
operation of VTA on the DE10-Nano target.

Solution:
Modify the logic of the output section of the FSM to include
a check on the DRAM read valid bit, or fold the output assignment
into the state section.

* Folded the assignment of the next uop address into the state
  section of the FSM.
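The fixed control flow can be sketched in software. This is an illustrative model only (function and variable names are made up, and the burst-length encoding of the real FSM is simplified): the next DRAM address is advanced exactly once per completed burst, in the same place the remaining-pulse bookkeeping is updated, so a late data-valid pulse cannot advance it a second time.

```python
def uop_burst_addresses(base_addr, total_pulses, xmax, xmax_bytes):
    """Issue DRAM read bursts for a uop load.

    The address advance lives next to the xrem update (the 'state
    section' of the FSM), not in a separately gated output path.
    """
    bursts = []  # (address, pulses) pairs, one per DRAM read command
    raddr, xrem = base_addr, total_pulses
    while xrem > 0:
        pulses = min(xrem, xmax)       # cap each burst at xmax pulses
        bursts.append((raddr, pulses))
        xrem -= pulses
        if xrem > 0:
            raddr += xmax_bytes        # single advance per burst
    return bursts
```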

* [VTA][Chisel] Dynamically adjust DMA transfer size.

Issue:
In the DE10-Nano target, and possibly in others, DMA transfers that
cross the boundaries of memory pages result in incorrect reads and
writes from and to DRAM. When this happens, depending on the input
values, VTA loads and stores produce incorrect results for DMA
pulses at the end of a transfer. This is one of two issues that
were gating the DE10-Nano target from functioning correctly, but it
may affect other Chisel-based targets.

Solution:
Add support for dynamically adjustable DMA transfer sizes in load
and store operations. For a more elegant and modular implementation,
the feature can be enabled at compile time with a static constant
that can be passed as a configuration option.

* Modified the load and store finite state machines to dynamically
  adjust the size of initial and stride DMA transfers. The feature
  is enabled by default by virtue of the static constant
  ADAPTIVE_DMA_XFER_ENABLE.
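The address arithmetic behind the adaptive transfer sizes can be illustrated with a small model. This is a sketch that mirrors the xfer_*_bytes / xfer_*_pulses computations in the diff (bytes remaining in the current page, converted to pulses), with illustrative names rather than the actual Chisel code:

```python
def adaptive_dma_bursts(addr, total_pulses, pulse_bytes, page_bytes):
    """Split a DMA transfer into bursts that never cross a page
    boundary: each burst is clipped to the bytes remaining in the
    current page (page_bytes - addr % page_bytes), expressed in
    pulses of pulse_bytes each."""
    bursts = []
    while total_pulses > 0:
        room = (page_bytes - addr % page_bytes) // pulse_bytes
        pulses = min(total_pulses, room)
        bursts.append((addr, pulses))
        addr += pulses * pulse_bytes
        total_pulses -= pulses
    return bursts
```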

* [VTA][Chisel] Improve FSIM/TSIM/FPGA xref debug.

Issue:
Cross-referencing between FSIM, TSIM, and Chisel-based FPGA traces
is an invaluable instrument that enables fast analysis on FSIM,
and analysis/debug on TSIM and FPGA, especially for complex flows
like conv2d or full inferences. Currently this cannot be done
easily since a suitable reference is missing. The clock cycle
event counter cannot be used since it is undefined in FSIM and
not reliable between TSIM and FPGA because of different latencies.

Solution:
Introduce a new event counter that preserves program order across
FSIM, TSIM, and FPGA. We propose adding an accumulator write event
counter to the Chisel EventCounter class and a simple instrumentation
in the FSIM runtime code. Note that this technique enabled finding
the Chisel issues reported in this PR, which would otherwise have
been far more difficult.

* Added the acc_wr_count event counter and changed interfaces
  accordingly.
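The intended register behavior can be modeled cycle by cycle. A sketch (illustrative names, not the Chisel source): the count is cleared while the accelerator is not running, incremented on each accumulator write event, and the value reported at finish depends only on program order, not on cycle latency.

```python
def acc_wr_count_model(trace):
    """Model of the acc_wr_count register over a list of
    (launch, finish, acc_wr_event) cycles."""
    count, reported = 0, None
    for launch, finish, acc_wr_event in trace:
        if finish:
            reported = count   # ucnt becomes valid at finish
        if not launch or finish:
            count = 0          # register clears outside a run
        elif acc_wr_event:
            count += 1
    return reported
```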

* [VTA][de10nano] Comply with linting rules.

* [VTA] Appease make lint.

* [VTA] Disable pylint import not top level error.

* [VTA][Chisel,de10nano] Linting changes.

* Use CamelCase class names.

* Use C++ style C include header files.

* Add comments to Chisel makefile.

* [VTA][de10nano]

* Reorder C and C++ includes in de10nano_mgr.h.

* Restore lint as default target in Chisel Makefile.

* [VTA][de10nano] Do not use f string in pkg_config.py.

* [VTA][de10nano] Remove overlooked f strings in pkg_config.py.

* [VTA][de10nano] Fixed typo.

* [VTA][TSIM] Check if gcc has align-new.

* [VTA][Chisel] Make adaptive DMA transfer default.

* [VTA][RPC] Renamed VTA_PYNQ_RPC_* to VTA_RPC_*.

Issue:
With more FPGA targets coming online, the initial method of
using individual environment variables to specify the target IP and
port does not scale well.

Solution:
Use a single VTA_RPC_HOST, VTA_RPC_PORT pair that is changed
every time a different target is used, for instance in a script
used to benchmark all targets.

* Replaced every instance of VTA_PYNQ_RPC_HOST and VTA_PYNQ_RPC_PORT
  with VTA_RPC_HOST and VTA_RPC_PORT, respectively.
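Client code can then read the unified pair from the environment. A minimal sketch (the function name and default values are illustrative assumptions, not mandated by the change):

```python
import os

def vta_rpc_endpoint(default_host="0.0.0.0", default_port=9091):
    """Resolve the RPC endpoint from the single VTA_RPC_HOST /
    VTA_RPC_PORT pair, falling back to illustrative defaults."""
    host = os.environ.get("VTA_RPC_HOST", default_host)
    port = int(os.environ.get("VTA_RPC_PORT", str(default_port)))
    return host, port
```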

* [VTA][Chisel] Comply with new linter.
parent 78fa1d5e
@@ -101,7 +101,9 @@ elseif(PYTHON)
          ${VTA_TARGET} STREQUAL "ultra96")
     target_link_libraries(vta ${__cma_lib})
   elseif(${VTA_TARGET} STREQUAL "de10nano") # DE10-Nano rules
-    target_compile_definitions(vta PUBLIC VTA_MAX_XFER=2097152) # (1<<21)
+    #target_compile_definitions(vta PUBLIC VTA_MAX_XFER=2097152) # (1<<21)
+    target_include_directories(vta PUBLIC vta/src/de10nano)
+    target_include_directories(vta PUBLIC 3rdparty)
     target_include_directories(vta PUBLIC
       "/usr/local/intelFPGA_lite/18.1/embedded/ds-5/sw/gcc/arm-linux-gnueabihf/include")
   endif()
...
@@ -146,8 +146,8 @@ Tips regarding the Pynq RPC Server:
 Before running the examples on your development machine, you'll need to configure your host environment as follows:
 ```bash
 # On the Host-side
-export VTA_PYNQ_RPC_HOST=192.168.2.99
-export VTA_PYNQ_RPC_PORT=9091
+export VTA_RPC_HOST=192.168.2.99
+export VTA_RPC_PORT=9091
 ```
 In addition, you'll need to edit the `vta_config.json` file on the host to indicate that we are targeting the Pynq platform, by setting the `TARGET` field to `"pynq"`.
...
@@ -7,7 +7,7 @@
   "LOG_BATCH" : 0,
   "LOG_BLOCK" : 4,
   "LOG_UOP_BUFF_SIZE" : 15,
-  "LOG_INP_BUFF_SIZE" :15,
+  "LOG_INP_BUFF_SIZE" : 15,
   "LOG_WGT_BUFF_SIZE" : 18,
   "LOG_ACC_BUFF_SIZE" : 17
 }
@@ -7,7 +7,7 @@
   "LOG_BATCH" : 0,
   "LOG_BLOCK" : 4,
   "LOG_UOP_BUFF_SIZE" : 15,
-  "LOG_INP_BUFF_SIZE" :15,
+  "LOG_INP_BUFF_SIZE" : 15,
   "LOG_WGT_BUFF_SIZE" : 18,
   "LOG_ACC_BUFF_SIZE" : 17
 }
@@ -7,7 +7,7 @@
   "LOG_BATCH" : 0,
   "LOG_BLOCK" : 4,
   "LOG_UOP_BUFF_SIZE" : 15,
-  "LOG_INP_BUFF_SIZE" :15,
+  "LOG_INP_BUFF_SIZE" : 15,
   "LOG_WGT_BUFF_SIZE" : 18,
   "LOG_ACC_BUFF_SIZE" : 17
 }
@@ -32,16 +32,36 @@ ifeq (, $(VERILATOR_INC_DIR))
  endif
 endif

-CONFIG = DefaultPynqConfig
+CONFIG = DefaultDe10Config
 TOP = VTA
 TOP_TEST = Test
 BUILD_NAME = build
+# Set USE_TRACE = 1 to generate a trace during simulation.
 USE_TRACE = 0
+# With USE_TRACE = 1, default trace format is VCD.
+# Set USE_TRACE_FST = 1 to use the FST format.
+# Note that although FST is around two orders of magnitude smaller than VCD
+# it is also currently much slower to produce (verilator limitation). But if
+# you are low on disk space it may be your only option.
+USE_TRACE_FST = 0
+# With USE_TRACE = 1, USE_TRACE_DETAILED = 1 will generate traces that also
+# include non-interface internal signal names starting with an underscore.
+# This will significantly increase the trace size and should only be used
+# on a per need basis for difficult debug problems.
+USE_TRACE_DETAILED = 0
 USE_THREADS = $(shell nproc)
 VTA_LIBNAME = libvta_hw
 UNITTEST_NAME = all
 CXX = g++
+# A debug build with DEBUG = 1 is useful to trace the simulation with a
+# debugger.
 DEBUG = 0
+# With DEBUG = 1, SANITIZE = 1 turns on address sanitizing to verify that
+# the verilator build is sane. To be used if you know what you are doing.
+SANITIZE = 0
+CXX_MAJOR := $(shell $(CXX) -dumpversion | sed 's/\..*//')
+CXX_HAS_ALIGN_NEW := $(shell [ $(CXX_MAJOR) -ge 7 ] && echo true)

 config_test = $(TOP_TEST)$(CONFIG)
 vta_dir = $(abspath ../../)
@@ -61,11 +81,15 @@ verilator_opt += -Mdir ${verilator_build_dir}
 verilator_opt += -I$(chisel_build_dir)

 ifeq ($(DEBUG), 0)
-cxx_flags = -O2 -Wall
+cxx_flags = -O2 -Wall -fvisibility=hidden
 else
 cxx_flags = -O0 -g -Wall
 endif

-cxx_flags += -fvisibility=hidden -std=c++11
+cxx_flags += -std=c++11 -Wno-maybe-uninitialized
+ifeq ($(CXX_HAS_ALIGN_NEW),true)
+cxx_flags += -faligned-new
+endif
 cxx_flags += -DVL_TSIM_NAME=V$(TOP_TEST)
 cxx_flags += -DVL_PRINTF=printf
 cxx_flags += -DVL_USER_FINISH
@@ -82,13 +106,33 @@ cxx_flags += -I$(tvm_dir)/3rdparty/dlpack/include

 ld_flags = -fPIC -shared

+ifeq ($(SANITIZE), 1)
+ifeq ($(DEBUG), 1)
+cxx_flags += -fno-omit-frame-pointer -fsanitize=address -fsanitize-recover=address
+ld_flags += -fno-omit-frame-pointer -fsanitize=address -fsanitize-recover=address
+endif
+endif
+
 cxx_objs = $(verilator_build_dir)/verilated.o $(verilator_build_dir)/verilated_dpi.o $(verilator_build_dir)/tsim_device.o

 ifneq ($(USE_TRACE), 0)
-verilator_opt += --trace
 cxx_flags += -DVM_TRACE=1
+ifeq ($(USE_TRACE_FST), 1)
+cxx_flags += -DVM_TRACE_FST
+verilator_opt += --trace-fst
+else
+verilator_opt += --trace
+endif
+ifeq ($(USE_TRACE_DETAILED), 1)
+verilator_opt += --trace-underscore --trace-structs
+endif
+ifeq ($(USE_TRACE_FST), 1)
+cxx_flags += -DTSIM_TRACE_FILE=$(verilator_build_dir)/$(TOP_TEST).fst
+cxx_objs += $(verilator_build_dir)/verilated_fst_c.o
+else
 cxx_flags += -DTSIM_TRACE_FILE=$(verilator_build_dir)/$(TOP_TEST).vcd
 cxx_objs += $(verilator_build_dir)/verilated_vcd_c.o
+endif
 else
 cxx_flags += -DVM_TRACE=0
 endif
...
@@ -45,6 +45,7 @@ class Compute(debug: Boolean = false)(implicit p: Parameters) extends Module {
     val wgt = new TensorMaster(tensorType = "wgt")
     val out = new TensorMaster(tensorType = "out")
     val finish = Output(Bool())
+    val acc_wr_event = Output(Bool())
   })
   val sIdle :: sSync :: sExe :: Nil = Enum(3)
   val state = RegInit(sIdle)
@@ -125,6 +126,7 @@ class Compute(debug: Boolean = false)(implicit p: Parameters) extends Module {
   tensorAcc.io.tensor.rd.idx <> Mux(dec.io.isGemm, tensorGemm.io.acc.rd.idx, tensorAlu.io.acc.rd.idx)
   tensorAcc.io.tensor.wr <> Mux(dec.io.isGemm, tensorGemm.io.acc.wr, tensorAlu.io.acc.wr)
   io.vme_rd(1) <> tensorAcc.io.vme_rd
+  io.acc_wr_event := tensorAcc.io.tensor.wr.valid
   // gemm
   tensorGemm.io.start := state === sIdle & start & dec.io.isGemm
...
@@ -111,6 +111,8 @@ class Core(implicit p: Parameters) extends Module {
   ecounters.io.launch := io.vcr.launch
   ecounters.io.finish := compute.io.finish
   io.vcr.ecnt <> ecounters.io.ecnt
+  io.vcr.ucnt <> ecounters.io.ucnt
+  ecounters.io.acc_wr_event := compute.io.acc_wr_event
   // Finish instruction is executed and asserts the VCR finish flag
   val finish = RegNext(compute.io.finish)
...
@@ -44,6 +44,8 @@ class EventCounters(debug: Boolean = false)(implicit p: Parameters) extends Module {
     val launch = Input(Bool())
     val finish = Input(Bool())
     val ecnt = Vec(vp.nECnt, ValidIO(UInt(vp.regBits.W)))
+    val ucnt = Vec(vp.nUCnt, ValidIO(UInt(vp.regBits.W)))
+    val acc_wr_event = Input(Bool())
   })
   val cycle_cnt = RegInit(0.U(vp.regBits.W))
   when(io.launch && !io.finish) {
@@ -53,4 +55,13 @@ class EventCounters(debug: Boolean = false)(implicit p: Parameters) extends Module {
   }
   io.ecnt(0).valid := io.finish
   io.ecnt(0).bits := cycle_cnt
+  val acc_wr_count = Reg(UInt(vp.regBits.W))
+  when (!io.launch || io.finish) {
+    acc_wr_count := 0.U
+  }.elsewhen (io.acc_wr_event) {
+    acc_wr_count := acc_wr_count + 1.U
+  }
+  io.ucnt(0).valid := io.finish
+  io.ucnt(0).bits := acc_wr_count
 }
@@ -112,11 +112,14 @@ class LoadUop(debug: Boolean = false)(implicit p: Parameters) extends Module {
     when(xcnt === xlen) {
       when(xrem === 0.U) {
         state := sIdle
-      }.elsewhen(xrem < xmax) {
+      }.otherwise {
+        raddr := raddr + xmax_bytes
+        when(xrem < xmax) {
           state := sReadCmd
           xlen := xrem
           xrem := 0.U
-      }.otherwise {
+        }
+        .otherwise {
           state := sReadCmd
           xlen := xmax - 1.U
           xrem := xrem - xmax
@@ -125,6 +128,7 @@ class LoadUop(debug: Boolean = false)(implicit p: Parameters) extends Module {
       }
     }
   }
+  }
   // read-from-dram
   val maskOffset = VecInit(Seq.fill(M_DRAM_OFFSET_BITS)(true.B)).asUInt
@@ -134,8 +138,6 @@ class LoadUop(debug: Boolean = false)(implicit p: Parameters) extends Module {
   }.otherwise {
     raddr := (io.baddr | (maskOffset & (dec.dram_offset << log2Ceil(uopBytes)))) - uopBytes.U
   }
-  }.elsewhen(state === sReadData && xcnt === xlen && xrem =/= 0.U) {
-    raddr := raddr + xmax_bytes
   }
   io.vme_rd.cmd.valid := state === sReadCmd
...
@@ -72,7 +72,6 @@ class AluReg(implicit p: Parameters) extends Module {
 /** Vector of pipeline ALUs */
 class AluVector(implicit p: Parameters) extends Module {
-  val aluBits = p(CoreKey).accBits
   val io = IO(new Bundle {
     val opcode = Input(UInt(C_ALU_OP_BITS.W))
     val acc_a = new TensorMasterData(tensorType = "acc")
...
@@ -103,8 +103,7 @@ class TensorLoad(tensorType: String = "none", debug: Boolean = false)(
         state := sXPad1
       }.elsewhen(dec.ypad_1 =/= 0.U) {
         state := sYPad1
-      }
-      .otherwise {
+      }.otherwise {
         state := sIdle
       }
     }.elsewhen(dataCtrl.io.stride) {
@@ -198,11 +197,9 @@ class TensorLoad(tensorType: String = "none", debug: Boolean = false)(
     tag := tag + 1.U
   }
-  when(
-    state === sIdle || dataCtrlDone || (set === (tp.tensorLength - 1).U && tag === (tp.numMemBlock - 1).U)) {
+  when(state === sIdle || dataCtrlDone || (set === (tp.tensorLength - 1).U && tag === (tp.numMemBlock - 1).U)) {
     set := 0.U
-  }.elsewhen(
-    (io.vme_rd.data.fire() || isZeroPad) && tag === (tp.numMemBlock - 1).U) {
+  }.elsewhen((io.vme_rd.data.fire() || isZeroPad) && tag === (tp.numMemBlock - 1).U) {
     set := set + 1.U
   }
@@ -211,10 +208,12 @@ class TensorLoad(tensorType: String = "none", debug: Boolean = false)(
   when(state === sIdle) {
     waddr_cur := dec.sram_offset
     waddr_nxt := dec.sram_offset
-  }.elsewhen((io.vme_rd.data
-    .fire() || isZeroPad) && set === (tp.tensorLength - 1).U && tag === (tp.numMemBlock - 1).U) {
+  }.elsewhen((io.vme_rd.data.fire() || isZeroPad)
+    && set === (tp.tensorLength - 1).U
+    && tag === (tp.numMemBlock - 1).U)
+  {
     waddr_cur := waddr_cur + 1.U
-  }.elsewhen(dataCtrl.io.stride) {
+  }.elsewhen(dataCtrl.io.stride && io.vme_rd.data.fire()) {
     waddr_cur := waddr_nxt + dec.xsize
     waddr_nxt := waddr_nxt + dec.xsize
   }
@@ -261,8 +260,7 @@ class TensorLoad(tensorType: String = "none", debug: Boolean = false)(
   }
   // done
-  val done_no_pad = io.vme_rd.data
-    .fire() & dataCtrl.io.done & dec.xpad_1 === 0.U & dec.ypad_1 === 0.U
+  val done_no_pad = io.vme_rd.data.fire() & dataCtrl.io.done & dec.xpad_1 === 0.U & dec.ypad_1 === 0.U
   val done_x_pad = state === sXPad1 & xPadCtrl1.io.done & dataCtrlDone & dec.ypad_1 === 0.U
   val done_y_pad = state === sYPad1 & dataCtrlDone & yPadCtrl1.io.done
   io.done := done_no_pad | done_x_pad | done_y_pad
...
@@ -62,20 +62,38 @@ class TensorStore(tensorType: String = "none", debug: Boolean = false)(
   val tag = Reg(UInt(8.W))
   val set = Reg(UInt(8.W))
+  val xfer_bytes = Reg(chiselTypeOf(io.vme_wr.cmd.bits.addr))
+  val xstride_bytes = dec.xstride << log2Ceil(tensorLength * tensorWidth)
+  val maskOffset = VecInit(Seq.fill(M_DRAM_OFFSET_BITS)(true.B)).asUInt
+  val elemBytes = (p(CoreKey).batch * p(CoreKey).blockOut * p(CoreKey).outBits) / 8
+  val pulse_bytes_bits = log2Ceil(mp.dataBits >> 3)
+  val xfer_init_addr = io.baddr | (maskOffset & (dec.dram_offset << log2Ceil(elemBytes)))
+  val xfer_split_addr = waddr_cur + xfer_bytes
+  val xfer_stride_addr = waddr_nxt + xstride_bytes
+  val xfer_init_bytes = xmax_bytes - xfer_init_addr % xmax_bytes
+  val xfer_init_pulses = xfer_init_bytes >> pulse_bytes_bits
+  val xfer_split_bytes = xmax_bytes - xfer_split_addr % xmax_bytes
+  val xfer_split_pulses = xfer_split_bytes >> pulse_bytes_bits
+  val xfer_stride_bytes = xmax_bytes - xfer_stride_addr % xmax_bytes
+  val xfer_stride_pulses = xfer_stride_bytes >> pulse_bytes_bits
   val sIdle :: sWriteCmd :: sWriteData :: sReadMem :: sWriteAck :: Nil = Enum(5)
   val state = RegInit(sIdle)
   // control
   switch(state) {
     is(sIdle) {
-      when(io.start) {
+      xfer_bytes := xfer_init_bytes
+      when (io.start) {
         state := sWriteCmd
-        when(xsize < xmax) {
+        when (xsize < xfer_init_pulses) {
           xlen := xsize
           xrem := 0.U
         }.otherwise {
-          xlen := xmax - 1.U
-          xrem := xsize - xmax
+          xlen := xfer_init_pulses - 1.U
+          xrem := xsize - xfer_init_pulses
         }
       }
     }
@@ -101,24 +119,29 @@ class TensorStore(tensorType: String = "none", debug: Boolean = false)(
       when(xrem === 0.U) {
         when(ycnt === ysize - 1.U) {
           state := sIdle
-        }.otherwise {
+        }.otherwise { // stride
           state := sWriteCmd
-          when(xsize < xmax) {
+          xfer_bytes := xfer_stride_bytes
+          when(xsize < xfer_stride_pulses) {
             xlen := xsize
             xrem := 0.U
           }.otherwise {
-            xlen := xmax - 1.U
-            xrem := xsize - xmax
+            xlen := xfer_stride_pulses - 1.U
+            xrem := xsize - xfer_stride_pulses
           }
         }
-      }.elsewhen(xrem < xmax) {
+      } // split
+      .elsewhen(xrem < xfer_split_pulses) {
         state := sWriteCmd
+        xfer_bytes := xfer_split_bytes
         xlen := xrem
         xrem := 0.U
-      }.otherwise {
+      }
+      .otherwise {
         state := sWriteCmd
-        xlen := xmax - 1.U
-        xrem := xrem - xmax
+        xfer_bytes := xfer_split_bytes
+        xlen := xfer_split_pulses - 1.U
+        xrem := xrem - xfer_split_pulses
       }
     }
   }
@@ -174,8 +197,7 @@ class TensorStore(tensorType: String = "none", debug: Boolean = false)(
   when(state === sIdle) {
     raddr_cur := dec.sram_offset
     raddr_nxt := dec.sram_offset
-  }.elsewhen(io.vme_wr.data
-    .fire() && set === (tensorLength - 1).U && tag === (numMemBlock - 1).U) {
+  }.elsewhen(io.vme_wr.data.fire() && set === (tensorLength - 1).U && tag === (numMemBlock - 1).U) {
     raddr_cur := raddr_cur + 1.U
   }.elsewhen(stride) {
     raddr_cur := raddr_nxt + dec.xsize
@@ -189,18 +211,14 @@ class TensorStore(tensorType: String = "none", debug: Boolean = false)(
   val mdata = MuxLookup(set, 0.U.asTypeOf(chiselTypeOf(wdata_t)), tread)
   // write-to-dram
-  val maskOffset = VecInit(Seq.fill(M_DRAM_OFFSET_BITS)(true.B)).asUInt
-  val elemBytes = (p(CoreKey).batch * p(CoreKey).blockOut * p(CoreKey).outBits) / 8
   when(state === sIdle) {
-    waddr_cur := io.baddr | (maskOffset & (dec.dram_offset << log2Ceil(
-      elemBytes)))
-    waddr_nxt := io.baddr | (maskOffset & (dec.dram_offset << log2Ceil(
-      elemBytes)))
+    waddr_cur := xfer_init_addr
+    waddr_nxt := xfer_init_addr
   }.elsewhen(state === sWriteAck && io.vme_wr.ack && xrem =/= 0.U) {
-    waddr_cur := waddr_cur + xmax_bytes
+    waddr_cur := xfer_split_addr
   }.elsewhen(stride) {
-    waddr_cur := waddr_nxt + (dec.xstride << log2Ceil(tensorLength * tensorWidth))
-    waddr_nxt := waddr_nxt + (dec.xstride << log2Ceil(tensorLength * tensorWidth))
+    waddr_cur := xfer_stride_addr
+    waddr_nxt := xfer_stride_addr
   }
   io.vme_wr.cmd.valid := state === sWriteCmd
...
@@ -252,8 +252,16 @@ class TensorDataCtrl(tensorType: String = "none",
   val caddr = Reg(UInt(mp.addrBits.W))
   val baddr = Reg(UInt(mp.addrBits.W))
   val len = Reg(UInt(mp.lenBits.W))
+  val maskOffset = VecInit(Seq.fill(M_DRAM_OFFSET_BITS)(true.B)).asUInt
+  val elemBytes =
+    if (tensorType == "inp") {
+      (p(CoreKey).batch * p(CoreKey).blockIn * p(CoreKey).inpBits) / 8
+    } else if (tensorType == "wgt") {
+      (p(CoreKey).blockOut * p(CoreKey).blockIn * p(CoreKey).wgtBits) / 8
+    } else {
+      (p(CoreKey).batch * p(CoreKey).blockOut * p(CoreKey).accBits) / 8
+    }
   val xmax_bytes = ((1 << mp.lenBits) * mp.dataBits / 8).U
   val xcnt = Reg(UInt(mp.lenBits.W))
@@ -262,27 +270,53 @@ class TensorDataCtrl(tensorType: String = "none",
   val xmax = (1 << mp.lenBits).U
   val ycnt = Reg(chiselTypeOf(dec.ysize))
+  val xfer_bytes = Reg(UInt(mp.addrBits.W))
+  val pulse_bytes_bits = log2Ceil(mp.dataBits >> 3)
+  val xstride_bytes = dec.xstride << log2Ceil(elemBytes)
+  val xfer_init_addr = io.baddr | (maskOffset & (dec.dram_offset << log2Ceil(elemBytes)))
+  val xfer_split_addr = caddr + xfer_bytes
+  val xfer_stride_addr = baddr + xstride_bytes
+  val xfer_init_bytes = xmax_bytes - xfer_init_addr % xmax_bytes
+  val xfer_init_pulses = xfer_init_bytes >> pulse_bytes_bits
+  val xfer_split_bytes = xmax_bytes - xfer_split_addr % xmax_bytes
+  val xfer_split_pulses = xfer_split_bytes >> pulse_bytes_bits
+  val xfer_stride_bytes = xmax_bytes - xfer_stride_addr % xmax_bytes
+  val xfer_stride_pulses = xfer_stride_bytes >> pulse_bytes_bits
   val stride = xcnt === len &
     xrem === 0.U &
     ycnt =/= dec.ysize - 1.U
   val split = xcnt === len & xrem =/= 0.U
-  when(io.start || (io.xupdate && stride)) {
-    when(xsize < xmax) {
+  when(io.start) {
+    xfer_bytes := xfer_init_bytes
+    when(xsize < xfer_init_pulses) {
       len := xsize
       xrem := 0.U
     }.otherwise {
-      len := xmax - 1.U
-      xrem := xsize - xmax
+      len := xfer_init_pulses - 1.U
+      xrem := xsize - xfer_init_pulses
+    }
+  }.elsewhen(io.xupdate && stride) {
+    xfer_bytes := xfer_stride_bytes
+    when(xsize < xfer_stride_pulses) {
+      len := xsize
+      xrem := 0.U
+    }.otherwise {
+      len := xfer_stride_pulses - 1.U
+      xrem := xsize - xfer_stride_pulses
     }
   }.elsewhen(io.xupdate && split) {
-    when(xrem < xmax) {
+    xfer_bytes := xfer_split_bytes
+    when(xrem < xfer_split_pulses) {
      len := xrem
      xrem := 0.U
    }.otherwise {
-      len := xmax - 1.U
-      xrem := xrem - xmax
+      len := xfer_split_pulses - 1.U
+      xrem := xrem - xfer_split_pulses
    }
  }
@@ -298,25 +332,15 @@ class TensorDataCtrl(tensorType: String = "none",
     ycnt := ycnt + 1.U
   }
-  val maskOffset = VecInit(Seq.fill(M_DRAM_OFFSET_BITS)(true.B)).asUInt
-  val elemBytes =
-    if (tensorType == "inp") {
-      (p(CoreKey).batch * p(CoreKey).blockIn * p(CoreKey).inpBits) / 8
-    } else if (tensorType == "wgt") {
-      (p(CoreKey).blockOut * p(CoreKey).blockIn * p(CoreKey).wgtBits) / 8
-    } else {
-      (p(CoreKey).batch * p(CoreKey).blockOut * p(CoreKey).accBits) / 8
-    }
   when(io.start) {
-    caddr := io.baddr | (maskOffset & (dec.dram_offset << log2Ceil(elemBytes)))
-    baddr := io.baddr | (maskOffset & (dec.dram_offset << log2Ceil(elemBytes)))
+    caddr := xfer_init_addr
+    baddr := xfer_init_addr
   }.elsewhen(io.yupdate) {
     when(split) {
-      caddr := caddr + xmax_bytes
+      caddr := xfer_split_addr
    }.elsewhen(stride) {
-      caddr := baddr + (dec.xstride << log2Ceil(elemBytes))
-      baddr := baddr + (dec.xstride << log2Ceil(elemBytes))
+      caddr := xfer_stride_addr
+      baddr := xfer_stride_addr
    }
  }
...
@@ -34,6 +34,7 @@ case class VCRParams() {
  val nECnt = 1
  val nVals = 1
  val nPtrs = 6
+  val nUCnt = 1
  val regBits = 32
}
@@ -53,6 +54,7 @@ class VCRMaster(implicit p: Parameters) extends VCRBase {
  val ecnt = Vec(vp.nECnt, Flipped(ValidIO(UInt(vp.regBits.W))))
  val vals = Output(Vec(vp.nVals, UInt(vp.regBits.W)))
  val ptrs = Output(Vec(vp.nPtrs, UInt(mp.addrBits.W)))
+  val ucnt = Vec(vp.nUCnt, Flipped(ValidIO(UInt(vp.regBits.W))))
}
/** VCRClient.
@@ -68,6 +70,7 @@ class VCRClient(implicit p: Parameters) extends VCRBase {
  val ecnt = Vec(vp.nECnt, ValidIO(UInt(vp.regBits.W)))
  val vals = Input(Vec(vp.nVals, UInt(vp.regBits.W)))
  val ptrs = Input(Vec(vp.nPtrs, UInt(mp.addrBits.W)))
+  val ucnt = Vec(vp.nUCnt, ValidIO(UInt(vp.regBits.W)))
}
/** VTA Control Registers (VCR).
@@ -100,7 +103,7 @@ class VCR(implicit p: Parameters) extends Module {
  // registers
  val nPtrs = if (mp.addrBits == 32) vp.nPtrs else 2 * vp.nPtrs
-  val nTotal = vp.nCtrl + vp.nECnt + vp.nVals + nPtrs
+  val nTotal = vp.nCtrl + vp.nECnt + vp.nVals + nPtrs + vp.nUCnt
  val reg = Seq.fill(nTotal)(RegInit(0.U(vp.regBits.W)))
  val addr = Seq.tabulate(nTotal)(_ * 4)
@@ -108,6 +111,7 @@ class VCR(implicit p: Parameters) extends Module {
  val eo = vp.nCtrl
  val vo = eo + vp.nECnt
  val po = vo + vp.nVals
+  val uo = po + nPtrs
  switch(wstate) {
    is(sWriteAddress) {
@@ -191,4 +195,12 @@ class VCR(implicit p: Parameters) extends Module {
      io.vcr.ptrs(i) := Cat(reg(po + 2 * i + 1), reg(po + 2 * i))
    }
  }
+  for (i <- 0 until vp.nUCnt) {
+    when(io.vcr.ucnt(i).valid) {
+      reg(uo + i) := io.vcr.ucnt(i).bits
+    }.elsewhen(io.host.w.fire() && addr(uo + i).U === waddr) {
+      reg(uo + i) := wdata
+    }
+  }
}
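The new `uo` offset places the user-count registers right after the pointer block, with each register occupying 4 bytes of host address space. A small Python sketch of the resulting address map (nCtrl=1 is an assumption, since VCRParams' control count is not shown in this hunk; pointers double on a 64-bit address bus as in the code above):

```python
def vcr_addr_map(n_ctrl=1, n_ecnt=1, n_vals=1, n_ptrs=6, n_ucnt=1,
                 addr_bits=32):
    """Byte offsets of each VCR register group, mirroring eo/vo/po/uo."""
    # A 64-bit address bus splits each pointer into two 32-bit registers.
    n_ptrs = n_ptrs if addr_bits == 32 else 2 * n_ptrs
    eo = n_ctrl            # first event-counter register index
    vo = eo + n_ecnt       # first value register index
    po = vo + n_vals       # first pointer register index
    uo = po + n_ptrs       # first user-counter register index
    total = uo + n_ucnt
    # Each register index maps to a 4-byte-aligned host address.
    return {"eo": eo * 4, "vo": vo * 4, "po": po * 4, "uo": uo * 4,
            "size": total * 4}

print(vcr_addr_map())
# {'eo': 4, 'vo': 8, 'po': 12, 'uo': 36, 'size': 40}
```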
@@ -22,8 +22,12 @@
#include <vta/dpi/tsim.h>
#if VM_TRACE
+#ifdef VM_TRACE_FST
+#include <verilated_fst_c.h>
+#else
#include <verilated_vcd_c.h>
#endif
+#endif
#if VM_TRACE
#define STRINGIZE(x) #x
@@ -100,7 +104,11 @@ int VTADPISim() {
#if VM_TRACE
  Verilated::traceEverOn(true);
+#ifdef VM_TRACE_FST
+  VerilatedFstC* tfp = new VerilatedFstC;
+#else
  VerilatedVcdC* tfp = new VerilatedVcdC;
+#endif // VM_TRACE_FST
  top->trace(tfp, 99);
  tfp->open(STRINGIZE_VALUE_OF(TSIM_TRACE_FILE));
#endif
@@ -142,7 +150,7 @@ int VTADPISim() {
#endif
    trace_count++;
    if ((trace_count % 1000000) == 1)
-      fprintf(stderr, "[traced %dM cycles]\n", trace_count / 1000000);
+      fprintf(stderr, "[traced %luM cycles]\n", trace_count / 1000000);
    while (top->sim_wait) {
      top->clock = 0;
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
...
@@ -35,6 +35,8 @@ DEVICE = $(shell $(VTA_CONFIG) --get-fpga-dev)
DEVICE_FAMILY = $(shell $(VTA_CONFIG) --get-fpga-family)
# Project name
PROJECT = de10_nano_top
+# Frequency in MHz
+FREQ_MHZ = $(shell $(VTA_CONFIG) --get-fpga-freq)
#---------------------
# Compilation parameters
@@ -55,7 +57,8 @@ endif
IP_PATH = $(IP_BUILD_PATH)/VTA.DefaultDe10Config.v
# Bitstream file path
-BIT_PATH = $(HW_BUILD_PATH)/export/vta.rbf
+BIT_PATH = $(HW_BUILD_PATH)/export/vta_$(FREQ_MHZ)MHz.rbf
+CPF_OPT := -o bitstream_compression=on
# System design file path
QSYS_PATH = $(HW_BUILD_PATH)/soc_system.qsys
@@ -77,13 +80,16 @@ $(QSYS_PATH): $(IP_PATH)
	cd $(HW_BUILD_PATH) && \
	cp -r $(SCRIPT_DIR)/* $(HW_BUILD_PATH) && \
	python3 $(SCRIPT_DIR)/set_attrs.py -i $(IP_PATH) -o $(HW_BUILD_PATH)/ip/vta/VTAShell.v $(DSP_FLAG) && \
-	qsys-script --script=soc_system.tcl $(DEVICE) $(DEVICE_FAMILY)
+	qsys-script --script=soc_system.tcl $(DEVICE) $(DEVICE_FAMILY) $(FREQ_MHZ)

$(BIT_PATH): $(QSYS_PATH)
	cd $(HW_BUILD_PATH) && \
	quartus_sh -t $(SCRIPT_DIR)/compile_design.tcl $(DEVICE) $(PROJECT) && \
	mkdir -p $(shell dirname $(BIT_PATH)) && \
-	quartus_cpf -c $(HW_BUILD_PATH)/$(PROJECT).sof $(BIT_PATH)
+	quartus_cpf $(CPF_OPT) -c $(HW_BUILD_PATH)/$(PROJECT).sof $(BIT_PATH)

clean:
	rm -rf $(BUILD_DIR)

+clean-qsys:
+	rm -rf $(QSYS_PATH)
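Embedding the target frequency in the bitstream filename keeps builds at different clocks from overwriting each other. A trivial Python sketch of the Makefile's `BIT_PATH` naming (the `build_dir` argument is illustrative, not the Makefile's actual variable):

```python
def bitstream_path(build_dir, freq_mhz):
    """Mirror BIT_PATH: the frequency is baked into the RBF name."""
    return "{}/export/vta_{}MHz.rbf".format(build_dir, freq_mhz)

print(bitstream_path("build/hw", 100))  # build/hw/export/vta_100MHz.rbf
```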
@@ -31,6 +31,9 @@ set_input_delay -clock altera_reserved_tck -clock_fall 3 [get_ports altera_reser
set_input_delay -clock altera_reserved_tck -clock_fall 3 [get_ports altera_reserved_tms]
set_output_delay -clock altera_reserved_tck 3 [get_ports altera_reserved_tdo]
+# Turn off warning on unconstrained LED port.
+set_false_path -to [get_ports {LED[0]}]
# Create Generated Clock
derive_pll_clocks
...
@@ -67,11 +67,15 @@ def server_start():
    @tvm.register_func("tvm.contrib.vta.init", override=True)
    def program_fpga(file_name):
        # pylint: disable=import-outside-toplevel
+        env = get_env()
+        if env.TARGET == "pynq":
            from pynq import xlnk
            # Reset xilinx driver
            xlnk.Xlnk().xlnk_reset()
+        elif env.TARGET == "de10nano":
+            # Load the de10nano program function.
+            load_vta_dll()
        path = tvm.get_global_func("tvm.rpc.server.workpath")(file_name)
-        env = get_env()
        program_bitstream.bitstream_program(env.TARGET, path)
        logging.info("Program FPGA with %s ", file_name)
@@ -90,9 +94,11 @@ def server_start():
        cfg_json : str
            JSON string used for configurations.
        """
+        env = get_env()
        if runtime_dll:
+            if env.TARGET == "de10nano":
+                print("Please reconfigure the runtime AFTER programming a bitstream.")
            raise RuntimeError("Can only reconfig in the beginning of session...")
-        env = get_env()
        cfg = json.loads(cfg_json)
        cfg["TARGET"] = env.TARGET
        pkg = PkgConfig(cfg, proj_root)
...
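The reworked RPC handler now branches on the configured target before programming: pynq boards reset the Xilinx driver, while the de10nano first loads the VTA DLL that registers its program function. A dependency-free sketch of that ordering (the `reset_xlnk`, `load_dll`, and `program` callables are hypothetical stand-ins for the real pynq/DLL hooks):

```python
def program_fpga(target, file_name, reset_xlnk, load_dll, program):
    """Per-target preparation must happen before bitstream programming."""
    if target == "pynq":
        reset_xlnk()   # reset the Xilinx driver first
    elif target == "de10nano":
        load_dll()     # makes vta.de10nano.program available
    program(target, file_name)

calls = []
program_fpga("de10nano", "vta.rbf",
             reset_xlnk=lambda: calls.append("reset"),
             load_dll=lambda: calls.append("dll"),
             program=lambda t, f: calls.append("prog:{}:{}".format(t, f)))
print(calls)  # ['dll', 'prog:de10nano:vta.rbf']
```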
@@ -77,6 +77,12 @@ class PkgConfig(object):
        if self.TARGET in ["pynq", "ultra96"]:
            # add pynq drivers for any board that uses pynq driver stack (see pynq.io)
            self.lib_source += glob.glob("%s/vta/src/pynq/*.cc" % (proj_root))
+        elif self.TARGET in ["de10nano"]:
+            self.lib_source += glob.glob("%s/vta/src/de10nano/*.cc" % (proj_root))
+            self.include_path += [
+                "-I%s/vta/src/de10nano" % proj_root,
+                "-I%s/3rdparty" % proj_root
+            ]
        # Linker flags
        if self.TARGET in ["pynq", "ultra96"]:
...
@@ -19,7 +19,7 @@ import os
import argparse

def main():
-    """Main funciton"""
+    """Main function"""
    parser = argparse.ArgumentParser()
    parser.add_argument("target", type=str, default="",
                        help="target")
@@ -27,7 +27,7 @@ def main():
                        help="bitstream path")
    args = parser.parse_args()
-    if (args.target != 'pynq' and args.target != 'sim'):
+    if args.target not in ('pynq', 'ultra96', 'de10nano', 'sim', 'tsim'):
        raise RuntimeError("Unknown target {}".format(args.target))
    curr_path = os.path.dirname(
@@ -48,9 +48,17 @@ def pynq_bitstream_program(bitstream_path):
    bitstream = Bitstream(bitstream_path)
    bitstream.download()

+def de10nano_bitstream_program(bitstream_path):
+    # pylint: disable=import-outside-toplevel
+    from tvm import get_global_func
+    program = get_global_func("vta.de10nano.program")
+    program(bitstream_path)

def bitstream_program(target, bitstream):
    if target in ['pynq', 'ultra96']:
        pynq_bitstream_program(bitstream)
+    elif target in ['de10nano']:
+        de10nano_bitstream_program(bitstream)
    elif target in ['sim', 'tsim']:
        # In simulation, bit stream programming is a no-op
        return
...
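The widened target check above now accepts every supported board plus the simulation backends. A minimal sketch of the dispatch-with-validation pattern (the `pynq_fn`/`de10nano_fn` callables are illustrative stand-ins for the per-board programmers):

```python
VALID_TARGETS = ("pynq", "ultra96", "de10nano", "sim", "tsim")

def bitstream_program(target, bitstream, pynq_fn, de10nano_fn):
    """Route programming by target; simulation targets are a no-op."""
    if target not in VALID_TARGETS:
        raise RuntimeError("Unknown target {}".format(target))
    if target in ("pynq", "ultra96"):
        pynq_fn(bitstream)
    elif target == "de10nano":
        de10nano_fn(bitstream)
    # 'sim' / 'tsim': nothing to program

log = []
bitstream_program("de10nano", "vta.rbf",
                  pynq_fn=log.append,
                  de10nano_fn=lambda b: log.append("de10:" + b))
print(log)  # ['de10:vta.rbf']
```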
@@ -49,6 +49,9 @@ def program_fpga(remote, bitstream=None):
    else:
        bitstream = get_bitstream_path()
        if not os.path.isfile(bitstream):
+            env = get_env()
+            if env.TARGET == 'de10nano':
+                return
            download_bitstream()
    fprogram = remote.get_function("tvm.contrib.vta.init")
...
@@ -59,8 +59,8 @@ def run(run_func):
        tracker_port = os.environ.get("TVM_TRACKER_PORT", None)
        # Otherwise, we can set the variables below to directly
        # obtain a remote from a test device
-        pynq_host = os.environ.get("VTA_PYNQ_RPC_HOST", None)
-        pynq_port = os.environ.get("VTA_PYNQ_RPC_PORT", None)
+        pynq_host = os.environ.get("VTA_RPC_HOST", None)
+        pynq_port = os.environ.get("VTA_RPC_PORT", None)
        # Run device from fleet node if env variables are defined
        if tracker_host and tracker_port:
            remote = autotvm.measure.request_remote(env.TARGET,
@@ -75,7 +75,7 @@ def run(run_func):
            run_func(env, remote)
        else:
            raise RuntimeError(
-                "Please set the VTA_PYNQ_RPC_HOST and VTA_PYNQ_RPC_PORT environment variables")
+                "Please set the VTA_RPC_HOST and VTA_RPC_PORT environment variables")
    else:
        raise RuntimeError("Unknown target %s" % env.TARGET)
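The rename above makes the RPC endpoint variables board-agnostic (`VTA_RPC_HOST`/`VTA_RPC_PORT` instead of the pynq-specific names). A small sketch of the lookup, with the tutorial defaults assumed:

```python
import os

def rpc_endpoint(default_host="192.168.2.99", default_port="9091"):
    """Read the board-agnostic RPC endpoint (renamed from VTA_PYNQ_RPC_*)."""
    host = os.environ.get("VTA_RPC_HOST", default_host)
    port = int(os.environ.get("VTA_RPC_PORT", default_port))
    return host, port

os.environ["VTA_RPC_HOST"] = "de10nano"
os.environ["VTA_RPC_PORT"] = "9091"
print(rpc_endpoint())  # ('de10nano', 9091)
```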
@@ -27,6 +27,8 @@
extern "C" {
#endif

+#include <stddef.h>

/**
 * \brief Initialize CMA api (basically perform open() syscall).
 *
...
@@ -21,11 +21,14 @@
 */
#include "de10nano_driver.h"
+#include "de10nano_mgr.h"

#include <string.h>
#include <vta/driver.h>
+#include <tvm/runtime/registry.h>
#include <dmlc/logging.h>
#include <thread>
+#include <string>
#include "cma_api.h"

void* VTAMemAlloc(size_t size, int cached) {
@@ -72,12 +75,16 @@ void *VTAMapRegister(uint32_t addr) {
  uint32_t virt_offset = addr - virt_base;
  // Open file and mmap
  uint32_t mmap_file = open("/dev/mem", O_RDWR|O_SYNC);
-  return mmap(NULL,
+  // Note that if virt_offset != 0, i.e. addr is not page aligned
+  // munmap will not be unmapping all memory.
+  void *vmem = mmap(NULL,
    (VTA_IP_REG_MAP_RANGE + virt_offset),
    PROT_READ|PROT_WRITE,
    MAP_SHARED,
    mmap_file,
    virt_base);
+  close(mmap_file);
+  return vmem;
}

void VTAUnmapRegister(void *vta) {
@@ -149,6 +156,24 @@ int VTADeviceRun(VTADeviceHandle handle,
                 insn_phy_addr, insn_count, wait_cycles);
}

-void VTAProgram(const char* bitstream) {
-  CHECK(false) << "VTAProgram not implemented for de10nano";
+void VTAProgram(const char *rbf) {
+  De10NanoMgr mgr;
+  CHECK(mgr.mapped()) << "de10nano: mapping of /dev/mem failed";
+  CHECK(mgr.program_rbf(rbf)) << "Programming of the de10nano failed.\n"
+    "This is usually due to the use of an RBF file that is incompatible "
+    "with the MSEL switches on the DE10-Nano board. The recommended RBF "
+    "format is FastPassiveParallel32 with compression enabled, "
+    "corresponding to MSEL 01010. An RBF file in FPP32 mode can be "
+    "generated in a Quartus session with the command "
+    "'quartus_cpf -o bitstream_compression=on -c <file>.sof <file>.rbf'.";
}
+
+using tvm::runtime::TVMRetValue;
+using tvm::runtime::TVMArgs;
+
+TVM_REGISTER_GLOBAL("vta.de10nano.program")
+.set_body([](TVMArgs args, TVMRetValue* rv) {
+    std::string bitstream = args[0];
+    VTAProgram(bitstream.c_str());
+  });
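The new comment in `VTAMapRegister` flags that a non-page-aligned `addr` leaves the leading `virt_offset` bytes behind when the caller later unmaps only the returned pointer. A Python sketch of the alignment arithmetic used around the `mmap` call (4 KiB pages assumed):

```python
PAGE_SIZE = 4096

def map_window(addr, reg_map_range):
    """Page-aligned base, in-page offset, and total length handed to mmap."""
    virt_base = addr & ~(PAGE_SIZE - 1)   # round down to a page boundary
    virt_offset = addr - virt_base        # extra bytes mapped before addr
    length = reg_map_range + virt_offset  # mmap length must cover the offset
    return virt_base, virt_offset, length

# A register block at a non-aligned address maps 4 extra leading bytes.
base, off, length = map_window(0xFF220004, 0x1000)
```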
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
import sys, os
import tvm
from tvm import rpc
from vta import get_bitstream_path, download_bitstream, program_fpga, reconfig_runtime
host = os.environ.get("VTA_RPC_HOST", "de10nano")
port = int(os.environ.get("VTA_RPC_PORT", "9091"))
def program_rpc_bitstream(path=None):
"""Program the FPGA on the RPC server
Parameters
----------
path : path to bitstream (optional)
"""
assert tvm.runtime.enabled("rpc")
remote = rpc.connect(host, port)
program_fpga(remote, path)
def reconfig_rpc_runtime():
"""Reconfig the RPC server runtime
"""
assert tvm.runtime.enabled("rpc")
remote = rpc.connect(host, port)
reconfig_runtime(remote)
bitstream = sys.argv[1] if len(sys.argv) == 2 else None
program_rpc_bitstream(bitstream)
reconfig_rpc_runtime()
@@ -20,8 +20,8 @@ from tvm import te
from tvm import rpc
from vta import get_bitstream_path, download_bitstream, program_fpga, reconfig_runtime

-host = os.environ.get("VTA_PYNQ_RPC_HOST", "pynq")
-port = int(os.environ.get("VTA_PYNQ_RPC_PORT", "9091"))
+host = os.environ.get("VTA_RPC_HOST", "pynq")
+port = int(os.environ.get("VTA_RPC_PORT", "9091"))

def program_rpc_bitstream(path=None):
    """Program the FPGA on the RPC server
...
@@ -109,8 +109,8 @@ if env.TARGET not in ["sim", "tsim"]:
    # Otherwise if you have a device you want to program directly from
    # the host, make sure you've set the variables below to the IP of
    # your board.
-    device_host = os.environ.get("VTA_PYNQ_RPC_HOST", "192.168.2.99")
-    device_port = os.environ.get("VTA_PYNQ_RPC_PORT", "9091")
+    device_host = os.environ.get("VTA_RPC_HOST", "192.168.2.99")
+    device_port = os.environ.get("VTA_RPC_PORT", "9091")
    if not tracker_host or not tracker_port:
        remote = rpc.connect(device_host, int(device_port))
    else:
...
@@ -149,8 +149,8 @@ if env.TARGET not in ["sim", "tsim"]:
    # Otherwise if you have a device you want to program directly from
    # the host, make sure you've set the variables below to the IP of
    # your board.
-    device_host = os.environ.get("VTA_PYNQ_RPC_HOST", "192.168.2.99")
-    device_port = os.environ.get("VTA_PYNQ_RPC_PORT", "9091")
+    device_host = os.environ.get("VTA_RPC_HOST", "192.168.2.99")
+    device_port = os.environ.get("VTA_RPC_PORT", "9091")
    if not tracker_host or not tracker_port:
        remote = rpc.connect(device_host, int(device_port))
    else:
...
@@ -47,12 +47,12 @@ from vta.testing import simulator
env = vta.get_env()

# We read the Pynq RPC host IP address and port number from the OS environment
-host = os.environ.get("VTA_PYNQ_RPC_HOST", "192.168.2.99")
-port = int(os.environ.get("VTA_PYNQ_RPC_PORT", "9091"))
+host = os.environ.get("VTA_RPC_HOST", "192.168.2.99")
+port = int(os.environ.get("VTA_RPC_PORT", "9091"))

# We configure both the bitstream and the runtime system on the Pynq
# to match the VTA configuration specified by the vta_config.json file.
-if env.TARGET == "pynq":
+if env.TARGET == "pynq" or env.TARGET == "de10nano":

    # Make sure that TVM was compiled with RPC=1
    assert tvm.runtime.enabled("rpc")
...
@@ -51,8 +51,8 @@ from vta.testing import simulator
env = vta.get_env()

# We read the Pynq RPC host IP address and port number from the OS environment
-host = os.environ.get("VTA_PYNQ_RPC_HOST", "192.168.2.99")
-port = int(os.environ.get("VTA_PYNQ_RPC_PORT", "9091"))
+host = os.environ.get("VTA_RPC_HOST", "192.168.2.99")
+port = int(os.environ.get("VTA_RPC_PORT", "9091"))

# We configure both the bitstream and the runtime system on the Pynq
# to match the VTA configuration specified by the vta_config.json file.
...
@@ -50,8 +50,8 @@ from vta.testing import simulator
env = vta.get_env()

# We read the Pynq RPC host IP address and port number from the OS environment
-host = os.environ.get("VTA_PYNQ_RPC_HOST", "192.168.2.99")
-port = int(os.environ.get("VTA_PYNQ_RPC_PORT", "9091"))
+host = os.environ.get("VTA_RPC_HOST", "192.168.2.99")
+port = int(os.environ.get("VTA_RPC_PORT", "9091"))

# We configure both the bitstream and the runtime system on the Pynq
# to match the VTA configuration specified by the vta_config.json file.
...
@@ -71,12 +71,12 @@ from tvm.contrib import util
from vta.testing import simulator

# We read the Pynq RPC host IP address and port number from the OS environment
-host = os.environ.get("VTA_PYNQ_RPC_HOST", "192.168.2.99")
-port = int(os.environ.get("VTA_PYNQ_RPC_PORT", "9091"))
+host = os.environ.get("VTA_RPC_HOST", "192.168.2.99")
+port = int(os.environ.get("VTA_RPC_PORT", "9091"))

# We configure both the bitstream and the runtime system on the Pynq
# to match the VTA configuration specified by the vta_config.json file.
-if env.TARGET == "pynq":
+if env.TARGET == "pynq" or env.TARGET == "de10nano":
    # Make sure that TVM was compiled with RPC=1
    assert tvm.runtime.enabled("rpc")
...