Joint channel estimation and data detection (JED) enables near-optimal error-rate performance in realistic wireless communication systems that suffer from channel estimation errors. In this paper, we propose a new JED algorithm and a corresponding FPGA design for large single-input multiple-output (SIMO) wireless systems that use constant-modulus constellations. Our algorithm, referred to as PrOX (short for PRojection Onto conveX hull), relies on biconvex relaxation (BCR) to efficiently compute an approximate solution to the maximum-likelihood JED problem, which exhibits prohibitive complexity. PrOX is a simple and hardware-friendly algorithm that achieves near-optimal error-rate performance for a wide range of system configurations. To demonstrate the efficacy of PrOX, we develop a scalable VLSI architecture and present reference implementation results on a Xilinx Virtex-7 FPGA. Compared to a recently reported reference JED design, PrOX achieves 3x higher throughput, 20x better hardware efficiency (in terms of throughput per look-up table), and 8x better energy efficiency.