Advances in VLSI have made it attractive to package multiple processors into a single multichip module or board module. There is an increasing trend towards using such processor clusters in large multiprocessor designs. Past research on processor-cluster based systems has focused mainly on the packaging technologies affecting the inter-cluster network. To make processor-cluster based multiprocessor design more attractive, there is a strong need to understand the topology inside the cluster, its memory organization, and the impact of this organization on system performance. In this paper we focus on these aspects of processor-cluster design, with the overall objective of supporting a logically shared address programming model. We analyze the communication costs of accessing inter-cluster and intra-cluster memories under different cluster organizations. The merits of these organizations are evaluated based on the performance of collective communication algorithms, which occur frequently in applications. In particular, we implement the Umesh broadcast algorithm on clustered systems. Our results indicate that cluster organizations such as bus and crossbar, which allow memory inside a cluster to be accessed without messaging overheads, outperform other organizations because of faster intra-cluster access. We also demonstrate that such faster access can be exploited to design better algorithms for clustered systems. We propose a new algorithm, clus mesh, for broadcasting on clustered meshes. When communication within clusters is reasonably fast, this algorithm can outperform the existing Umesh algorithm by up to 20%.