A Framework for Evaluating Prompt Engineering Techniques for PHP Code Generation Using Open-Source Large Language Models
DOI:
https://doi.org/10.63561/jca.v3i1.1204
Keywords:
Prompt Engineering, Code Generation, Large Language Models, Open-Source LLMs, Software Engineering
Abstract
As generative Artificial Intelligence (AI) systems become increasingly integrated into software development workflows, there is a need for rigorous and reproducible evaluation of how prompt engineering techniques influence the quality, reliability, and efficiency of code generated by open-source large language models (LLMs). This paper presents a structured empirical evaluation framework to assess the effects of three prompt techniques (zero-shot, few-shot, and chain-of-thought) on PHP code generation using three commonly adopted open-source models (CodeLLaMA, Mistral, and StarCoder2). A reproducible experimental pipeline was developed to execute controlled prompt–model templates across coding tasks of three difficulty levels (easy, medium, and difficult). The framework automatically measures multiple performance metrics, including functional accuracy (pass@1 and pass@10), execution time, memory usage, lines of code, cyclomatic complexity, and coding standard violations. The results indicated that few-shot prompting was the most effective strategy across all models, followed by zero-shot prompting, which improved with repeated sampling (pass@10). Tasks with reasoning requirements benefited from chain-of-thought prompting in some respects, although it produced longer and more complex code. A two-way Analysis of Variance (ANOVA) of model variability indicated that the underlying model architecture remains a significant factor in performance differences. Execution time and memory usage were similar across all prompting strategies, while the code quality measures showed structural differences in the generated code that depended on prompt design.
These results indicate that prompt engineering influences observable performance trends within the limits set by the model’s architecture and the task, hence the need to consider prompt technique, model capacity, task type, and evaluation methodology together. Beyond contributing to the body of empirical research, the assessment framework supports the development of standardized benchmarking tools for the PHP ecosystem. The framework is a foundation for future extension to additional metrics and for reproducible assessment of open-source LLMs across programming languages and task domains.
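The abstract reports functional accuracy as pass@1 and pass@10 but does not state how these were computed. As an illustration only, the sketch below uses the unbiased pass@k estimator introduced by Chen et al. (2021), in which n samples are generated per task and c of them pass the tests. Whether this framework uses that estimator is an assumption, and the function names and sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task (Chen et al., 2021).

    n: number of samples generated for the task
    c: number of those samples that pass all tests
    k: sampling budget being evaluated (k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every draw of k contains a passing one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; each entry is (n, c) for one task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical counts: three tasks, ten samples each.
    counts = [(10, 3), (10, 0), (10, 10)]
    print("pass@1 :", mean_pass_at_k(counts, 1))
    print("pass@10:", mean_pass_at_k(counts, 10))
```

Under this estimator, pass@10 with ten samples per task is simply the share of tasks with at least one passing sample, which is why repeated sampling can raise a strategy's score even when its pass@1 is low.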


