Large Language Models for Data Augmentation in Concrete Engineering: A Machine Learning Study on Biochar Cement Replacement
The application of machine learning in concrete technology has expanded rapidly, yet its reliability is often constrained by limited experimental data, heterogeneous testing conditions, and inconsistencies across published studies. This study investigates the integration of machine learning and synthetic data augmentation to predict the compressive strength of concrete incorporating biochar as a partial replacement for cement. An experimental dataset was compiled from peer-reviewed journal articles indexed in Web of Science, focusing on biochar-modified concrete mixtures. Input variables included cement content, fine and coarse aggregates, biochar dosage, water-to-binder ratio, superplasticizer content, and curing age, with compressive strength as the target variable. Extreme Gradient Boosting was adopted due to its strong performance on nonlinear tabular data. Model performance was evaluated using the mean absolute error (MAE), mean squared error (MSE), and coefficient of determination (R²), alongside five-fold cross-validation. Hyperparameter optimization was performed using Optuna. To address data scarcity, a synthetic dataset of 1000 samples was generated using ChatGPT. The large language model approach relied solely on natural language prompts: only feature definitions and the target variable were provided, without exposing the original data or implementing data generation algorithms. Three modeling strategies were examined. First, a model trained and tested solely on experimental data achieved a testing R² of approximately 0.91. Second, a model trained on synthetic data and evaluated exclusively on experimental data showed reduced generalization, achieving a testing R² of about 0.42, indicating pronounced domain shift effects. Third, when synthetic and experimental data were combined through data augmentation and jointly modeled, a testing R² of 0.93 was achieved. These results show that LLM-based data augmentation improved model performance.
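As a point of reference for the evaluation protocol described above, the three reported metrics can be computed directly from their definitions. The following is a minimal sketch in plain Python, using made-up strength values (in MPa) purely for illustration; the study's actual pipeline uses XGBoost with Optuna tuning, which reports the same quantities.

```python
def mae(y_true, y_pred):
    # Mean absolute error: average magnitude of prediction errors
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    # Mean squared error: penalizes large errors quadratically
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical compressive strengths (MPa), not taken from the study's data
y_true = [30.0, 42.5, 55.0]
y_pred = [28.0, 44.0, 54.0]
print(mae(y_true, y_pred))  # 1.5
print(mse(y_true, y_pred))  # 2.4166...
print(r2(y_true, y_pred))   # 0.9768
```

An R² near 1 indicates that the model explains most of the variance in compressive strength, which is why the drop from about 0.91 (experimental-only training) to about 0.42 (synthetic-only training) signals a substantial domain shift.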