Handling imbalanced datasets in binary classification, especially in employment big data, is challenging. Traditional methods such as oversampling and undersampling have limitations. This paper integrates TabNet and Generative Adversarial Networks (GANs) to address class imbalance: the generator creates synthetic samples for the minority class, while a TabNet-based discriminator ensures their authenticity. Evaluations on benchmark datasets show significant improvements in accuracy, precision, recall, and F1-score for the minority class, outperforming traditional methods. This integration offers a robust solution for imbalanced employment big data, leading to fairer and more effective predictive models.
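A minimal PyTorch sketch of the described generator-discriminator setup follows. It is not the paper's implementation: a plain MLP stands in for the TabNet discriminator, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

N_FEATURES, NOISE_DIM = 32, 64  # hypothetical tabular width and noise size

# Generator: noise -> synthetic minority-class feature vector.
generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 128), nn.ReLU(),
    nn.Linear(128, N_FEATURES),
)

# Discriminator: real/fake score. An MLP stands in here; the paper's
# setup would use a TabNet encoder with a linear scoring head.
discriminator = nn.Sequential(
    nn.Linear(N_FEATURES, 128), nn.ReLU(),
    nn.Linear(128, 1),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_minority_batch):
    """One adversarial update on a batch of real minority-class rows."""
    b = real_minority_batch.size(0)
    fake = generator(torch.randn(b, NOISE_DIM))

    # Discriminator learns: real rows -> 1, generated rows -> 0.
    d_loss = bce(discriminator(real_minority_batch), torch.ones(b, 1)) \
           + bce(discriminator(fake.detach()), torch.zeros(b, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator learns to make the discriminator score fakes as real.
    g_loss = bce(discriminator(fake), torch.ones(b, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

After training, sampling the generator yields synthetic minority-class rows that can be appended to the training set before fitting the final classifier.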
To predict the process window of laser powder bed fusion (LPBF) for printing metallic components, the volumetric energy density (VED) has been widely used to control process parameters. However, because VED assumes that the process parameters contribute equally to heat input, it remains limited for predicting the process window of LPBF-processed materials. In this study, an explainable machine learning (xML) approach was adopted to predict and understand the contribution of each process parameter to defect evolution in Ti alloys during the LPBF process. Various ML models were trained, and the Shapley additive explanations (SHAP) method was adopted to quantify the importance of each process parameter. This study offers effective guidelines for fine-tuning process parameters to fabricate high-quality products using LPBF.
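For reference, VED combines the main scan parameters into a single heat-input measure; a commonly used form (the exact variant adopted in the study is not restated here) is

$$\mathrm{VED} = \frac{P}{v \cdot h \cdot t}$$

where $P$ is the laser power (W), $v$ the scan speed (mm/s), $h$ the hatch spacing (mm), and $t$ the layer thickness (mm), yielding J/mm³. The limitation noted above follows directly from this form: doubling $P$ while doubling $v$ leaves VED unchanged, even though the melt pool and defect behavior respond to the two changes differently, which is why a per-parameter importance analysis such as SHAP is needed.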
Truck no-show behavior has posed significant disruptions to the planning and execution of port operations. By identifying the key factors that contribute to truck appointment no-shows and proactively predicting such behavior, it becomes possible to make preemptive adjustments to port operation plans, thereby enhancing overall operational efficiency. Considering the data imbalance and the impact of each decision tree's accuracy on the performance of the random forest model, a model based on the Borderline Synthetic Minority Over-sampling Technique and Weighted Random Forest (BSMOTE-WRF) is proposed to predict truck appointment no-shows and to explore the relationship between no-shows and factors such as weather conditions, appointment time slot, the number of truck appointments, and traffic conditions. To illustrate the effectiveness of the proposed model, experiments were conducted on a dataset from the Tianjin Port Second Container Terminal. The prediction accuracy of the BSMOTE-WRF model is improved by 4%-5% compared with logistic regression, random forest, and support vector machines. The importance ranking of factors affecting truck no-shows indicates that (1) the number of truck appointments during specific time slots has the highest impact on no-show behavior, and the congestion coefficient has the second-highest and also significant impact; (2) compared to the number of appointments and the congestion coefficient, the impact of severe weather is relatively low but still present; and (3) although the impact of appointment time slots is lower than that of the other factors, the influence of specific time slots should not be overlooked. The BSMOTE-WRF model effectively analyzes the influencing factors and predicts truck no-show behavior in appointment-based systems.
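A minimal sketch of the BSMOTE-plus-weighted-forest pipeline, using imbalanced-learn and scikit-learn. The synthetic data and the `class_weight="balanced"` weighting are stand-ins; the paper's tree-level weighting scheme may differ.

```python
from imblearn.over_sampling import BorderlineSMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for appointment records: features such as
# weather, time slot, appointment count, congestion coefficient;
# label 1 = no-show (the minority class).
X, y = make_classification(n_samples=5000, n_features=8,
                           weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          test_size=0.2, random_state=42)

# Borderline-SMOTE oversamples only minority points near the class
# boundary, rather than the whole minority class.
X_res, y_res = BorderlineSMOTE(random_state=42).fit_resample(X_tr, y_tr)

# class_weight="balanced" approximates the "weighted" part of WRF.
clf = RandomForestClassifier(n_estimators=300, class_weight="balanced",
                             random_state=42).fit(X_res, y_res)
print(classification_report(y_te, clf.predict(X_te)))
```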
The importance of Structural Health Monitoring (SHM) in industry is increasing because various loads, such as earthquakes and wind, significantly affect the performance of structures and equipment. Estimating responses is crucial for the effective health management of these assets. However, installing numerous sensors in facilities and equipment for response estimation poses economic challenges, and responses may be required at locations where sensors cannot be attached. Digital twin technology has garnered significant attention in industry as a way to address these challenges. This paper constructs a digital twin system that uses a Long Short-Term Memory (LSTM) model to estimate the responses of a pipe system under simultaneous seismic and arbitrary loads. The performance of the data-driven digital twin system was verified through comparative analysis against experimental data, demonstrating that the constructed system successfully estimated the responses.
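The core of such a data-driven twin is a sequence model mapping measured sensor signals to responses at unmeasured locations. A minimal PyTorch sketch, with illustrative (not the paper's) dimensions:

```python
import torch
import torch.nn as nn

class ResponseEstimator(nn.Module):
    """Maps a window of measured sensor signals to the response at an
    unmeasured location. All dimensions here are hypothetical."""
    def __init__(self, n_sensors=4, hidden=64, n_targets=1):
        super().__init__()
        self.lstm = nn.LSTM(n_sensors, hidden, num_layers=2,
                            batch_first=True)
        self.head = nn.Linear(hidden, n_targets)

    def forward(self, x):             # x: (batch, time, n_sensors)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # response at the window's last step

model = ResponseEstimator()
x = torch.randn(16, 200, 4)           # 16 windows of 200 time steps
pred = model(x)                       # (16, 1) estimated responses
```

Trained on windows of measured excitation and response pairs, the model can then be queried in real time as the virtual counterpart of the physical pipe system.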
The construction industry stands out for its higher incidence of accidents compared to other sectors. A causal analysis of these accidents is necessary for effective prevention. In this study, we propose a data-driven causal analysis to identify significant factors in fatal construction accidents. We collected 14,318 cases of structured and text data on construction accidents from the Construction Safety Management Integrated Information (CSI) system. For the variables in the collected dataset, we first analyze their patterns and correlations with fatal construction accidents through statistical analysis. In addition, machine learning algorithms are employed to develop a classification model for fatal accidents. The integration of SHAP (SHapley Additive exPlanations) allows for the identification of root causes driving fatal incidents. The results reveal the significant factors and keywords wielding notable influence over fatal accidents in construction contexts.
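A brief sketch of the SHAP step on a tree-based classifier. The abstract does not name the algorithm; XGBoost and the synthetic data below are assumptions for illustration.

```python
import shap
import xgboost
from sklearn.datasets import make_classification

# Hypothetical stand-in for CSI accident features and fatality labels.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

model = xgboost.XGBClassifier(n_estimators=200).fit(X, y)

# TreeExplainer attributes each prediction to the input features,
# exposing which factors push a case toward the "fatal" class.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)   # global importance ranking
```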
In recent decades, data has emerged as a core element of corporate management. Many organizations use data to make strategic decisions and to respond proactively to market changes. Against this backdrop, this study examines data-driven decision-making organizations and the factors that influence them. In an era of continuous digital transformation, data-driven decision-making plays a critical role in improving organizational performance. However, research on the antecedents of data-driven decision-making organizations, and on how data-driven decision-making actually takes place within firms, remains scarce. This study hypothesizes that the degree of digitalization of a firm's value chain significantly affects the establishment of a data-driven decision-making organization, and tests this hypothesis with survey data from 1,059 employees of Korean firms. In addition, considering that talent with digital competencies, including data analysis skills, can serve as an important environmental condition for a data-driven decision-making organization, this study hypothesizes and statistically tests the moderating effect of digital talent readiness on the relationship between value chain digitalization and the establishment of a data-driven decision-making organization. The results broaden the understanding of how data-driven decision-making organizations are formed and operated, and offer useful insights into how firms can use data effectively in their decision-making. From a practical perspective, the findings are expected to provide important implications for companies developing and implementing their own data strategies.
PURPOSES : The objective of this study is to develop a data-driven pavement condition index that considers the traffic and climatic characteristics of Incheon city. METHODS : The Incheon Pavement Condition Index (IPCI) was proposed using a weighted-sum concept with standardization and the coefficient of variation for measured pavement performance data, such as crack rate, rut depth, and the International Roughness Index (IRI). A correlation study with the National Highway Pavement Condition Index (NHPCI) and the Seoul Pavement Condition Index (SPI) was conducted to validate the accuracy of the IPCI. RESULTS : The equation for determining the IPCI was developed using standardization and the coefficient of variation for the crack rate, rut depth, and IRI collected in the field. Statistical analysis showed that the IPCI weight factor for the crack rate was twice as high as those for the rut depth and IRI. The IPCI also correlated closely with the NHPCI and SPI, albeit with some scattering. The correlation study between the NHPCI and SPI indicates that the existing pavement condition indices do not consider the asymmetry of the original measured data. CONCLUSIONS : The proposed pavement condition index provides a value that reflects the characteristics of the raw data measured in the field. The developed index can be used to determine the timing and method of pavement repair and to establish pavement maintenance and rehabilitation strategies in Incheon.
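One plausible reading of the weighted-sum construction, sketched below: each distress measure is standardized and combined with fixed weights. The 2:1:1 weighting follows the reported finding that the crack-rate weight is twice those of rut depth and IRI; the exact IPCI equation and the role of the coefficient of variation in deriving the weights are not reproduced here.

```python
import numpy as np

def ipci_sketch(crack_rate, rut_depth, iri):
    """Illustrative weighted-sum condition score (not the actual IPCI):
    standardize each measure, then combine with 2:1:1 weights."""
    X = np.column_stack([crack_rate, rut_depth, iri]).astype(float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardization
    w = np.array([2.0, 1.0, 1.0])
    w = w / w.sum()
    return Z @ w                               # higher = worse condition

# Hypothetical field measurements for three pavement sections.
scores = ipci_sketch([12.1, 3.4, 25.0], [6.2, 2.1, 9.8], [3.1, 1.8, 4.6])
print(scores)
```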
Evaluating the quantitative damage to rocks through acoustic emission (AE) has become a research focus. Most studies have used only one or two AE parameters to evaluate the degree of damage; combinations of several AE parameters have rarely been used. In this study, several data-driven models were employed to reflect the combined features of AE parameters. Through uniaxial compression tests, we obtained mechanical and AE-signal data for five granite specimens. The maximum amplitude, hits, counts, rise time, absolute energy, and initiation frequency, each expressed as a cumulative value, were selected as input parameters. Gradient boosting (GB) performed best among the evaluated models, which included support vector regression methods. When GB was applied to the testing data, the R and root-mean-square error between the predicted and actual values were 0.96 and 0.077, respectively. A parameter analysis was performed to capture the parameter significance, showing that cumulative absolute energy was the main parameter for damage prediction. Thus, AE has practical applicability in predicting rock damage without conducting mechanical tests. These results will be useful for monitoring the near-field rock mass of nuclear waste repositories.
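A short scikit-learn sketch of the GB regression and parameter-significance step. The synthetic data merely mimics the setting (six cumulative AE features, a normalized damage target); it is not the granite dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical cumulative AE features: amplitude, hits, counts,
# rise time, absolute energy, initiation frequency.
rng = np.random.default_rng(0)
X = rng.random((500, 6))
y = 0.6 * X[:, 4] + 0.4 * X[:, 0] + 0.05 * rng.standard_normal(500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
gb = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

pred = gb.predict(X_te)
print("RMSE:", mean_squared_error(y_te, pred) ** 0.5)
print("R2:", r2_score(y_te, pred))

# feature_importances_ gives a quick significance ranking, analogous
# to the finding that cumulative absolute energy (index 4) dominates.
print(gb.feature_importances_)
```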
The purpose of this study was to analyze six English as a Foreign Language (EFL) learners' trajectories of discriminating near-synonyms in a data-driven learning task. Since learners find it considerably difficult to learn the subtle meaning differences between near-synonyms, corpus-based data-driven learning may provide an opportunity for them to tackle these difficulties. The study materials guided the learners to identify the differences between four pairs of near-synonyms, categorize the concordance lines based on their findings, and generalize the findings. The six participants had notably different trajectories of discriminating near-synonyms. The qualitative analysis of the trajectories showed a tendency for the intermediate learners to focus on the meanings and find the correct answer without knowing the core meaning, while the advanced learners moved further to attend to structural differences and sometimes tested their previous knowledge against the concordance data. This study implies the need for careful guidance, collaborative group work, and strategy teaching in data-driven learning tasks.
This paper proposes data-driven techniques to forecast the time point for water management of a reservoir without measuring manganese concentration, using empirical data from Juam Dam for the years 2015 and 2016. When the manganese concentration near the water surface exceeds the criterion of 0.3 mg/L, water management measures must be taken; however, measuring manganese concentration frequently and regularly is economically inefficient. Water turnover, driven by differences in water temperature, makes manganese on the floor of the reservoir rise to the surface and increases the manganese concentration near the surface. Manganese concentration and water temperature from the surface to a depth of 20 m, at 5 m intervals, were time-plotted and exploratively analyzed, showing that water turnover can be used in place of manganese measurements to determine the time point for water management. Two models for forecasting the time point of water turnover were proposed and compared: a regression model of CR20, the consistency ratio of water temperature between the surface and the depth of 20 m, on lagged CR20 variables and the first lag of the maximum temperature; and a Box-Jenkins model of CR20 as ARIMA(2, 1, 2).
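A minimal statsmodels sketch of the Box-Jenkins model with the reported order. The simulated CR20 series and the reading that CR20 approaching 1 signals thermal uniformity (and thus imminent turnover) are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical daily CR20 series (temperature consistency ratio between
# the surface and 20 m depth); the paper uses Juam Dam measurements.
rng = np.random.default_rng(1)
cr20 = pd.Series(1.0 + np.cumsum(rng.normal(0, 0.01, 365)),
                 index=pd.date_range("2015-01-01", periods=365))

# Box-Jenkins model with the order reported in the abstract.
fit = ARIMA(cr20, order=(2, 1, 2)).fit()
forecast = fit.forecast(steps=14)   # two-week-ahead CR20 forecast

# Assumption: CR20 near 1 indicates thermal uniformity, flagging the
# approaching turnover and hence the time point for water management.
print(forecast.tail())
```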
Recent headlines predict that artificial intelligence, machine learning, predictive analytics, and other aspects of cognitive computing will be the next fundamental drivers of economic growth (Brynjolfsson & McAfee, 2017). We have witnessed several success stories in recent years, such as those of Google and Facebook, wherein novel business opportunities have evolved based on data-driven business innovations. Our directional poll among companies, however, reveals that at present only a few companies have the keys to successfully harness these possibilities. Even fewer companies seem to be successful in running a profitable business based on data-driven business innovations. A company's capability to create data-driven business relates to its overall capability to innovate. Therefore, this research builds a conceptual model of barriers to data-driven business innovations and proposes that a deeper understanding of innovation barriers can bring companies closer to the possibilities that data-driven business innovations can enable. As Hadjimanolis (2003) suggests, the first step in overcoming innovation barriers is to understand them. Consequently, we identify technology-related, organizational, environmental, and people-related (i.e., attitudinal) barriers and examine how these relate to a company's capability to create data-driven business innovations. Specifically, technology-related barriers may originate from the company's existing practices and predominant technological standards. Organizational barriers reflect the company's inability to integrate new patterns of behavior into established routines and practices (Sheth & Ram, 1987). Environmental barriers refer to various hampering factors that are external to a company; because they arise from the company's external environment, the company has relatively limited possibilities to influence and overcome them. Attitudinal barriers are people-related perceptual barriers that can be studied at the individual level and, if necessary, separately for managers and employees (Hadjimanolis, 2003). Future research will pursue an empirical model to examine how these different barriers relate to a company's capability to create business based on data-driven innovations.
Due to advances in machine intelligence and increased demand for autonomous machines, the complexity of the underlying software platforms is increasing at a rapid pace, overwhelming developers with implementation details. We attempt to ease the burden on developers by creating a graphical programming framework named Splash. Splash is designed to provide an effective programming abstraction for autonomous machines that require stream processing. It also enables programmers to specify genuine end-to-end timing constraints, which the Splash framework automatically monitors for violations. By utilizing these timing constraints, Splash provides three key language semantics: timing semantics, in-order delivery semantics, and rate-controlled data-driven stream processing semantics. Together, these three semantics serve as a conceptual tool that hides low-level details from programmers, allowing developers to focus on the main logic of their applications. In this paper, we introduce the three language semantics in detail and explain how they function in association with Splash's language constructs. Furthermore, we present the internal workings of the Splash programming framework and validate its effectiveness via a lane-keeping assist system.
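To make the idea of end-to-end timing monitoring concrete, here is a toy Python sketch. It is emphatically not Splash's API (Splash is a graphical framework); it only mirrors the concept of stamping an item at the source and checking a deadline at the sink.

```python
import time

class StreamStage:
    """One processing stage in a toy stream pipeline."""
    def __init__(self, fn):
        self.fn = fn

    def process(self, item):
        return self.fn(item)

def run_pipeline(stages, item, deadline_ms):
    """Stamp the item at the source and check the end-to-end deadline
    at the sink, the kind of check Splash performs automatically."""
    birth = time.monotonic()
    for stage in stages:
        item = stage.process(item)
    elapsed_ms = (time.monotonic() - birth) * 1000
    if elapsed_ms > deadline_ms:
        print(f"timing violation: {elapsed_ms:.1f} ms > {deadline_ms} ms")
    return item

stages = [StreamStage(lambda x: x * 2), StreamStage(lambda x: x + 1)]
run_pipeline(stages, 21, deadline_ms=10.0)
```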
Water pipe bursts are caused by excessive pressure, aging, and ground movement due to temperature changes or earthquakes. It is important to detect and respond to pipe bursts quickly so that they do not lead to more serious damage, such as large-scale water outages and sinkholes. In this study, an improved Western Electric Company (WECO) method was developed for detecting water pipe bursts. The improved WECO method adds a threshold adjuster (w) to the conventional WECO method, a statistical process control technique, so that anomaly-detection decisions can be tailored to the target network. The developed method was applied to and validated on the water distribution network of Austin, Texas, USA. The conventional and improved WECO methods were compared using abnormal data measured during pipe bursts and normal data reflecting only demand fluctuations. A sensitivity analysis was performed to determine the optimal threshold adjuster w, and the influence of various measurement intervals (dt = 5, 10, and 15 min, etc.) was also analyzed. Detection performance in each case was compared in terms of detection probability, false alarm probability, and average detection time. Based on the results, this study provides guidelines for applying the WECO method to real-world water pipe burst detection.
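An illustrative Python sketch of the idea: the classic Western Electric run rules applied to standardized flow readings, with a threshold adjuster w scaling the control limits. The paper's exact rule set and the precise role of w are assumptions here; the rules below are applied two-sided via |z| for brevity, whereas the classic rules require runs on the same side.

```python
import numpy as np

def weco_alarm(series, mean, sigma, w=1.0):
    """Return (alarmed, index, rule) for the first rule violation.

    Classic WECO rules, with limits scaled by w:
      1) one point beyond 3*w*sigma
      2) two of three consecutive points beyond 2*w*sigma
      3) four of five consecutive points beyond 1*w*sigma
      4) eight consecutive points on one side of the mean
    """
    z = (np.asarray(series, dtype=float) - mean) / (w * sigma)
    for i in range(len(z)):
        if abs(z[i]) > 3:
            return True, i, "rule 1"
        if i >= 2 and np.sum(np.abs(z[i-2:i+1]) > 2) >= 2:
            return True, i, "rule 2"
        if i >= 4 and np.sum(np.abs(z[i-4:i+1]) > 1) >= 4:
            return True, i, "rule 3"
        if i >= 7 and (np.all(z[i-7:i+1] > 0) or np.all(z[i-7:i+1] < 0)):
            return True, i, "rule 4"
    return False, None, None

# Hypothetical 5-min flow readings: normal demand, then a burst surge.
flow = [100, 101, 99, 102, 100, 98, 130, 135, 140]
print(weco_alarm(flow, mean=100, sigma=2, w=1.2))
```

Raising w widens the control limits, trading a lower false alarm probability against a longer average detection time, which is the trade-off the sensitivity analysis over w explores.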