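The global maritime industry is advancing rapidly with the emergence of autonomous ship technology, and interest in applying artificial intelligence derived from maritime data is growing. Among these developments, ship route clustering is emerging as a key technology for commercializing autonomous ships. Route clustering extracts ship route patterns at sea, which supports optimization of the fastest and safest routes and provides a basis for developing collision avoidance systems. High-quality, well-processed data are essential to ensure the accuracy and efficiency of route clustering algorithms. Among the various route clustering methods, this study focuses on clustering based on ship route similarity, which can accurately reflect the actual shape and characteristics of routes. To maximize the efficiency of this approach, we seek the optimal combination of data preprocessing techniques. Specifically, we combined four ship route similarity measures with three dimensionality reduction methods. For each combination, we performed k-means cluster analysis and evaluated the results quantitatively with the Silhouette Index to identify the best-performing combination of preprocessing techniques. Beyond finding the optimal preprocessing technique, this study emphasizes the importance of the process of extracting meaningful information from extensive maritime data. It is significant as a foundational study for responding effectively to the digital transformation facing the maritime and shipping industries in the era of the Fourth Industrial Revolution.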
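The route-similarity step can be illustrated with a minimal sketch. The abstract does not name its four similarity measures, so the Hausdorff distance used here is an assumption chosen only for illustration, as are the planar (x, y) coordinates and the sample routes. The resulting pairwise matrix is the kind of input that a dimensionality reduction method and k-means could then consume.

```python
import math

def point_dist(p, q):
    """Euclidean distance between two planar points (assumed x/y, not lat/lon)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def directed_hausdorff(a, b):
    """Largest distance from any point of route a to its nearest point on route b."""
    return max(min(point_dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two routes."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

def similarity_matrix(routes):
    """Pairwise route distances; smaller values mean more similar routes."""
    n = len(routes)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = hausdorff(routes[i], routes[j])
    return d

# Hypothetical routes: two nearly identical tracks and one distant track.
routes = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    [(0.0, 0.1), (1.0, 0.1), (2.0, 0.1)],
    [(0.0, 5.0), (1.0, 5.0), (2.0, 5.0)],
]
D = similarity_matrix(routes)
```

In this toy input the first two routes are far closer to each other than to the third, which is the structure a subsequent clustering step would be expected to recover.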
PURPOSES : In this study, a preliminary study on the optimal clustering techniques for the preprocessing of pavement management system (PMS) data was conducted using K-means and mean-shift techniques to improve the correlation between the dependent and independent variables of the pavement performance model. METHODS : The PMS data of Jeju Island were preprocessed using the K-means and mean-shift algorithms. For the K-means method, the elbow method and silhouette score were used to determine the optimal number of clusters (K). For the mean-shift method, Scott's rule of thumb and Silverman's rule of thumb were used to determine the optimal cluster bandwidth. RESULTS : The optimal cluster sets were selected for the rut depth (RD), annual average daily traffic (AADT), and annual maximum temperature (AMT) for each clustering technique, and their similarities with the original data were investigated. Additionally, the improvement in correlation between the dependent and independent variables was investigated by calculating the clustering score (CS). Consequently, the K-means method was selected as the optimal clustering technique for the preprocessing of PMS data. The K-means method improved the correlations of more variables with the dependent variable than the mean-shift method did. The correlations of the variables related to high temperature (such as the annual temperature change, summer days, and heat wave days) improved when the AMT, a climate factor, was used as an independent variable in the K-means clustering method. CONCLUSIONS : The applicability of the clustering methods to the preprocessing of PMS data was identified in this study. Improvements over the pavement performance prediction model developed using traditional statistical methods may be identified by developing a model using clustering techniques in a future study.
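The two bandwidth rules mentioned for mean-shift can be sketched as follows. The constants follow one common one-dimensional convention for these rules of thumb; the abstract does not state which exact form was used, so the formulas, function names, and sample values here are assumptions for illustration only.

```python
import statistics

def scott_bandwidth(values):
    """One common 1-D form of Scott's rule: 1.06 * sigma * n^(-1/5)."""
    n = len(values)
    sigma = statistics.stdev(values)
    return 1.06 * sigma * n ** (-1 / 5)

def silverman_bandwidth(values):
    """One common 1-D form of Silverman's rule:
    0.9 * min(sigma, IQR / 1.34) * n^(-1/5)."""
    n = len(values)
    sigma = statistics.stdev(values)
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    spread = min(sigma, (q3 - q1) / 1.34)
    return 0.9 * spread * n ** (-1 / 5)

# Hypothetical rut depth readings (mm), not taken from the study.
rut_depth = [3.1, 4.0, 4.2, 5.5, 6.1, 6.3, 7.8, 8.0, 9.4, 12.5]
h_scott = scott_bandwidth(rut_depth)
h_silverman = silverman_bandwidth(rut_depth)
```

Under these forms the Silverman bandwidth is never larger than the Scott bandwidth, so it tends to produce narrower kernels and therefore more clusters when used as the mean-shift window.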
PURPOSES : Local governments in Korea, including Incheon city, have introduced the pavement management system (PMS). However, verifying the repair time and repair section of roads remains difficult owing to the lack of a systematic data acquisition system. Data refinement using various techniques is commonly performed when analyzing statistical data in diverse fields. In this study, clustering is used to analyze PMS data, and correlation analysis is conducted between pavement performance and influencing factors.
METHODS : First, the clustering type was selected. Representative clustering types include K-means, mean shift, and density-based spatial clustering of applications with noise (DBSCAN). In this study, data purification was performed using DBSCAN. Because of the difficulty in determining a threshold for high-dimensional data, multiple clustering, which is a type of DBSCAN, was applied, and the number of clustering passes was set to two. Clustering for the surface distress (SD), rut depth (RD), and international roughness index (IRI) was performed twice using the number of frost days, the highest temperature, and the average temperature, respectively.
RESULTS : The clustering result shows that the correlation between the SD and number of frost days improved significantly. The correlation between the maximum temperature factor and precipitation factor, which does not indicate multicollinearity, improved. Meanwhile, the correlation between the RD and highest temperature improved significantly. The correlation between the minimum temperature factor and precipitation factor, which does not exhibit multicollinearity, improved considerably. The correlation between the IRI and average temperature improved as well. The correlation between the low- and high-temperature precipitation factors, which does not indicate multicollinearity, improved.
CONCLUSIONS : The result confirms the possibility of applying clustering to refine PMS data and that the correlation among the pavement performance factors improved. However, when applying clustering to PMS data refinement, the limitations must be identified and addressed. Furthermore, clustering may be applicable to the purification of PMS data using AI.
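How DBSCAN can act as a data purification step is sketched below: points that do not belong to any dense region are labeled as noise and can be dropped before correlation analysis. This is a textbook DBSCAN written from scratch, not the multiple-clustering variant used in the study; the `eps` and `min_pts` values and the sample points are hypothetical.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point, keep expanding
                queue.extend(j_neighbors)
    return labels

# Hypothetical (climate factor, performance index) pairs with one outlier.
points = [(1.0, 1.0), (1.1, 1.0), (0.9, 1.1), (1.0, 0.9),
          (5.0, 5.0), (5.1, 5.0), (4.9, 5.1), (5.0, 4.9),
          (10.0, 0.0)]
labels = dbscan(points, eps=0.5, min_pts=3)
refined = [p for p, lab in zip(points, labels) if lab != NOISE]
```

Running the filter a second time on `refined`, possibly with different parameters, would be one way to approximate a two-pass refinement, although the study's exact procedure is not described in the abstract.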
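Identifying collision risk within a water area is important for navigational safety. This study introduces a new collision risk assessment method that incorporates hierarchical clustering, a clustering method based on distance factors. When many ships are nearby, ships are grouped using real-time data, a grouping methodology, and a preliminary assessment, and are then evaluated on the basis of a collision risk assessment (referred to as HCAAP processing). Clusters of encountering ships were formed by the hierarchical procedure and combined with the preliminary assessment to filter out relatively safe ships. Then, within each cluster, the distance at closest point of approach (DCPA) and the time to closest point of approach (TCPA) between encountering ships were calculated, and their relationship with the collision risk index (CRI) was compared. Within the clusters of encountering ships, the mathematical relationship between the CRI and the DCPA and TCPA took the form of a negative exponential function. With this CRI, operators can more easily assess the safety of all ships navigating a specified sea area, and the framework can improve the safety and security of maritime transport and reduce loss of life and property. To illustrate the effectiveness of the proposed framework, an experimental case study was conducted in the coastal waters of Mokpo, Korea. The results showed that the framework is effective and efficient at detecting and ranking the collision risk index between encountering ships within each cluster, and that it enables automatic risk prioritization for further study.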
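The DCPA and TCPA quantities can be computed from relative position and relative velocity under a constant-velocity assumption. The sketch below uses planar coordinates in nautical miles and velocities in knots, so TCPA comes out in hours; the function name, units, and sample encounters are assumptions, and the CRI formula itself is not reproduced because the abstract does not give its parameters.

```python
import math

def cpa(own_pos, own_vel, tgt_pos, tgt_vel):
    """Return (DCPA, TCPA) for two ships moving at constant velocity.

    Positions are (x, y) in nautical miles, velocities (vx, vy) in knots.
    A negative TCPA means the closest point of approach is already past.
    """
    rx, ry = tgt_pos[0] - own_pos[0], tgt_pos[1] - own_pos[1]   # relative position
    vx, vy = tgt_vel[0] - own_vel[0], tgt_vel[1] - own_vel[1]   # relative velocity
    speed_sq = vx * vx + vy * vy
    if speed_sq == 0.0:
        # Same velocity: the separation never changes.
        return math.hypot(rx, ry), 0.0
    tcpa = -(rx * vx + ry * vy) / speed_sq
    dcpa = math.hypot(rx + vx * tcpa, ry + vy * tcpa)
    return dcpa, tcpa

# Head-on style encounter with a 2 nm lateral offset.
dcpa_a, tcpa_a = cpa((0.0, 0.0), (0.0, 10.0), (2.0, 10.0), (0.0, -10.0))
# Crossing encounter on a collision course.
dcpa_b, tcpa_b = cpa((0.0, 0.0), (0.0, 10.0), (5.0, 5.0), (-10.0, 0.0))
```

In the first encounter the ships pass 2 nm apart after half an hour; in the second the DCPA is zero, which is the kind of pair a risk index would rank highest.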
The K-means algorithm is one of the most popular and widely used clustering methods because it is easy to implement and very efficient. However, it can be used only with a fixed number of clusters, because it considers only the intra-cluster distance when evaluating clustering solutions. The silhouette is a useful and stable validity index for choosing a clustering solution and its number of clusters, because it considers both intra-cluster and inter-cluster distances for unsupervised data. However, this validity index carries a high computational burden, because it computes a quality measure for every data object. The objective of this paper is to propose a fast and simple speed-up method that overcomes this limitation so that the silhouette can be used for effective large-scale data clustering. In the first step, the proposed method calculates and saves the distances between data objects once. In the second step, this distance matrix is used to calculate the relative distance rate (Vj) of each data object j, and this rate is used to choose a suitable number of clusters without much computation time. In the third step, an efficient heuristic algorithm (group search optimization, GSO, in this paper) searches for the global optimum while saving computation, using good initial solutions generated probabilistically from Vj. Experiments and analysis on the Ruspini, Iris, Wine, and Breast Cancer datasets from the UCI Machine Learning Repository show that the proposed method saves significant computation time compared with using the original silhouette alone. In particular, the proposed method performs much better than the previous method on larger datasets.
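The idea of computing the distances once and reusing them can be sketched with a plain silhouette evaluation over a cached distance matrix. This only illustrates the first step and the standard silhouette definition; the relative distance rate Vj and the GSO search are not reproduced, and the sample points and labelings are hypothetical.

```python
import math

def distance_matrix(points):
    """Compute all pairwise Euclidean distances once so they can be reused."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def silhouette(dist, labels):
    """Mean silhouette over all objects, using a precomputed distance matrix.

    Requires at least two clusters. Objects in singleton clusters score 0.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # s(i) = 0 by convention
        # a: mean distance to the other members of the same cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
D = distance_matrix(points)              # computed once
good = silhouette(D, [0, 0, 1, 1])       # reuses D
bad = silhouette(D, [0, 1, 0, 1])        # reuses D again
```

Because `D` is built once, comparing many candidate labelings or cluster counts costs only lookups and sums rather than repeated distance computations.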
Data clustering is one of the most difficult and challenging problems and can be formally considered a particular kind of NP-hard grouping problem. The K-means algorithm is one of the most popular and widely used clustering methods because it is easy to implement and very efficient. However, it is likely to become trapped in a local optimum, and its solutions vary widely with different initial values on large datasets. Therefore, efficient computational intelligence methods are needed to find the global optimal solution to the data clustering problem within limited computation time. The objective of this paper is to propose a combined artificial bee colony (CABC) method that uses K-means for initialization and finalization to find optimal solutions to the data clustering optimization problem. The artificial bee colony (ABC) algorithm is motivated by the intelligent behavior of honeybees searching for food. ABC performs as well as or better than other population-based algorithms, with the added advantage of using fewer control parameters. The proposed CABC method provides near-optimal solutions within reasonable time by balancing converged and diversified searches. Experiments and analysis on clustering problems demonstrate that CABC is competitive with previous partitioning approaches in terms of solution quality. We validate the performance of CABC on the Iris, Wine, Glass, Vowel, and Cloud datasets from the UCI Machine Learning Repository against previous studies. In our simulations, the proposed KABCK (K-means+ABC+K-means) outperforms ABCK (ABC+K-means), KABC (K-means+ABC), ABC, and K-means.
Data clustering determines groups of patterns in a dataset using a similarity measure and is one of the most important and difficult techniques in data mining. Clustering can be formally considered a particular kind of NP-hard grouping problem. The K-means algorithm, which is popular and efficient, is sensitive to initialization and can become stuck in a local optimum because it is a hill-climbing clustering method. It is also not computationally feasible in practice, especially for large datasets and large numbers of clusters. Therefore, a robust and efficient clustering algorithm is needed to find the global optimum rather than a local optimum, especially now that large amounts of data are collected from many Internet of Things (IoT) devices. The objective of this paper is to propose a new hybrid simulated annealing (HSA) method that combines simulated annealing with K-means for non-hierarchical clustering of big data. Simulated annealing (SA) is useful for diversified search in a large search space, and K-means is useful for converged search in a predetermined search space. The proposed method balances intensification and diversification to find the global optimal solution in big data clustering. The performance of HSA is validated on the Iris, Wine, Glass, and Vowel datasets from the UCI Machine Learning Repository against previous studies. In our simulations, the proposed KSAK (K-means+SA+K-means) and SAK (SA+K-means) outperform KSA (K-means+SA), SA, and K-means. The method significantly improves accuracy and efficiency in finding the global optimal clustering solution for complex, real-time, and costly data mining processes.
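The K-means+SA+K-means ordering can be illustrated with a small sketch: K-means supplies a starting solution, simulated annealing perturbs the centers to escape local optima, and K-means refines the best solution found. The neighborhood move (Gaussian jitter of one center), the cooling schedule, and all parameter values are assumptions for illustration and are not taken from the paper.

```python
import math
import random

def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)

def kmeans(points, centers, iters=50):
    """Lloyd's algorithm started from the given centers."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            groups[idx].append(p)
        # Keep the old center when a cluster becomes empty.
        new = [tuple(sum(x) / len(g) for x in zip(*g)) if g else c
               for g, c in zip(groups, centers)]
        if new == centers:
            break
        centers = new
    return centers

def ksak(points, k, seed=0, steps=200, t0=1.0, cooling=0.97, step_size=1.0):
    """K-means initialization, SA over centers, K-means finalization."""
    rng = random.Random(seed)
    current = kmeans(points, rng.sample(points, k))       # K-means initialization
    cur_cost = sse(points, current)
    best, best_cost = current, cur_cost
    temp = t0
    for _ in range(steps):                                 # simulated annealing
        cand = list(current)
        i = rng.randrange(k)
        cand[i] = tuple(x + rng.gauss(0, step_size) for x in cand[i])
        cost = sse(points, cand)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if cost < cur_cost or rng.random() < math.exp((cur_cost - cost) / temp):
            current, cur_cost = cand, cost
            if cost < best_cost:
                best, best_cost = cand, cost
        temp *= cooling
    final = kmeans(points, best)                           # K-means finalization
    return final, sse(points, final)

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, cost = ksak(points, 2)
```

Because the best solution seen is tracked separately and K-means never increases the squared error, the final cost is no worse than that of the initial K-means run from the same starting centers.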
As the Internet has spread widely and its technology has advanced, computer use has become routine and almost all data are stored on computers. Accordingly, many companies and researchers have tried to find relationships in this tremendous volume of data. One way is to use a clustering algorithm, which finds sets of similar data within the entire dataset and discovers their common properties. Early clustering algorithms ran in a computer's main memory; a representative example is PAM (Partitioning Around Medoids), which compensates for the weaknesses of the k-means algorithm. PAM performs clustering using the medoids of the data instead of the means. PAM works well on small datasets but is difficult to apply to large ones. CLARA (Clustering LARge Applications) was therefore introduced for large datasets. This algorithm draws samples from a large dataset and applies PAM to the sampled data. CLARA is limited by the fixed samples used in each clustering stage, and the clustering result is poor if good medoids are not sampled. CLARANS (Clustering Large Applications based upon RANdomized Search) overcomes these problems by drawing samples with some randomness. This algorithm performs clustering using the set of k medoids extracted during the clustering process at each stage. The main objective of this paper is to compare and analyze the algorithms popularly used for clustering big data.
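The medoid-swapping idea behind PAM can be sketched as follows. This is a simplified version that starts from the first k objects instead of PAM's BUILD phase and accepts the first improving swap; those choices, the function names, and the sample points are assumptions for illustration.

```python
import math

def total_cost(dist, medoids):
    """Sum over all objects of the distance to the nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))

def pam(dist, k):
    """Simplified PAM: swap medoids with non-medoids while the cost decreases."""
    n = len(dist)
    medoids = list(range(k))  # naive start (assumption; BUILD phase omitted)
    best = total_cost(dist, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < best:  # strict decrease guarantees termination
                    medoids, best, improved = trial, cost, True
    labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
    return medoids, labels, best

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
dist = [[math.dist(p, q) for q in points] for p in points]
medoids, labels, cost = pam(dist, 2)
```

Each medoid is an actual data object, which is what distinguishes this approach from k-means. A CLARA-style variant would call `pam` on random samples of the data and keep the medoid set with the lowest cost over the full dataset.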
A new algorithm is proposed to detect reflected light regions, which act as disturbances in a real-time vision system. There have been several previous attempts to detect reflected light regions. The conventional mathematical approach requires many complex processes and is therefore not suitable for a real-time vision system. On the other hand, when a simple detection process is applied, the reflected light region cannot be detected accurately. Therefore, detecting reflected light regions in a real-time vision system requires a new algorithm that is as simple and accurate as possible. To extract the reflected light, the proposed algorithm adopts several filter equations and clustering processes in the HSI (hue, saturation, intensity) color space. The proposed algorithm also uses predefined reflected light data generated through the clustering processes to keep the algorithm simple. To demonstrate the effectiveness of the proposed algorithm, it was applied to several images containing reflected regions, and the reflected regions were detected successfully.
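Working in the HSI color space starts with converting each RGB pixel, which can be sketched with the standard conversion formulas. The candidate test that follows is only a hypothetical stand-in for the paper's filter equations, which the abstract does not specify: it flags bright, weakly saturated pixels, and its threshold values are assumptions.

```python
import math

def rgb_to_hsi(r, g, b):
    """Convert RGB in [0, 1] to (hue in degrees, saturation, intensity)."""
    intensity = (r + g + b) / 3.0
    if intensity == 0.0:
        return 0.0, 0.0, 0.0  # black: hue and saturation undefined, use 0
    saturation = 1.0 - min(r, g, b) / intensity
    num = 0.5 * ((r - g) + (r - b))
    den = math.sqrt(max(0.0, (r - g) ** 2 + (r - b) * (g - b)))
    if den == 0.0:
        hue = 0.0  # gray pixel: hue undefined, use 0
    else:
        theta = math.degrees(math.acos(max(-1.0, min(1.0, num / den))))
        hue = theta if b <= g else 360.0 - theta
    return hue, saturation, intensity

def is_reflection_candidate(r, g, b, s_max=0.2, i_min=0.8):
    """Hypothetical filter: bright and nearly colorless pixels."""
    _, s, i = rgb_to_hsi(r, g, b)
    return s <= s_max and i >= i_min

white_is_candidate = is_reflection_candidate(1.0, 1.0, 1.0)
red_is_candidate = is_reflection_candidate(1.0, 0.0, 0.0)
```

Separating intensity from hue and saturation is what makes a simple threshold of this kind possible, since a specular highlight mostly raises intensity and washes out saturation.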
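The purpose of this study is to compare and verify existing player type theories using data clustering. The study used 235 result records from students enrolled in a very large lecture course held at University A in the second semester of 2016. We applied K-means together with the silhouette evaluation technique to determine an appropriate number of clusters. The player types applied were Bartle's two-dimensional and three-dimensional player types, Ferro's five types, and BrainHex. According to the results, Bartle's two-dimensional player types were the most suitable from a data clustering perspective. We also interpreted the distribution of characteristics for each player type. These results are expected to influence the player analysis used when applying gamification or studying development processes.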
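Rehabilitation projects for aging water pipes are carried out under long-term plans because of various constraints, such as budget and construction. This study uses approximately 8,000 leak location coordinates recorded in the study area between 1992 and 1997 to perform hierarchical cluster analysis of the spatial correlation among leak locations. Among hierarchical clustering methods, single linkage, complete linkage, and average linkage were applied to partition the study area according to the spatial correlation of leak locations, and the zones partitioned by each clustering method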