![[mllib] How to use the PySpark KMeans algorithm](https://img1.daumcdn.net/thumb/R750x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdna%2FtzEI7%2FbtryEyqO1h2%2FAAAAAAAAAAAAAAAAAAAAADYI0zs0_jH3hR8Xb6OShwWJawf6BBVkTRGNPBTUTpcC%2Fimg.png%3Fcredential%3DyqXZFxpELC7KVnFOS48ylbz2pIh7yKj8%26expires%3D1751295599%26allow_ip%3D%26allow_referer%3D%26signature%3Dl8rDBOV%252Bdg5FjqCK%252BChHDwHTQT4%253D)
This post covers how to use the KMeans algorithm in PySpark MLlib (on Databricks). 1. Library setup: `import pandas as pd`; `from pyspark.sql import *` (Spark SQL built-ins); `from pyspark.sql.types import *` (Spark SQL data-structure types); `from pyspark.sql.functions import *` (assorted Spark SQL functions); `from pyspark.ml.feature import MinMaxScaler, VectorAssembler` (scaler, plus VectorAssembler, which gathers multiple feature values into a single vector). MLlib operations require the data to be converted to a vector structure, so `from pyspark.m..`
![[Pyspark] from pyspark.sql import * VS from pyspark.sql.functions import *](https://img1.daumcdn.net/thumb/R750x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdna%2Fbu0YXE%2FbtryE2ZuPLs%2FAAAAAAAAAAAAAAAAAAAAABvkIvNkB56cWpKlDRlvVkuXoWvbuSXOUPAWcROIvLUl%2Fimg.png%3Fcredential%3DyqXZFxpELC7KVnFOS48ylbz2pIh7yKj8%26expires%3D1751295599%26allow_ip%3D%26allow_referer%3D%26signature%3Dicp%252F3k%252B99tgLEHGwBYNAI6MJf1I%253D)
Today we look at the methods inside pyspark.sql. 0. import pyspark.sql https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#module-pyspark.sql.functions pyspark.sql module — PySpark 2.4.0 documentation Parameters: path – string, or list of strings, for input path(s), or RDD of Strings storing CSV rows. schema – an optional pyspark.sql.types.StructType for the input schema or a DDL-formatted string..