The RDD is a concept abstracted by Spark, but it is one you have to understand.
## What Is an RDD
An RDD (resilient distributed dataset) is Spark's core abstraction: put simply, a distributed collection of elements.
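For a first look, here is a minimal sketch, assuming the Spark shell with its pre-built SparkContext `sc` (the value names are made up for illustration):

```scala
// Create an RDD from an in-memory collection.
val nums = sc.parallelize(List(1, 2, 3, 4))

// Or create one from a text file, one element per line.
val lines = sc.textFile("README.md")
```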
## RDD Operations
- Transformations: do not perform any actual computation.

```scala
scala> val lines = sc.textFile("README.md")
lines: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at <console>:2

scala> val pythonLines = lines.filter(line => line.contains("Python"))
pythonLines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:27
```

- Actions: perform the actual computation.
```scala
scala> pythonLines.count()
res5: Long = 3

scala> pythonLines.take(10).foreach(line => println(line))
high-level APIs in Scala, Java, Python, and R, and an optimized engine that
## Interactive Python Shell
Alternatively, if you prefer Python, you can use the Python shell:
```

- Lazy evaluation: computation does not start until an action is called (demonstrated in the sketch below).
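Laziness is easy to observe: a transformation on a nonexistent file returns immediately, and the error only surfaces once an action forces the computation. A sketch, assuming the Spark shell (the file name is deliberately bogus):

```scala
// textFile is a transformation, so no I/O happens here and this line
// succeeds even though the file does not exist.
val ghost = sc.textFile("no-such-file.txt")

// The action triggers the actual read, so the failure shows up here
// (typically an InvalidInputException about a missing input path).
ghost.count()
```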
## Common Operations
- Common transformations (exercised in the sketch after these lists):

```
filter
map
flatMap
distinct
```

- Pseudo set operations:
```
union
intersection
subtract
cartesian
```

- Actions:
```
collect
count
take
top
reduce
fold
aggregate
foreach
```
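As promised above, a sketch that exercises a few of these operations on toy data (the RDD names `a` and `b` are made up for illustration):

```scala
val a = sc.parallelize(List(1, 2, 2, 3))
val b = sc.parallelize(List(3, 4))

// Transformations: each returns a new RDD; nothing runs yet.
val doubled = a.map(x => x * 2)        // 2, 4, 4, 6
val uniques = a.distinct()             // 1, 2, 3 (order not guaranteed)
val both    = a.union(b)               // keeps duplicates: 1, 2, 2, 3, 3, 4
val common  = a.intersection(b)        // 3 (also removes duplicates)
val diff    = a.subtract(b)            // 1, 2, 2
val pairs   = a.cartesian(b)           // (1,3), (1,4), (2,3), ...

// Actions: these trigger the computation and return results to the driver.
println(doubled.collect().mkString(","))   // collect the whole RDD
println(a.reduce((x, y) => x + y))         // 8
println(a.take(2).mkString(","))           // first two elements
```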
## Persistence

Because RDDs are lazily evaluated, the whole computation is redone every time an action is called, so caching an RDD you reuse is usually worthwhile. There are several persistence levels; for example:
```scala
import org.apache.spark.storage.StorageLevel

val result = input.map(x => x * x)
result.persist(StorageLevel.DISK_ONLY)
println(result.count())
println(result.collect().mkString(","))
```

If the data does not fit in memory, it is written to disk; Spark evicts cached partitions with an LRU algorithm.
```
MEMORY_ONLY
MEMORY_AND_DISK
DISK_ONLY
```
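Two related details worth noting, as a sketch reusing `input` from the example above: `cache()` is shorthand for `persist(StorageLevel.MEMORY_ONLY)`, and `unpersist()` evicts an RDD from the cache by hand instead of waiting for the LRU policy:

```scala
// A fresh RDD ("input" reused from the example above).
val squares = input.map(x => x * x)

// cache() is equivalent to persist(StorageLevel.MEMORY_ONLY).
squares.cache()
println(squares.count())

// Manually remove the cached partitions instead of waiting for LRU eviction.
squares.unpersist()
```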