WordCount is the classic "Hello World" program on Hadoop. CDH ships with a wordcount example that can be used to verify whether the deployment succeeded.
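Conceptually, wordcount tokenizes each line in the map phase, routes identical words to the same reducer in the shuffle, and sums the counts in the reduce phase. On a single machine the same result can be sketched with an ordinary shell pipeline (illustrative only, not how Hadoop actually runs it):

[sujx@elephant ~]$ echo "to be or not to be" | tr -s ' ' '\n' | sort | uniq -c
      2 be
      1 not
      1 or
      2 to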

# Unpack the complete works of Shakespeare prepared in advance
[sujx@elephant ~]$ gzip -d shakespeare.txt.gz

# Upload it to the Hadoop file system (HDFS)
[sujx@elephant ~]$ hdfs dfs -mkdir /user/sujx/input
[sujx@elephant ~]$ hdfs dfs -put shakespeare.txt /user/sujx/input
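
# (Optional) Confirm the upload; shakespeare.txt should be listed here.
[sujx@elephant ~]$ hdfs dfs -ls /user/sujx/input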

# List the available example programs
[sujx@elephant ~]$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount: An example job that count the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the input files.
wordmedian: A map/reduce program that counts the median length of the words in the input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
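
# (Aside) Any of these examples can serve as a smoke test; pi needs no input
# data at all, e.g. 10 map tasks with 100 samples each:
[sujx@elephant ~]$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100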

# Run the MapReduce job; the output directory is created automatically
[sujx@elephant ~]$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /user/sujx/input/shakespeare.txt /user/sujx/output/

# List the output files
[sujx@elephant ~]$ hdfs dfs -ls /user/sujx/output
Found 4 items
-rw-r--r--   3 sujx supergroup      0 2020-03-09 02:47 /user/sujx/output/_SUCCESS
-rw-r--r--   3 sujx supergroup 238211 2020-03-09 02:47 /user/sujx/output/part-r-00000
-rw-r--r--   3 sujx supergroup 236617 2020-03-09 02:47 /user/sujx/output/part-r-00001
-rw-r--r--   3 sujx supergroup 238668 2020-03-09 02:47 /user/sujx/output/part-r-00002
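
# Each part-r-* file is the output of one reducer, so this job ran with three
# reducers. The pieces can be merged into one local file for easier analysis,
# e.g. to find the most frequent words (commands are illustrative):
[sujx@elephant ~]$ hdfs dfs -getmerge /user/sujx/output wordcount.txt
[sujx@elephant ~]$ sort -k2,2nr wordcount.txt | head -5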

# Inspect the output content
[sujx@elephant ~]$ hdfs dfs -tail /user/sujx/output/part-r-00000
. 3
writhled 1
writing, 4
writings. 1
writs 1
written, 3
wrong 112
wrong'd- 1
wrong-should 1
wrong. 39
wrong: 1
wronged 11
wronged. 3
wronger, 1
wronger; 1
wrongfully? 1
wrongs 40
wrongs, 9
wrongs; 9
wrote? 1
wrought, 4
…………
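
# Note: MapReduce will not write into an existing output directory, so remove
# it before re-running. The reducer count (and hence the number of part files)
# can be set with Hadoop's generic -D option, which the example driver accepts:
[sujx@elephant ~]$ hdfs dfs -rm -r /user/sujx/output
[sujx@elephant ~]$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount -D mapreduce.job.reduces=1 /user/sujx/input/shakespeare.txt /user/sujx/output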