Description
Implement the Huffman Tree Data Structure to support the Huffman Encoding Algorithm. A discussion of the main ideas is given in Section 5.2 of Algorithms by Dasgupta et al. This data structure and algorithm will be used to encode (compress) large text files. The Huffman Tree is a binary tree where each node stores a subset of characters and their cumulative frequency.
Preliminaries
- Create project HuffmanZip
- All classes for this assignment should be put in separate files
- See Section JUnit Tests below for details on how to test
- reading: Ch. 5, pp. 152-155 | example: PDF
HNode
HNode is similar to the node class of a binary search tree and has the following data members:
- pointers to the left and right children of this node
- (single data member) the symbols that are stored in the leaves of the subtree represented by this node
- the (cumulative) frequency of the symbols stored in this node
- [ this will not be a generic class, since the type of data items is fixed ]
Create class HNode with the following methods:
HNode(char c, int f)
Creates a leaf node representing the given character and its frequency.
|
HNode(HNode left, HNode right)
Creates a node with the given left and right children.
|
boolean isLeaf()
Returns true if the node is a leaf.
|
boolean contains(char ch)
Returns true if the node contains the given character (no loops; use the relevant String method(s)).
|
char getSymbol()
Returns the symbol stored in the node. If the node is not a leaf, returns the null character '\0'.
|
String toString()
Returns a string representation of the node in the format symbols:frequency. For example, a:20 or cdah:90.
|
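The method list above can be sketched as follows. This is a minimal illustration, not a reference solution: the field names (left, right, symbols, frequency) and the choice to form a parent's symbol string as the left child's symbols followed by the right child's are assumptions that the specification does not fix.

```java
class HNode {
    HNode left, right;   // children; both null for a leaf (assumed field names)
    String symbols;      // symbols stored in the leaves of this subtree
    int frequency;       // cumulative frequency of those symbols

    // Leaf node for a single character.
    HNode(char c, int f) {
        this.symbols = String.valueOf(c);
        this.frequency = f;
    }

    // Internal node; concatenation order (left then right) is an assumption.
    HNode(HNode left, HNode right) {
        this.left = left;
        this.right = right;
        this.symbols = left.symbols + right.symbols;
        this.frequency = left.frequency + right.frequency;
    }

    boolean isLeaf() {
        return left == null && right == null;
    }

    // No loop needed: String.indexOf returns -1 when the character is absent.
    boolean contains(char ch) {
        return symbols.indexOf(ch) >= 0;
    }

    char getSymbol() {
        return isLeaf() ? symbols.charAt(0) : '\0';
    }

    @Override
    public String toString() {
        return symbols + ":" + frequency;
    }
}
```

Storing the symbols as a single String keeps contains(char) a one-line call and makes the lexicographic comparison needed later straightforward.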
HNodeComparator
Create class HNodeComparator that compares two HNode objects based on their frequencies. When the frequencies are the same, compare the symbol sets lexicographically (i.e. dictionary order; use method compareTo of the String class).
This comparator is used for constructing a Priority Queue as part of the algorithm for building a HuffmanTree. This is similar to the Binary Search Tree, which also needed a comparator in the constructor.
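One possible shape for this comparator is sketched below. To keep the example self-contained it uses a pared-down HNode with only the two fields the comparator reads; the field names symbols and frequency are assumptions.

```java
import java.util.Comparator;

// Pared-down stand-in for the real HNode (assumed field names).
class HNode {
    final String symbols;
    final int frequency;

    HNode(char c, int f) {
        this.symbols = String.valueOf(c);
        this.frequency = f;
    }
}

class HNodeComparator implements Comparator<HNode> {
    @Override
    public int compare(HNode a, HNode b) {
        // Primary key: frequency, smallest first.
        if (a.frequency != b.frequency) {
            return Integer.compare(a.frequency, b.frequency);
        }
        // Tie-break: dictionary order of the symbol strings.
        return a.symbols.compareTo(b.symbols);
    }
}
```

An instance can be passed to the java.util.PriorityQueue constructor, for example new PriorityQueue<>(new HNodeComparator()), so that poll() returns the lowest-frequency node first.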
HuffmanTree
Create a class HuffmanTree with the following methods:
data members
The Huffman Tree has only one data member, which is the root of the tree.
|
HuffmanTree(TreeMap<Character, Integer> frequencies)
Builds a Huffman Tree from the given characters and their corresponding frequencies. Look for a relevant method of the map that lets you get an iterable collection of Entries .
We are using TreeMap here, which is a map backed by a balanced (red-black) tree rather than a hash table. It offers a consistent (in fact, sorted) traversal of its keys/entries, which in turn ensures that we always get the same Huffman Tree.
Building the tree works as follows: create an HNode for each Entry and store it in a Priority Queue. Repeatedly pop two HNodes, merge them into a new HNode, and put the new node in the queue. Stop when the queue has only one item; that item is the root of the tree.
|
String encodeLoop(char symbol)
Returns the binary encoding of the given symbol as a string of '0' and '1' characters (it is assumed that the symbol is in the tree).
|
String encode(char symbol)
Returns the binary encoding of the given symbol as a string of '0' and '1' characters (it is assumed that the symbol is in the tree).
See method encode(char,HNode) .
|
String encode(char symbol, HNode curr)
(recursive) Returns the binary encoding of the given symbol as a string of '0' and '1' characters starting at the given node.
It is assumed that the symbol is in the tree. For (sub)trees with a single node, the code of the symbol is the empty string "" .
|
char decode(String code)
Returns the symbol that corresponds to the given code (or the null character '\0' if this is not a valid code).
|
void writeCode(char symbol, BitOutputStream stream)
Writes the individual bits of the binary encoding of the given symbol to the given bit stream (it is assumed that the symbol is in the tree).
This is similar to the method encodeLoop(...) but here the bits/values 1 and 0 are written to the given Bit Stream, instead of being appended to a String .
|
char readCode(BitInputStream stream)
Reads from the given stream the individual bits of the binary encoding of the next symbol and returns the corresponding character (or the null character '\0' if the bits in the stream did not lead to a symbol).
This is similar to the method decode(String) but the 1s and 0s come from the given stream, not from a String.
|
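The construction loop and the recursive encode/decode pair described above can be sketched as follows. This is an illustration under stated assumptions, not the expected solution: it assumes '0' means "go left" and '1' means "go right" (the text does not say which), uses a compact HNode with assumed field names, and builds the comparator inline instead of using a separate HNodeComparator class.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

// Compact HNode with assumed field names, included to keep the sketch self-contained.
class HNode {
    HNode left, right;
    String symbols;
    int frequency;

    HNode(char c, int f) { symbols = String.valueOf(c); frequency = f; }

    HNode(HNode l, HNode r) {
        left = l; right = r;
        symbols = l.symbols + r.symbols;
        frequency = l.frequency + r.frequency;
    }

    boolean isLeaf() { return left == null && right == null; }
    boolean contains(char ch) { return symbols.indexOf(ch) >= 0; }
}

class HuffmanTree {
    private HNode root;

    HuffmanTree(TreeMap<Character, Integer> frequencies) {
        // Order by frequency, then by symbol string (same rule as HNodeComparator).
        PriorityQueue<HNode> queue = new PriorityQueue<>(
                Comparator.comparingInt((HNode n) -> n.frequency)
                          .thenComparing((HNode n) -> n.symbols));
        for (Map.Entry<Character, Integer> e : frequencies.entrySet()) {
            queue.add(new HNode(e.getKey(), e.getValue()));
        }
        // Merge the two smallest nodes until a single root remains.
        while (queue.size() > 1) {
            HNode first = queue.poll();
            HNode second = queue.poll();
            queue.add(new HNode(first, second));
        }
        root = queue.poll(); // stays null if the map was empty
    }

    String encode(char symbol) {
        return encode(symbol, root);
    }

    // Assumes the symbol is in the subtree rooted at curr.
    String encode(char symbol, HNode curr) {
        if (curr.isLeaf()) {
            return ""; // a single-node (sub)tree has the empty code
        }
        if (curr.left.contains(symbol)) {
            return "0" + encode(symbol, curr.left);  // assumed: 0 = left
        }
        return "1" + encode(symbol, curr.right);     // assumed: 1 = right
    }

    char decode(String code) {
        HNode curr = root;
        for (char bit : code.toCharArray()) {
            if (curr.isLeaf()) return '\0';          // code runs past a leaf
            if (bit == '0') curr = curr.left;
            else if (bit == '1') curr = curr.right;
            else return '\0';                        // not a binary digit
        }
        return curr.isLeaf() ? curr.symbols.charAt(0) : '\0';
    }
}
```

With the frequencies a=5, b=2, c=1, the two rarest symbols c and b are merged first, so they end up deeper in the tree and receive longer codes than a.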
JUnit Tests
Create class HuffmanTreeTest that shows evidence of thorough testing with the following methods:
- the Tester will not have any data members and will have only two @Test methods
- make sure to put @Test in front of each method of the Tester
- create a method test_HuffmanTree():
  - create a TreeMap and fill it with some test data of characters and corresponding frequencies (similar to the class example or the book, but create your own tree with your own frequencies)
  - create the tree
  - test only methods encode(char), encodeLoop(char), decode(String) for each character in the tree, including test cases that test invalid input where relevant
- create method test_HNode() and test all HNode methods:
  - create a couple of nodes, check if they are leaves and check if they contain their characters; merge the nodes into a new parent node and check the same thing for the parent
Do not test class HuffmanZip with JUnit. This will be done in the terminal by actually running the application to compress a large file (see Section "Running from Command Line").
HuffmanZip
Create class HuffmanZip that allows the user to encode and decode a text file using the Huffman Encoding Algorithm. The data structures to consider in your implementation are:
- TreeMap: a sorted map for counting the character frequencies
- PriorityQueue: for building the Huffman Tree; you could try to make the code work with your own data structure, Binary Search Tree, used as a priority queue with the method removeMin()
Class HuffmanZip must have only static members. Below are the required methods for this class, but consider adding additional (private) methods. The guiding principle is one loop per method:
void encode(String filename)
Encodes the text file with the given name using the Huffman Encoding Algorithm.
Append the .hz extension to the name of the encoded/compressed file. For example:
war-and-peace.txt becomes war-and-peace.txt.hz
Given the name of a text file, the method produces as output a binary file as follows:
- read the given text file one character at a time to build a map of character frequencies
- build the Huffman Tree
- save the map of frequencies to the binary file
- again read the given text file one character at a time and use the Huffman Tree to write the binary code of each character to the binary file
- (see below for reading/writing regular and binary files)
For example:
wap.txt: The Project Gutenberg EBook of War and Peace... [the text input file]
wap.txt.hz: *********01010010010101010101010100010001010101111101... [the binary output file]
|the map||the binary codes of T,h,e, ,P,r,o,j,e,c,t, ...
|
void decode(String filename)
Decodes the compressed file with the given name using the Huffman Encoding Algorithm.
Replace the .hz extension with .huz to name the decoded/text file. For example:
war-and-peace.txt.hz becomes war-and-peace.txt.huz
Given the name of a binary file the method produces as output the original text file as follows:
- read the map from the binary file and build the Huffman Tree
- use the Huffman Tree to extract each character from the binary file and immediately write the character to the text file
- (see below for reading/writing regular and binary files)
For example:
|the map||the binary codes of T,h,e, ,P,r,o,j,e,c,t, ...
wap.txt.hz: *********01010010010101010101010100010001010101111101... [the binary input file]
wap.txt.huz: The Project Gutenberg EBook of War and Peace... [the text output file]
|
the standard main method
This is the standard main method. See section Test Files for the files to download, the download location, and how to check the file sizes.
Initially, inside main simply run the relevant method you want to test/execute with a fixed file name. For example:
encode("tlc-logic.txt"); // encode/compress it
decode("tlc-logic.txt.hz"); // decode/decompress it
Make sure to run HuffmanZip at least once to ensure that it works with the hardcoded values.
Then change main to use its command-line parameters, which are stored in the String[] parameter args of the main method:
- the first cell of args will contain either the string "-encode" or the string "-decode" (use .equals() )
- the second cell of args will contain the name of the file
Eventually, it should be possible to run your program from the command line as shown below in Section Executable JAR.
|
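The file naming rules and the command-line dispatch described above can be sketched as follows. The helpers encodedName and decodedName are hypothetical names introduced for illustration, and the bodies of encode and decode are placeholders that only print what they would do.

```java
class HuffmanZip {
    // Hypothetical helper: name of the compressed file for a given text file.
    static String encodedName(String filename) {
        return filename + ".hz";
    }

    // Hypothetical helper: swaps a trailing ".hz" for ".huz".
    static String decodedName(String filename) {
        if (!filename.endsWith(".hz")) {
            throw new IllegalArgumentException("expected a .hz file: " + filename);
        }
        return filename.substring(0, filename.length() - ".hz".length()) + ".huz";
    }

    static void encode(String filename) {
        // Placeholder: the real method counts frequencies, builds the tree, writes bits.
        System.out.println("would encode " + filename + " into " + encodedName(filename));
    }

    static void decode(String filename) {
        // Placeholder: the real method reads the map, rebuilds the tree, reads bits.
        System.out.println("would decode " + filename + " into " + decodedName(filename));
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: -encode|-decode <file>");
            return;
        }
        // Strings are compared with equals(), not ==.
        if (args[0].equals("-encode")) {
            encode(args[1]);
        } else if (args[0].equals("-decode")) {
            decode(args[1]);
        } else {
            System.err.println("unknown option: " + args[0]);
        }
    }
}
```

Keeping the name computation in small separate methods fits the one-loop-per-method guideline and makes the naming rule easy to check on its own.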
Reading/Writing Text Files
Read and write the regular/uncompressed text files (.txt, .huz) one byte/character at a time. There are a number of ways to accomplish this, but for this assignment use the following Java classes:
- FileInputStream: use this class for reading character by character the text file to compress; see method int read(), which reads a single byte/symbol from the stream (you will need to typecast to char)
- FileOutputStream: use this class for writing character by character the decoded text file; see method void write(int b), which writes a single byte/symbol to the stream (no need to typecast to int)
Don't forget to close the stream.
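As an illustration of byte-at-a-time reading, the sketch below counts character frequencies into a TreeMap. The class name FrequencyCounter and the method name countFrequencies are hypothetical; in the assignment this logic would live in a private static method of HuffmanZip.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.TreeMap;

class FrequencyCounter {
    // Hypothetical helper: reads the file one byte at a time and tallies each symbol.
    static TreeMap<Character, Integer> countFrequencies(String filename) throws IOException {
        TreeMap<Character, Integer> frequencies = new TreeMap<>();
        // try-with-resources closes the stream even if read() throws.
        try (FileInputStream in = new FileInputStream(filename)) {
            int b;
            // read() returns -1 at end of stream, otherwise a value in 0..255.
            while ((b = in.read()) != -1) {
                frequencies.merge((char) b, 1, Integer::sum);
            }
        }
        return frequencies;
    }
}
```

The result of read() is kept in an int until after the -1 check, because casting to char first would hide the end-of-stream marker.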
Reading/Writing Compressed Files
Read and write the compressed text files (.hz) one bit at a time. Download the following files into your project: BitOutputStream.java, BitInputStream.java
Here is the API:
- BitOutputStream: use this class to write the frequency table and the individual bits to the encoded/compressed file; see methods writeBit(int), writeObject(Object)
- BitInputStream: use this class to read the frequency table and the individual bits from the encoded/compressed file; see methods readBit(), readObject()
These classes allow you to read from (write to) the stream a data structure as one whole object using methods readObject(), writeObject(Object).
Use these methods to read/write the TreeMap from/to the binary input/output files.
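The idea of writing a whole TreeMap as one object can be illustrated with the standard library. The sketch below does not use the course-provided BitOutputStream and BitInputStream, whose constructors are not described here; it uses java.io.ObjectOutputStream and java.io.ObjectInputStream as stand-ins, and an in-memory byte array instead of a file. The class name MapRoundTrip and its methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.TreeMap;

class MapRoundTrip {
    // Serializes the whole map in one call (stand-in for writeObject on the bit stream).
    static byte[] save(TreeMap<Character, Integer> map) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(map);
        }
        return bytes.toByteArray();
    }

    // Reads the map back; readObject() returns Object, so a cast is required.
    @SuppressWarnings("unchecked")
    static TreeMap<Character, Integer> load(byte[] data)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (TreeMap<Character, Integer>) in.readObject();
        }
    }
}
```

Because the decoder rebuilds the tree from this map, the map must be written before any code bits and read back first.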
Test Files
To test your code download the following files in the HuffmanZip/ project folder (not in src/, not in bin/).
In the left panel in Eclipse click on the project name (HuffmanZip) and hit F5, i.e. refresh the project; the .txt files should show up as part of the project.
To check your work, open a terminal and go to the main project folder (HuffmanZip) and list the folder contents; the fifth column shows the file size in bytes:
cd Desktop/cs216/HuffmanZip (make sure you are in project folder)
ls -l (MacOS)
dir (WinOS)
---------- x xxxxxxx xxxx 140 xxx xx xx:xx tlc-logic.txt
---------- x xxxxxxx xxxx 668 xxx xx xx:xx tlc-logic.txt.hz
---------- x xxxxxxx xxxx 3288707 xxx xx xx:xx war-and-peace.txt
---------- x xxxxxxx xxxx 1881432 xxx xx xx:xx war-and-peace.txt.hz
^
|
size/bytes
Executable JAR
Create an executable JAR file that can be run as a standalone program. Follow steps 1-4 described here:
In Step 4 choose HuffmanZip under Launch Configuration and Browse to the project's main folder (HuffmanZip/) and save the jar under the name huffzip.jar.
The program can now be run in the terminal as follows (copy the full line):
THE_FULL_PATH_TO_JAVA -jar huffzip.jar -encode war-and-peace.txt (produces war-and-peace.txt.hz)
THE_FULL_PATH_TO_JAVA -jar huffzip.jar -decode war-and-peace.txt.hz (produces war-and-peace.txt.huz)
where THE_FULL_PATH_TO_JAVA varies based on your installation; try the following:
Check the file sizes as shown above. Here is a sample session:
Turn in a similar screenshot of your Terminal.