This assignment involves using a binary search tree (BST) to keep track of all words in a text document. It produces a cross-reference, or a concordance. This is very much like assignment 4, except that you must use a different data structure. You may use some of the code you wrote for that assignment, such as input parsing, for this one.
Remember that in a binary search tree, every value in a node's left subtree is less than that node's value and every value in its right subtree is greater. This holds at every node, not only at the root.
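A minimal sketch of that ordering rule follows. The names `Node`, `insert`, and `in_order` are illustrative and not prescribed by the assignment, and this version simply ignores a key that is already in the tree.

```python
class Node:
    """One tree node: a key plus links to the left and right subtrees."""

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key below root and return the (possibly new) root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    # Equal keys fall through: the tree already holds this key.
    return root


def in_order(root):
    """Yield keys from the left subtree, then the node, then the right subtree."""
    if root is not None:
        yield from in_order(root.left)
        yield root.key
        yield from in_order(root.right)


root = None
for word in ["tree", "binary", "search", "word", "binary"]:
    root = insert(root, word)

print(list(in_order(root)))  # ['binary', 'search', 'tree', 'word']
```

Because smaller keys always go left and larger keys always go right, an in-order walk visits the keys in sorted order, which is what makes the tree suitable for alphabetical output.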
The program will ask for the name of a text file. It will then read the file and keep track of each word in the file, the number of times it occurs, and the line numbers on which it appears. If a word occurs more than once in a line, count it more than once but do not duplicate the line number. Words in the document are separated by spaces and the following punctuation: ? , . ! ; : - (that is, question mark, comma, period, exclamation point, semicolon, colon, and hyphen). Ignore parentheses and quotation marks. Contractions such as “don’t” are considered a single word. Your test data will not contain numbers. It may contain blank lines, which count in the line numbering but are otherwise ignored because they contain no words. Plurals and variations of a word are considered different words. Ignore capitalization; “Word” and “word” are the same. Your program will exit after printing the output.
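One possible way to apply these splitting rules is sketched below. The helper name `split_words` is hypothetical, and the sketch assumes that "quotation marks" means double quotes (straight or curly), so apostrophes are left alone to keep contractions intact. Single quotes used as quotation marks would need extra handling.

```python
import re

# Separators named in the assignment: whitespace plus ? , . ! ; : -
SEPARATORS = re.compile(r"[\s?,.!;:\-]+")

# Characters dropped entirely: parentheses and double quotation marks
# (straight and curly). Apostrophes are kept so "don't" stays one word.
IGNORED = str.maketrans("", "", "()\"\u201c\u201d")


def split_words(line):
    """Return the lowercase words of one line, in order of appearance."""
    cleaned = line.translate(IGNORED).lower()
    # re.split can produce empty strings at the edges; filter them out.
    return [w for w in SEPARATORS.split(cleaned) if w]


print(split_words('Word, word: (don\'t) say "well-known" things!'))
# ['word', 'word', "don't", 'say', 'well', 'known', 'things']
print(split_words(""))  # []
```

A blank line yields an empty list, so it adds nothing to the concordance while still consuming a line number in the caller.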
Since part of this is learning how to use classes, you will have to create your own binary search tree node and binary search tree classes. You may not use those classes from the textbook nor any other source. Write only those functions you need to fulfill the assignment.
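As a rough sketch of what such classes might look like, the example below stores one word per node along with its count and line numbers. The names `WordNode`, `WordTree`, `record`, `add`, and `in_order` are illustrative choices, not required by the assignment. It assumes lines are fed in increasing order, so a repeated line number can only match the last one stored.

```python
class WordNode:
    """One distinct word, with its occurrence count and line numbers."""

    def __init__(self, word, line_no):
        self.word = word
        self.count = 1
        self.lines = [line_no]
        self.left = None
        self.right = None

    def record(self, line_no):
        """Count another occurrence; keep each line number only once."""
        self.count += 1
        # Assumes lines are processed in order, so a repeat can only
        # match the most recently stored line number.
        if self.lines[-1] != line_no:
            self.lines.append(line_no)


class WordTree:
    """Binary search tree of WordNode objects, ordered by word."""

    def __init__(self):
        self.root = None

    def add(self, word, line_no):
        """Insert a new word or update the node that already holds it."""
        if self.root is None:
            self.root = WordNode(word, line_no)
            return
        node = self.root
        while node.word != word:
            if word < node.word:
                if node.left is None:
                    node.left = WordNode(word, line_no)
                    return
                node = node.left
            else:
                if node.right is None:
                    node.right = WordNode(word, line_no)
                    return
                node = node.right
        node.record(line_no)

    def in_order(self):
        """Yield nodes in alphabetical order."""
        def walk(node):
            if node is not None:
                yield from walk(node.left)
                yield node
                yield from walk(node.right)
        return walk(self.root)


tree = WordTree()
for line_no, words in [(1, ["to", "be", "or", "not", "to", "be"]),
                       (3, ["be", "brave"])]:
    for w in words:
        tree.add(w, line_no)

for node in tree.in_order():
    print(node.word, node.count, node.lines)
# be 3 [1, 3]
# brave 1 [3]
# not 1 [1]
# or 1 [1]
# to 2 [1]
```

The insertion loop is iterative rather than recursive, which is only a design preference here; a recursive `add` would work equally well for typical text files.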
Remember how string functions work. Remember to write small “helper” functions for various tasks.
Print the text as you read it, preceded by a line number. Once you have reached the end of the file, print a blank line, then the output.
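The echo step might look something like the sketch below. The function name `echo_numbered` and the exact layout (number, one space, text) are assumptions, since the assignment does not fix the format. An in-memory `io.StringIO` stands in for the opened file so the example runs on its own.

```python
import io


def echo_numbered(lines, out=None):
    """Print each line preceded by its number; return (number, text) pairs."""
    numbered = []
    for line_no, raw in enumerate(lines, start=1):
        text = raw.rstrip("\n")
        print(line_no, text, file=out)
        numbered.append((line_no, text))
    print(file=out)  # blank line between the echoed text and the report
    return numbered


sample = io.StringIO("First line\n\nThird line\n")
result = echo_numbered(sample)
print(result)
# [(1, 'First line'), (2, ''), (3, 'Third line')]
```

Starting `enumerate` at 1 means blank lines still consume a line number, which matches the numbering rule for the concordance.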
Your output will be the words in alphabetical order, the number of times the word occurs in the file, and the line numbers on which it occurs. Separate the line numbers with a comma and a space, as shown. To make things a little more interesting, ignore the following words: the, a, an.
Sample output for the above paragraph would start thus:
alphabetical 1 1
and 1 1
…
word 1 1
words 2 1,3
your 1 1
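The report format described above could be produced along these lines. This is only a sketch: `STOP_WORDS`, `format_entry`, and `report` are hypothetical names, and the entries are plain tuples standing in for whatever an in-order traversal of the tree yields. The separator defaults to a comma and a space as the text specifies; the sample rows show the numbers without a space, so `sep` is left as a parameter.

```python
STOP_WORDS = {"the", "a", "an"}  # words the assignment says to ignore


def format_entry(word, count, lines, sep=", "):
    """Build one report row: the word, its count, then its line numbers."""
    return f"{word} {count} {sep.join(str(n) for n in lines)}"


def report(entries, sep=", "):
    """Format (word, count, lines) tuples, skipping the ignored words."""
    return [format_entry(w, c, ls, sep)
            for w, c, ls in entries
            if w not in STOP_WORDS]


# Entries are assumed to arrive already in alphabetical order.
entries = [("a", 4, [1, 2]), ("and", 1, [1]),
           ("the", 5, [1, 3]), ("words", 2, [1, 3])]

for row in report(entries):
    print(row)
# and 1 1
# words 2 1, 3
```

Filtering at output time keeps the tree code simple; skipping the ignored words before insertion would work too and would also keep them out of the totals, which the assignment leaves unspecified.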
At the end print the total number of words, the total number of unique words, and the total number of lines.
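import re

# Read the whole file once; the raw string keeps the backslash in the Windows path literal.
with open(r"C:\data.txt", "rt") as file:
    data = file.read()

# Split on whitespace and the assignment's punctuation (? , . ! ; : -),
# lowercase everything, and drop the empty strings re.split can produce.
word_list = [w for w in re.split(r"[\s?,.!;:-]+", data.lower()) if w]
print('Number of words in text file :', len(word_list))

# A set keeps one copy of each word, so its size is the unique-word count.
unique = len(set(word_list))
print('Number of unique words in text file :', unique)

# Reuse the text already read; a second file.read() would return an empty string.
# Blank lines are included because they count in the line numbering.
counter = len(data.splitlines())
print("This is the number of lines in the file")
print(counter)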