Website Classification: Classifying Websites into Industries with Python and Machine Learning

In this tutorial I want to walk through extracting, cleaning, and classifying websites into different categories. I will use a Python environment to run the data-scraping code and a neural network to classify the websites.

Text classification is one of the natural language processing (NLP) tasks used widely across data science. An efficient text classifier can automatically sort data into categories using NLP algorithms. Text classification is an example of a supervised machine learning task, because a labelled dataset containing text documents and their labels is used to train the classifier. Some common techniques for text classification are:

- Naive Bayes classifier
- Linear classifier
- Support vector machine
- Bagging models
- Boosting models
- Deep neural networks

Web scraping, web harvesting, or web data extraction is used to extract data from websites. It is usually done with software that simulates human browsing to collect specific pieces of information from different websites. Some techniques that can be used for web scraping are:

- Manual copy-and-paste
- Text pattern matching
- HTTP programming
- HTML parsing
- DOM parsing
- Vertical aggregation
- Semantic annotation recognition
- Computer-vision web page analysis

In this tutorial, we will implement the complete model as three separate modules:

1. Data scraping
2. Keyword-based classification, used to create the training dataset
3. Applying a neural network and testing the actual model

Module 1: Data scraping

In this module I will use a Python 3.5 environment to implement the scripts.

Step 1: Request data from the website

Many different packages can be used to extract web data, but in this tutorial I will use requests.

    import requests

    url = 'https://medium.com/'
    try:
        page = requests.get(url)   # request the page from the website
        html_code = page.text      # extract the HTML code from the page as a string
    except Exception as e:
        print(e)

In the code above, the requests.get() method requests the page from the website over HTTPS and loads it into the object page. The next line stores the HTML code in the string html_code. So far we have extracted data from the website, but it is still HTML, which is quite different from the actual text.

Step 2: Extract text from the HTML page

To extract the complete text data from an HTML page there are two very popular packages, BeautifulSoup and html2text. Using the html_code string obtained in the previous step, we can apply either of the following two methods.

    from bs4 import BeautifulSoup

    try:
        soup = BeautifulSoup(html_code, 'html.parser')   # parse the HTML code
        texts = soup.findAll(text=True)                  # find all visible text
        text_from_html = ' '.join(texts)                 # join all text pieces
    except Exception as e:
        print(e)

In the snippet above, the BeautifulSoup package parses the HTML code and assigns the result to the soup object. The findAll() function finds all visible text in the code and returns a list of strings, which we store in texts. Finally, we use join() to combine the individual pieces of text into one common string.

    import html2text

    h = html2text.HTML2Text()   # initialize the converter object
    h.ignore_links = True       # skip links while converting
    try:
        text = h.handle(html_code)                 # convert the HTML code to text
        text_from_html = text.replace("\n", " ")   # replace newline characters with spaces
    except Exception as e:
        print(e)

In this block we use the html2text package to parse the string and obtain the text directly from the HTML code. We also replace the newline characters with spaces and finally obtain text_from_html.

Similarly, we can loop over 1,000+ URLs, extract the text from each of those sites, and store the results in CSV (comma-separated values) format, which we can use further in the classification module; a minimal sketch of such a loop is shown below.
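The article does not show this loop explicitly, so here is a minimal sketch of it, assuming the target sites are kept in a Python list called url_list and the results are written to scraped_data.csv; both names are illustrative, not from the original article.

    import csv
    import requests
    from bs4 import BeautifulSoup

    url_list = ['https://medium.com/', 'https://example.com/']   # illustrative list of sites to scrape

    with open('scraped_data.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['url', 'text'])                  # header row
        for url in url_list:
            try:
                page = requests.get(url, timeout=10)      # request the page
                soup = BeautifulSoup(page.text, 'html.parser')
                texts = soup.findAll(text=True)           # all visible text nodes
                text_from_html = ' '.join(texts)
                writer.writerow([url, text_from_html])    # one row per website
            except Exception as e:
                print(url, e)                             # skip sites that fail to load

Each row of the resulting CSV holds the URL and the full visible text of one website, which is exactly the input the keyword-based classifier in the next module needs.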
Module 2: Keyword-based classification

For any machine learning algorithm we need a training set and a test set to train the model and measure its accuracy. To create that dataset, we already have the text from the different websites; we will classify it by keywords and then use the results in the next module.

In this tutorial we will classify websites into three categories:

- Technology, office and education product websites (Class_1)
- Consumer product websites (Class_2)
- Industrial tools and hardware product websites (Class_3)

The approach here is that we have certain keywords belonging to each category; we match those keywords against the text and pick the class with the largest matching value:

    Matching_value = (number of keywords matched for one industry) / (total number of keywords matched)

So we have keyword lists as follows:

    Class_1_keywords = ['Office', 'School', 'phone', 'Technology', 'Electronics', 'Cell', 'Business', 'Education', 'Classroom']
    Class_2_keywords = ['Restaurant', 'Hospitality', 'Tub', 'Drain', 'Pool', 'Filtration', 'Floor', 'Restroom', 'Consumer', 'Care', 'Bags', 'Disposables']
    Class_3_keywords = ['Pull', 'Lifts', 'Pneumatic', 'Emergency', 'Finishing', 'Hydraulic', 'Lockout', 'Towers', 'Drywall', 'Tools', 'Packaging', 'Measure', 'Tag']

    keywords = Class_1_keywords + Class_2_keywords + Class_3_keywords

Now we will use KeywordProcessor to find these keywords in the text received from each URL. KeywordProcessor is available in the flashtext package on PyPI.

    from flashtext.keyword import KeywordProcessor

    kp0 = KeywordProcessor()
    for word in keywords:
        kp0.add_keyword(word)

    kp1 = KeywordProcessor()
    for word in Class_1_keywords:
        kp1.add_keyword(word)

    kp2 = KeywordProcessor()
    for word in Class_2_keywords:
        kp2.add_keyword(word)

    kp3 = KeywordProcessor()
    for word in Class_3_keywords:
        kp3.add_keyword(word)

In the code above we load the KeywordProcessor objects with the keywords, which we will use further to find matches. To express Matching_value as a percentage, we define a function percentage1; the Python implementation is shown below:

    def percentage1(dum0, dumx):
        try:
            ans = float(dumx) / float(dum0)
            ans = ans * 100
        except:
            return 0
        else:
            return ans

We now use the extract_keywords(string) method to find the keywords present in the text; the length of the returned list gives the number of matching keywords. The following Python function computes the percentages and selects the class with the maximum percentage:

    def find_class(text_from_html):
        x = str(text_from_html)
        y0 = len(kp0.extract_keywords(x))   # total keyword matches
        y1 = len(kp1.extract_keywords(x))   # Class_1 matches
        y2 = len(kp2.extract_keywords(x))   # Class_2 matches
        y3 = len(kp3.extract_keywords(x))   # Class_3 matches
        Total_matches = y0
        per1 = float(percentage1(y0, y1))
        per2 = float(percentage1(y0, y2))
        per3 = float(percentage1(y0, y3))
        if y0 == 0:
            Category = 'None'
        else:
            if per1 >= per2 and per1 >= per3:
                Category = 'Class_1'
            elif per2 >= per3 and per2 >= per1:
                Category = 'Class_2'
            elif per3 >= per1 and per3 >= per2:
                Category = 'Class_3'
        return Category

Looping over all websites with the above function, we can assign a category to every site based on its keywords. We save the classified data to a file, Data.csv, which we will use further; a minimal sketch of this labelling loop is shown below.
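The labelling loop itself is not shown in the article, so here is a minimal sketch of it, assuming the scraped text sits in the scraped_data.csv file produced by the scraping sketch above; the file name and column names are illustrative, not from the original article.

    import pandas as pd

    scraped = pd.read_csv('scraped_data.csv')                              # columns: url, text (illustrative)
    scraped['Category'] = scraped['text'].astype(str).apply(find_class)   # keyword-based label for each site
    scraped.to_csv('Data.csv', index=False)                               # training data for the neural-network module
    print(scraped['Category'].value_counts())                             # quick look at the class balance

Printing the class counts is a cheap sanity check: if almost every site ends up as 'None', the keyword lists are too narrow for the scraped sites.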
So now we have the dataset ready for applying a neural network for classification.

Module 3: Applying a neural network

In the following implementation we will create a neural network from scratch and use the NLTK word tokenizer for preprocessing. First, we need to import the dataset obtained from the steps above and load it into a list.

    import pandas as pd

    data = pd.read_csv('Data.csv')
    data = data[pd.notnull(data['text'])]    # drop rows with missing text
    data = data[data.Category != 'None']     # drop sites that matched no keywords

The code above loads and cleans the classified data; NULL values are removed. The following code builds a dictionary of each text against its class:

    training_data = []
    for index, row in data.iterrows():
        training_data.append({"class": row["Category"], "sentence": row["text"]})

To apply a neural network we need to turn the language words into mathematical symbols that can be used for computation. We form a list of all the words appearing across all the strings:

    import nltk
    from nltk.stem.lancaster import LancasterStemmer

    stemmer = LancasterStemmer()

    words = []
    classes = []
    documents = []
    ignore_words = ['?']

    # loop through each sentence in our training data
    for pattern in training_data:
        # tokenize each word in the sentence
        w = nltk.word_tokenize(pattern['sentence'])
        # add to our words list
        words.extend(w)
        # add to documents in our corpus
        documents.append((w, pattern['class']))
        # add to our classes list
        if pattern['class'] not in classes:
            classes.append(pattern['class'])

    # stem and lower each word and remove duplicates
    words = [stemmer.stem(w.lower()) for w in words if w not in ignore_words]
    words = list(set(words))

    # remove duplicates
    classes = list(set(classes))

    print(len(documents), "documents")
    print(len(classes), "classes", classes)
    print(len(words), "unique stemmed words")

For example, the output will be:

    1594 documents
    3 classes ['Class_1', 'Class_3', 'Class_2']
    40000 unique stemmed words

Now we create a list of tokenized words for each pattern and build a bag of words using the NLTK Lancaster stemmer:

    # create our training data
    training = []
    output = []
    # create an empty array for our output
    output_empty = [0] * len(classes)

    # training set, bag of words for each sentence
    for doc in documents:
        # initialize our bag of words
        bag = []
        # list of tokenized words for the pattern
        pattern_words = doc[0]
        # stem each word
        pattern_words = [stemmer.stem(word.lower()) for word in pattern_words]
        # create our bag of words array
        for w in words:
            bag.append(1) if w in pattern_words else bag.append(0)
        training.append(bag)
        # output is a '0' for each tag and '1' for the current tag
        output_row = list(output_empty)
        output_row[classes.index(doc[1])] = 1
        output.append(output_row)

    print("# words", len(words))
    print("# classes", len(classes))

Output:

    # words 41468
    # classes 3

Now we do the final preprocessing of the data and define some helper functions.

Sigmoid function:

    import numpy as np

    def sigmoid(x):
        output = 1 / (1 + np.exp(-x))
        return output

    # convert output of sigmoid function to its derivative
    def sigmoid_output_to_derivative(output):
        return output * (1 - output)

Cleaning function:

    def clean_up_sentence(sentence):
        # tokenize the pattern
        sentence_words = nltk.word_tokenize(sentence)
        # stem each word
        sentence_words = [stemmer.stem(word.lower()) for word in sentence_words]
        return sentence_words

Bag-of-words function:

    def bow(sentence, words, show_details=False):
        # tokenize the pattern
        sentence_words = clean_up_sentence(sentence)
        # bag of words
        bag = [0] * len(words)
        for s in sentence_words:
            for i, w in enumerate(words):
                if w == s:
                    bag[i] = 1
                    if show_details:
                        print("found in bag: %s" % w)
        return np.array(bag)

The final function used in the neural network, the think function:

    def think(sentence, show_details=False):
        x = bow(sentence.lower(), words, show_details)
        if show_details:
            print("sentence:", sentence, "\n bow:", x)
        # input layer is our bag of words
        l0 = x
        # matrix multiplication of input and hidden layer
        l1 = sigmoid(np.dot(l0, synapse_0))
        # output layer
        l2 = sigmoid(np.dot(l1, synapse_1))
        return l2

Now we are ready to train the neural network model. We implement it from scratch, with a logistic (sigmoid) activation in every neuron. With only one hidden layer but 50,000 epochs, we train the model; the complete training run executes on the CPU.

    import datetime
    import json

    folder_path = ''   # directory in which the trained weights will be saved

    def train(X, y, hidden_neurons=10, alpha=1, epochs=50000, dropout=False, dropout_percent=0.5):

        print("Training with %s neurons, alpha:%s, dropout:%s %s" % (hidden_neurons, str(alpha), dropout, dropout_percent if dropout else ''))
        print("Input matrix: %sx%s    Output matrix: %sx%s" % (len(X), len(X[0]), 1, len(classes)))
        np.random.seed(1)

        last_mean_error = 1
        # randomly initialize our weights with mean 0
        synapse_0 = 2 * np.random.random((len(X[0]), hidden_neurons)) - 1
        synapse_1 = 2 * np.random.random((hidden_neurons, len(classes))) - 1

        prev_synapse_0_weight_update = np.zeros_like(synapse_0)
        prev_synapse_1_weight_update = np.zeros_like(synapse_1)

        synapse_0_direction_count = np.zeros_like(synapse_0)
        synapse_1_direction_count = np.zeros_like(synapse_1)

        for j in iter(range(epochs + 1)):

            # feed forward through layers 0, 1, and 2
            layer_0 = X
            layer_1 = sigmoid(np.dot(layer_0, synapse_0))

            if dropout:
                layer_1 *= np.random.binomial([np.ones((len(X), hidden_neurons))], 1 - dropout_percent)[0] * (1.0 / (1 - dropout_percent))

            layer_2 = sigmoid(np.dot(layer_1, synapse_1))

            # how much did we miss the target value?
            layer_2_error = y - layer_2

            if (j % 10000) == 0 and j > 5000:
                # if this 10k iteration's error is greater than the last iteration, break out
                if np.mean(np.abs(layer_2_error)) < last_mean_error:
                    print("delta after " + str(j) + " iterations:" + str(np.mean(np.abs(layer_2_error))))
                    last_mean_error = np.mean(np.abs(layer_2_error))
                else:
                    print("break:", np.mean(np.abs(layer_2_error)), ">", last_mean_error)
                    break

            # in what direction is the target value?
            # were we really sure? if so, don't change too much.
            layer_2_delta = layer_2_error * sigmoid_output_to_derivative(layer_2)

            # how much did each l1 value contribute to the l2 error (according to the weights)?
            layer_1_error = layer_2_delta.dot(synapse_1.T)

            # in what direction is the target l1?
            # were we really sure? if so, don't change too much.
            layer_1_delta = layer_1_error * sigmoid_output_to_derivative(layer_1)

            synapse_1_weight_update = (layer_1.T.dot(layer_2_delta))
            synapse_0_weight_update = (layer_0.T.dot(layer_1_delta))

            if j > 0:
                synapse_0_direction_count += np.abs(((synapse_0_weight_update > 0) + 0) - ((prev_synapse_0_weight_update > 0) + 0))
                synapse_1_direction_count += np.abs(((synapse_1_weight_update > 0) + 0) - ((prev_synapse_1_weight_update > 0) + 0))

            synapse_1 += alpha * synapse_1_weight_update
            synapse_0 += alpha * synapse_0_weight_update

            prev_synapse_0_weight_update = synapse_0_weight_update
            prev_synapse_1_weight_update = synapse_1_weight_update

        now = datetime.datetime.now()

        # persist synapses
        synapse = {'synapse0': synapse_0.tolist(), 'synapse1': synapse_1.tolist(),
                   'datetime': now.strftime("%Y-%m-%d %H:%M"),
                   'words': words,
                   'classes': classes}
        synapse_file = "synapses.json"

        with open(folder_path + synapse_file, 'w') as outfile:
            json.dump(synapse, outfile, indent=4, sort_keys=True)
        print("saved synapses to:", synapse_file)
Finally, we train the model:

    import time

    X = np.array(training)
    y = np.array(output)

    start_time = time.time()

    train(X, y, hidden_neurons=10, alpha=0.1, epochs=50000, dropout=False, dropout_percent=0.2)

    elapsed_time = time.time() - start_time
    print("processing time:", elapsed_time, "seconds")

Output:

    Training with 10 neurons, alpha:0.1, dropout:False
    Input matrix: 1594x41468    Output matrix: 1x3
    delta after 10000 iterations:0.0665105275385
    delta after 20000 iterations:0.0610711168863
    delta after 30000 iterations:0.0561908365355
    delta after 40000 iterations:0.0533465919346
    delta after 50000 iterations:0.0461560407785
    saved synapses to: synapses.json
    processing time: 33060.51151227951 seconds

As we can see, training the model took roughly nine hours. After this intensive computation, we are ready to test the model.

Function to test the data:

    import json

    # probability threshold
    ERROR_THRESHOLD = 0.2

    # load our calculated synapse values
    synapse_file = 'synapses.json'
    with open(synapse_file) as data_file:
        synapse = json.load(data_file)
        synapse_0 = np.asarray(synapse['synapse0'])
        synapse_1 = np.asarray(synapse['synapse1'])

    def classify(sentence, show_details=False):
        results = think(sentence, show_details)
        results = [[i, r] for i, r in enumerate(results) if r > ERROR_THRESHOLD]
        results.sort(key=lambda x: x[1], reverse=True)
        return_results = [[classes[r[0]], r[1]] for r in results]
        # print("\n classification: %s" % (return_results))
        return return_results

Let's test the accuracy of the model:

    classify("Switchboards Help KA36200 About Us JavaScript seems to be disabled in your browser You must have JavaScript enabled in your browser to utilize the functionality of this website Help Shopping Cart 0 00 You have no items in your shopping cart My Account My Wishlist My Cart My Quote Log In BD Electrical Worldwide Supply Remanufacturing the past SUSTAINING THE FUTURE Hours and Location Michigan Howell")

Output:

    [['Class_3', 0.97663437888614435]]

As you can see, we obtain fairly high accuracy on such tests. For a model with only one hidden layer, an accuracy of around 95% or more is considered very good. For further classification with different models we could use Keras or TensorFlow; a minimal sketch of an equivalent Keras classifier is given at the end of this article. To reduce the training time of the model, we could use an NVIDIA GPU. We can now easily scrape website data and classify its category with the help of a deep neural network trained with back propagation.
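As a pointer for the Keras route mentioned above, here is a minimal sketch of an equivalent single-hidden-layer classifier built on the bag-of-words arrays training and output from Module 3. The layer sizes, optimizer, and epoch count are illustrative choices, not taken from the original article.

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    X = np.array(training)
    y = np.array(output)

    model = Sequential([
        Dense(10, activation='sigmoid', input_shape=(X.shape[1],)),   # hidden layer, 10 neurons as in the from-scratch model
        Dense(y.shape[1], activation='softmax')                       # one output neuron per class
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(X, y, epochs=20, batch_size=32, validation_split=0.1)   # far fewer epochs than the 50,000 used above

A library implementation like this typically trains in minutes rather than hours, and model.predict() on a bag-of-words vector plays the role of the think()/classify() pair.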

