






Detecting Fake News in Python with Machine Learning Tools
Uma Sharma, Sidarth Saran, Shankar M. Patil
Department of Information Technology, Bharati Vidyapeeth College of Engineering, Navi Mumbai, India
Abstract — In our modern era where the internet is ubiquitous, everyone relies on various online resources for news. Along with the increase in the use of social media platforms like Facebook, Twitter, etc., news spreads rapidly among millions of users within a very short span of time. The spread of fake news has far-reaching consequences, from the creation of biased opinions to the swaying of election outcomes for the benefit of certain candidates. Moreover, spammers use appealing news headlines to generate revenue from advertisements via click-baits. In this paper, we aim to perform binary classification of various news articles available online with the help of concepts pertaining to Artificial Intelligence, Natural Language Processing and Machine Learning. We aim to provide the user with the ability to classify news as fake or real, and also to check the authenticity of the website publishing the news.
Keywords—Internet, Social Media, Fake News, Classification, Artificial Intelligence, Machine Learning, Websites, Authenticity.
I. INTRODUCTION
As an increasing amount of our lives is spent interacting online through social media platforms, more and more people tend to seek out and consume news from social media instead of traditional news organizations [1]. The reasons for this change in consumption behaviour are inherent in the nature of these social media platforms: (i) it is often more timely and less expensive to consume news on social media compared with traditional journalism, such as newspapers or television; and (ii) it is easier to share, comment on, and discuss the news with friends or other readers on social media. For instance, 62 percent of U.S. adults got news on social media in 2016, while in 2012 only 49 percent reported seeing news on social media [1]. It was also found that social media now outperforms television as the major news source. Despite the benefits provided by social media, the quality of news on social media is lower than that of traditional news organizations. However, because it is inexpensive to provide news online and much faster and easier to propagate it through social media, large volumes of fake news, i.e., news articles with intentionally false information, are produced online for a variety of purposes, such as financial and political gain. It was estimated that over 1 million tweets were related to the fake news story "Pizzagate" by the end of the presidential election. Given the prevalence of this new phenomenon, "fake news" was even named the word of the year by the Macquarie Dictionary in 2016 [2]. The extensive spread of fake news can have a serious negative impact on individuals and society. First, fake news can shatter the authenticity equilibrium of the news ecosystem; for instance, it is evident that the most popular fake news spread even more widely on Facebook than the most accepted genuine mainstream news during the 2016 U.S. presidential election. Second, fake news intentionally persuades consumers to accept biased or false beliefs. Fake news is typically exploited by propagandists to convey political messages or influence; for instance, some reports show that Russia has created fake accounts and social bots to spread false stories. Third, fake news changes the way people interpret and respond to real news; for instance, some fake news is created simply to trigger people's distrust and confuse them, impeding their ability to differentiate what is true from what is not. To help mitigate the negative effects caused by fake news, both to benefit the public and the news ecosystem, it is crucial that we develop methods to automatically detect fake news broadcast on social media [3]. The Internet and social media have made access to news information much easier and more convenient [2].
ISSN: 2278-
Published by, www.ijert.org
NTASU - 2020 Conference Proceedings
Volume 9, Issue 3
Internet users can often follow the events of their concern online, and the increasing number of mobile devices makes this process even easier. But with great possibilities come great challenges. Mass media have an enormous influence on society, and, as often happens, there is someone who wants to take advantage of this fact. Sometimes, to achieve certain goals, mass media may manipulate information in different ways. This results in the production of news articles that are not completely true or are even completely false. There even exist many websites that produce fake news almost exclusively. They intentionally publish hoaxes, half-truths, propaganda and disinformation claiming to be real news, often using social media to drive web traffic and amplify their effect. The main goal of fake news websites is to affect public opinion on certain matters (mostly political). Examples of such websites can be found in Ukraine, the United States of America, Germany, China and many other countries [4]. Thus, fake news is a global issue as well as a global challenge. Many scientists believe that the fake news issue can be addressed by means of machine learning and artificial intelligence [5]. There is a reason for that: recently, AI algorithms have begun to work much better on many classification problems (image recognition, voice detection and so on) because hardware is cheaper and larger datasets are available. There are several influential articles about automatic deception detection. In [6] the authors provide a general overview of the available techniques for the problem. In [7] the authors describe their method for fake news detection based on the feedback for specific news items on microblogs. In [8] the authors develop two systems for deception detection based on support vector machines and a Naive Bayes classifier (the latter is also employed in the system described in this paper) respectively.
They collected the data by asking people to directly provide true or false information on several topics: abortion, the death penalty and friendship. The detection accuracy achieved by their system is around 70%. This paper describes a simple fake news detection method based on artificial intelligence algorithms: the naïve Bayes classifier, Random Forest and Logistic Regression. The goal of the research is to examine how these particular methods work for this particular problem, given a manually labelled news dataset, and to support (or not) the idea of using AI for fake news detection. The difference between this article and articles on similar topics is that in this paper Logistic Regression was specifically used for fake news detection; also, the developed system was tested on a comparatively new dataset, which gave a chance to evaluate its performance on recent data.
A. Characteristics of Fake News:
● They often contain grammatical mistakes.
● They are often emotionally coloured.
● They often try to affect readers' opinion on some topics.
● Their content is not always true.
● They often use attention-seeking words, sensational news formats and clickbait.
● They are too good to be true.
● Their sources are not genuine most of the time [9].
II. LITERATURE REVIEW
Mykhailo Granik et al. in their paper [3] show a simple approach for fake news detection using a naive Bayes classifier. This approach was implemented as a software system and tested against a dataset of Facebook news posts. The posts were collected from three large Facebook pages each from the right and from the left, as well as three large mainstream political news pages (Politico, CNN, ABC News). They achieved a classification accuracy of approximately 74%. Classification accuracy for fake news is slightly worse, which may be caused by the skewness of the dataset: only 4.9% of it is fake news. Himank Gupta et al. [10] gave a framework based on different machine learning approaches that deals with various problems including accuracy shortage, time lag (BotMaker) and high processing time to handle thousands of tweets per second. First, they collected 400,000 tweets from the HSpam14 dataset. Then they further characterized 150,000 spam tweets and 250,000 non-spam tweets. They also derived some lightweight features along with the Top-30 words providing the highest information gain from a Bag-of-Words model. They were able to achieve an accuracy of 91.65% and surpassed the existing solution by approximately 18%. Marco L. Della Vedova et al. [11] first proposed a novel ML fake news detection method which, by combining news content and social context features, outperforms existing methods in the literature, increasing its accuracy up to 78.8%. Second, they implemented their method within a Facebook Messenger chatbot and validated it with a real-world application, obtaining a fake news detection accuracy of 81.7%. Their goal was to classify a news item as reliable or fake; they first described the datasets they used for their test, then presented the content-based approach they implemented and the method they proposed to combine it with a social-based approach available in the literature.
The resulting dataset is composed of 15,500 posts, coming from 32 pages (14 conspiracy pages, 18 scientific pages), with more than
classify the domain; it simply states that the news aggregator does not exist.
IV. IMPLEMENTATION
4.1 DATA COLLECTION AND ANALYSIS
We can get online news from different sources, such as social media websites, search engines, homepages of news agency websites or fact-checking websites. On the Internet, there are a few publicly available datasets for fake news classification, like BuzzFeed News, LIAR [15], BS Detector, etc. These datasets have been widely used in different research papers for determining the veracity of news. In the following sections, we briefly discuss the sources of the dataset used in this work.
Online news can be collected from different sources, such as news agency homepages, search engines, and social media websites. However, manually determining the veracity of news is a challenging task, usually requiring annotators with domain expertise who perform careful analysis of claims and additional evidence, context, and reports from authoritative sources. Generally, news data with annotations can be gathered in the following ways: expert journalists, fact-checking websites, industry detectors, and crowdsourced workers. However, there are no agreed-upon benchmark datasets for the fake news detection problem. Data gathered must be pre-processed, that is, cleaned, transformed and integrated, before it can undergo the training process [16]. The dataset that we used is explained below:
LIAR: This dataset is collected from the fact-checking website PolitiFact through its API [15]. It includes 12,836 human-labelled short statements, which are sampled from various contexts, such as news releases, TV or radio interviews, campaign speeches, etc. The labels for news truthfulness are fine-grained multiple classes: pants-fire, false, barely-true, half-true, mostly-true, and true.
The data source used for this project is the LIAR dataset, which contains 3 files in .csv format for test, train and validation. Below is a description of the data files used for this project.
William Yang Wang, "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection, in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), short paper, Vancouver, BC, Canada, July 30 - August 4, ACL.
Below are the columns used to create the 3 datasets that have been used in this project:
● Column 1: Statement (news headline or text).
● Column 2: Label (the label class contains: True, False).
The datasets used for this project were in CSV format, named train.csv, test.csv and valid.csv.
4.2 DEFINITIONS AND DETAILS
A. Pre-processing Data
Social media data is highly unstructured: the majority of it is informal communication with typos, slang, bad grammar, etc. [17]. The quest for increased performance and reliability has made it imperative to develop techniques for the utilization of resources to make informed decisions [18]. To achieve better insights, it is necessary to clean the data before it can be used for predictive modelling. For this purpose, basic pre-processing was done on the news training data. While reading data, we get it in a structured or unstructured format. A structured format has a well-defined pattern, whereas unstructured data has no proper structure. In between the two, a semi-structured format is comparably better structured than the unstructured format. Cleaning up the text data is necessary to highlight the attributes that we want our machine learning system to pick up on. Cleaning (or pre-processing) the data typically consists of a number of steps:
a) Remove punctuation. Punctuation can provide grammatical context to a sentence, which supports our understanding. But for a vectorizer which counts the number of words and not the context, it does not add value, so we remove all special characters. e.g.: "How are you?" -> "How are you"
b) Tokenization. Tokenizing separates text into units such as sentences or words. It gives structure to previously unstructured text. e.g.: "Plata o Plomo" -> ["Plata", "o", "Plomo"]
c) Remove stopwords. Stopwords are common words that will likely appear in any text. They don't tell us much about our data, so we remove them. e.g.: "silver or lead is fine for me" -> ["silver", "lead", "fine"]
d) Stemming.
Stemming helps reduce a word to its stem form. It often makes sense to treat related words in the same way. It removes suffixes like "ing", "ly", "s", etc. by a simple rule-based approach. This reduces the corpus of words, but the actual words often get neglected. e.g.: Entitling, Entitled -> Entitle. Note: some search engines treat words with the same stem as synonyms [18].
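The cleaning steps a) to d) can be sketched in a few lines of Python. This is a minimal illustration only: the tiny stopword list and the crude rule-based stemmer are stand-ins for the fuller resources (e.g. NLTK's English stopword list and PorterStemmer) a real pipeline would use.

```python
import re
import string

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "or", "for", "me", "and", "of", "in"}

def preprocess(text):
    # a) Remove punctuation: strip all special characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # b) Tokenization: split the text into lowercase word units.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # c) Remove stopwords: drop common words that carry little signal.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # d) Stemming: crude rule-based suffix stripping ("ing", "ly", "s").
    stems = []
    for t in tokens:
        for suffix in ("ing", "ly", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stems.append(t)
    return stems

print(preprocess("silver or lead is fine for me"))  # → ['silver', 'lead', 'fine']
```

The example reproduces the stopword illustration from the text; note that such a simple suffix rule stems "Entitling" to "entitl" rather than "entitle", which is why production systems prefer a proper stemmer.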
B. Feature Generation
We can use text data to generate a number of features like word count, frequency of large words, frequency of unique words, n-grams, etc. By creating a representation of words that captures their meanings, semantic relationships, and the numerous types of context they are used in, we can enable a computer to understand text and perform clustering, classification, etc. [19].
Vectorizing Data:
Vectorizing is the process of encoding text as integers, i.e. in numeric form, to create feature vectors so that machine learning algorithms can understand our data.
Bag of Words (BoW), or CountVectorizer, describes the presence of words within the text data. It gives a result of 1 if a word is present in the sentence and 0 if it is not. It therefore creates a bag of words with a document-term count matrix for each text document.
N-grams are simply all combinations of adjacent words or letters of length n that we can find in our source text. N-grams with n=1 are called unigrams; similarly, bigrams (n=2), trigrams (n=3) and so on can also be used. Unigrams usually don't contain as much information as bigrams and trigrams. The basic principle behind n-grams is that they capture which letter or word is likely to follow a given word. The longer the n-gram (the higher the n), the more context you have to work with [20].
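Extracting word-level n-grams takes only a line; the sketch below is illustrative and not the paper's code (scikit-learn's CountVectorizer exposes the same idea through its ngram_range parameter).

```python
def ngrams(tokens, n):
    """All runs of n adjacent tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "the quick brown fox".split()
print(ngrams(words, 1))  # unigrams: [('the',), ('quick',), ('brown',), ('fox',)]
print(ngrams(words, 2))  # bigrams:  [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```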
TF-IDF computes the "relative frequency" with which a word appears in a document compared to its frequency across all documents. The TF-IDF weight represents the relative importance of a term in the document and the entire corpus [17].
TF stands for Term Frequency: it calculates how frequently a term appears in a document. Since every document varies in size, a term may appear more often in a long document than in a short one. Thus, the term frequency is often divided by the document length.
Note: Used for search engine scoring, text summarization, document clustering.
TF(t) = (number of times term t appears in a document) / (total number of terms in the document)
IDF stands for Inverse Document Frequency: a word is not of much use if it is present in all the documents. Certain terms like "a", "an", "the", "on", "of", etc. appear many times in a document but are of little importance. IDF weighs down the importance of these terms and increases the importance of rare ones. The higher the value of IDF, the more unique the word [17].
IDF(t) = log(total number of documents / number of documents containing term t)
TF-IDF is applied to the body text, so the relative count of each word in the sentences is stored in the document matrix:
TF-IDF(t) = TF(t) × IDF(t)
Note: vectorizers output sparse matrices. A sparse matrix is a matrix in which most entries are 0 [21].
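The TF and IDF formulas above can be written out directly. A toy sketch follows; the paper itself uses scikit-learn's TfidfVectorizer, and the three example documents here are invented for illustration.

```python
import math

def tf(term, doc):
    # Term Frequency: occurrences of the term divided by document length.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse Document Frequency: log of (total documents /
    # documents containing the term). Rare terms score higher.
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    # TF-IDF(t) = TF(t) * IDF(t)
    return tf(term, doc) * idf(term, docs)

docs = [["fake", "news", "spreads"],
        ["real", "news"],
        ["fake", "story"]]
# "news" occurs in 2 of 3 documents, so it is weighed down;
# "spreads" occurs in only 1, so it carries more weight.
print(tf_idf("news", docs[0], docs), tf_idf("spreads", docs[0], docs))
```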
C. Algorithms used for Classification
This section deals with training the classifier. Different classifiers were investigated to predict the class of the text. We specifically explored four different machine learning algorithms: Multinomial Naïve Bayes, Random Forest, Passive Aggressive Classifier and Logistic Regression. The implementations of these classifiers were done using the Python library Scikit-Learn.
Brief introduction to the algorithms-
This classification technique is based on Bayes' theorem, which assumes that the presence of a particular feature in a class is independent of the presence of any other feature. It provides a way of calculating the posterior probability.
P(c|x) = P(x|c) × P(c) / P(x), where:
P(c|x) = posterior probability of the class given the predictor
P(c) = prior probability of the class
P(x|c) = likelihood (probability of the predictor given the class)
P(x) = prior probability of the predictor
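A toy multinomial Naïve Bayes built directly on Bayes' theorem might look as follows. This is an illustrative sketch with an invented four-document corpus, not the authors' implementation (they used Scikit-Learn's MultinomialNB); Laplace (+1) smoothing is added so unseen words don't zero out the posterior, and the shared P(x) term is dropped because it does not affect which class wins.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Fit class priors P(c) and per-class word counts for a
    multinomial Naive Bayes model. docs is a list of token lists."""
    classes = sorted(set(labels))
    priors = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, y in zip(docs, labels):
        counts[y].update(doc)
    vocab = {w for doc in docs for w in doc}
    return priors, counts, vocab

def predict_nb(doc, priors, counts, vocab):
    """Return the class with the highest log-posterior:
    log P(c) + sum of log P(w|c), with Laplace smoothing."""
    best_class, best_logp = None, -math.inf
    for c, prior in priors.items():
        total = sum(counts[c].values())
        logp = math.log(prior)
        for w in doc:
            logp += math.log((counts[c][w] + 1) / (total + len(vocab)))
        if logp > best_logp:
            best_class, best_logp = c, logp
    return best_class

# Invented toy corpus for illustration.
docs = [["free", "prize", "click"], ["election", "results", "report"],
        ["click", "win", "prize"], ["official", "election", "news"]]
labels = ["fake", "real", "fake", "real"]
model = train_nb(docs, labels)
print(predict_nb(["click", "prize"], *model))  # → fake
```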
It takes a news article as input from the user; the model is then used to produce the final classification output, which is shown to the user along with the probability of it being true.
B. Dynamic Search Implementation-
Our dynamic implementation contains 3 search fields:
● Search by article content.
● Search using key terms.
● Search for a website in the database.
For the first search field, we have used Natural Language Processing to come up with a proper solution to the problem, and hence we have attempted to create a model which can classify fake news according to the terms used in newspaper articles. Our application uses NLP techniques like CountVectorization and TF-IDF Vectorization before passing the article through a Passive Aggressive Classifier, which outputs the authenticity of the article as a percentage probability.
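The Passive Aggressive Classifier stays "passive" on examples it already classifies with a sufficient margin and makes an "aggressive" correction on violations. A minimal sketch of the standard update rule follows; the application itself would use Scikit-Learn's PassiveAggressiveClassifier rather than this hand-rolled version.

```python
def pa_update(w, x, y):
    """One Passive-Aggressive step for a linear classifier.
    y is +1 or -1. If the hinge loss max(0, 1 - y * w.x) is zero,
    the weights are left alone (passive); otherwise they are moved
    just enough to satisfy the margin constraint (aggressive)."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - y * score)
    if loss > 0.0:
        tau = loss / sum(xi * xi for xi in x)  # step size
        w = [wi + tau * y * xi for wi, xi in zip(w, x)]
    return w

w = pa_update([0.0, 0.0], [1.0, 0.0], 1)
print(w)  # → [1.0, 0.0]
# A second pass over the same example now has margin 1: no update.
print(pa_update(w, [1.0, 0.0], 1))  # → [1.0, 0.0]
```

This online behaviour is what makes the algorithm attractive for a stream of incoming articles: each example is processed once and discarded.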
The second search field of the site asks for specific keywords to be searched on the net, upon which it provides a suitable output for the percentage probability of that term actually being present in an article, or in a similar article with those keyword references in it.
The third search field of the site accepts a specific website domain name, upon which the implementation looks for the site in our true-sites database or the blacklisted-sites database. The true-sites database holds the domain names which regularly provide proper and authentic news, and vice versa. If the site isn't found in either of the databases, then the implementation doesn't classify the domain; it simply states that the news aggregator does not exist.
Working-
The problem can be broken down into 3 statements:
● Use NLP to check the authenticity of a news article.
● If the user has a query about the authenticity of a search term, he/she can directly search on our platform, and using our custom algorithm we output a confidence score.
● Check the authenticity of a news source.
These sections have been produced as search fields to take inputs in 3 different forms in our implementation of the problem statement.
To evaluate the performance of algorithms for the fake news detection problem, various evaluation metrics have been used. In this subsection, we review the most widely used metrics for fake news detection. Most existing approaches consider the fake news problem as a classification problem that predicts whether a news article is fake or not:
True Positive (TP): predicted fake news pieces that are actually fake news;
True Negative (TN): predicted true news pieces that are actually true news;
False Negative (FN): predicted true news pieces that are actually fake news;
False Positive (FP): predicted fake news pieces that are actually true news.
Confusion Matrix: A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm. A confusion matrix is a summary of prediction results on a classification problem. The numbers of correct and incorrect predictions are summarized with count values and broken down by class. This is the key to the confusion matrix: it shows the ways in which your classification model is confused when it makes predictions. It gives us insight not only into the errors being made by a classifier, but more importantly into the types of errors that are being made [26].
Table 1: Confusion Matrix

                     Class 1 (Predicted)   Class 2 (Predicted)
Class 1 (Actual)     TP                    FN
Class 2 (Actual)     FP                    TN
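The four counts can be tallied from paired label lists in a few lines; a sketch, with "fake" assumed to be the positive class and the five-example lists invented for illustration:

```python
def confusion_counts(actual, predicted, positive="fake"):
    """Tally (TP, FN, FP, TN) over paired actual/predicted labels,
    treating `positive` as the positive class."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    tn = sum(1 for a, p in zip(actual, predicted) if a != positive and p != positive)
    return tp, fn, fp, tn

actual    = ["fake", "fake", "true", "true", "fake"]
predicted = ["fake", "true", "true", "fake", "fake"]
print(confusion_counts(actual, predicted))  # → (2, 1, 1, 1)
```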
By formulating this as a classification problem, we can define the following metrics:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
These metrics are commonly used in the machine learning community and enable us to evaluate the performance of a classifier from different perspectives. Specifically, accuracy measures the similarity between predicted fake news and real fake news.
A. Static System-
Figure 3: Static output (True)
Figure 4: Static Output (False)
B. Dynamic System-
Figure 5: Fake News Detector (Home Screen)
Figure 6: Fake News Detector (Output page)
Implementation was done using the above algorithms with vector features: Count Vectors and TF-IDF Vectors at word level and n-gram level. Accuracy was noted for all models. We used the k-fold cross-validation technique to improve the effectiveness of the models.
A. Dataset split using K-fold cross validation
This cross-validation technique was used for splitting the dataset randomly into k folds. (k-1) folds were used for building the model, while the k-th fold was used to check the effectiveness of the model. This was repeated until each of the k folds had served as the test set. We used 3-fold cross-validation for this experiment, where 67% of the data is used for training the model and the remaining 33% for testing.
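The k-fold split described above can be sketched as follows. Contiguous folds are used here for simplicity; the actual experiment split the data randomly, which Scikit-Learn's KFold provides via shuffle=True.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists: the dataset of size n is split
    into k folds, and each fold serves once as the test set while the
    remaining (k-1) folds are used for training."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# With k=3 each round trains on ~67% of the data and tests on ~33%,
# matching the 3-fold setup used in the experiment.
for train_idx, test_idx in k_fold_indices(6, 3):
    print(train_idx, test_idx)
```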
B. Confusion Matrices for Static System
After applying the various extracted features (Bag-of-Words, TF-IDF, n-grams) to three different classifiers (Naïve Bayes, Logistic Regression and Random Forest), their confusion matrices showing the actual and predicted sets are given below:
Table 2: Confusion Matrix for Naïve Bayes Classifier using TF-IDF features (Total = 10240)

Naïve Bayes       Fake (Predicted)   True (Predicted)
Fake (Actual)     841                3647
True (Actual)     427                5325
Table 3: Confusion Matrix for Logistic Regression using TF-IDF features (Total = 10240)

Logistic Regression   Fake (Predicted)   True (Predicted)
Fake (Actual)         1617               2871
True (Actual)         1097               4655
Table 4: Confusion Matrix for Random Forest Classifier using TF-IDF features (Total = 10240)

Random Forest     Fake (Predicted)   True (Predicted)
Fake (Actual)     1979               2509
True (Actual)     1630               4122
Table 5: Comparison of Precision, Recall, F1-scores and Accuracy for all three classifiers

Classifiers           Precision   Recall   F1-Score   Accuracy
Naïve Bayes           0.59        0.92     0.72       0.60
Random Forest
Logistic Regression
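The scores reported for Naïve Bayes can be recovered from its confusion matrix using the metric formulas. The sketch below takes the cells of Table 2 and assumes "True" is the positive class; this convention is an assumption, but it is the one that reproduces the reported precision/recall/F1 of roughly 0.59/0.92/0.72.

```python
# Cells of Table 2 (Naïve Bayes, TF-IDF features), with "True" taken
# as the positive class -- an assumption made for this illustration.
tp, fn = 5325, 427    # actual True:  predicted True / predicted Fake
fp, tn = 3647, 841    # actual Fake:  predicted True / predicted Fake
total = tp + fn + fp + tn  # 10240, as reported

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / total

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.59 recall=0.93 f1=0.72 accuracy=0.60
```

Recall rounds to 0.93 here while the table reports 0.92; the small discrepancy is likely just truncation versus rounding. The same calculation applied to Tables 3 and 4 yields the missing Logistic Regression and Random Forest rows.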
Engineering (IJRTE) ISSN: 2277-3878, Volume-7, Issue-6, March 2019
[10] H. Gupta, M. S. Jamal, S. Madisetty and M. S. Desarkar, "A framework for real-time spam detection in Twitter," 2018 10th International Conference on Communication Systems & Networks (COMSNETS), Bengaluru, 2018, pp. 380-
[11] M. L. Della Vedova, E. Tacchini, S. Moret, G. Ballarin, M. DiPierro and L. de Alfaro, "Automatic Online Fake News Detection Combining Content and Social Signals," 2018 22nd Conference of Open Innovations Association (FRUCT), Jyvaskyla, 2018, pp. 272-279.
[12] C. Buntain and J. Golbeck, "Automatically Identifying Fake News in Popular Twitter Threads," 2017 IEEE International Conference on Smart Cloud (SmartCloud), New York, NY, 2017, pp. 208-215.
[13] S. B. Parikh and P. K. Atrey, "Media-Rich Fake News Detection: A Survey," 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), Miami, FL, 2018, pp. 436-
[14] Scikit-Learn: Machine Learning in Python.
[15] Dataset: William Yang Wang, "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection. arXiv preprint arXiv:1705.00648, 2017.
[16] Shankar M. Patil, Dr. Praveen Kumar, "Data mining model for effective data analysis of higher education students using MapReduce", IJERMT, April 2017 (Volume-6, Issue-4).
[17] Aayush Ranjan, "Fake News Detection Using Machine Learning", Department of Computer Science & Engineering, Delhi Technological University, July
[18] Patil S.M., Malik A.K. (2019) Correlation Based Real-Time Data Analysis of Graduate Students Behaviour. In: Santosh K., Hegadi R. (eds) Recent Trends in Image Processing and Pattern Recognition. RTIP2R 2018. Communications in Computer and Information Science, vol 1037. Springer, Singapore.
[19] Badreesh Shetty, "Natural Language Processing (NLP) for machine learning" at towardsdatascience, Medium.
[20] NLTK 3.5b1 documentation, NLTK n-gram generation.
[21] Ultimate guide to deal with Text Data (using Python) – for Data Scientists and Engineers by Shubham Jain, February 27, 2018
[22] Understanding the random forest by Anirudh Palaparthi, Jan 28, at Analytics Vidhya.
[23] Understanding the random forest by Anirudh Palaparthi, Jan 28, at Analytics Vidhya.
[24] Shailesh Dhama, "Detecting-Fake-News-with-Python", GitHub, 2019.
[25] Aayush Ranjan, "Fake News Detection Using Machine Learning", Department of Computer Science & Engineering, Delhi Technological University, July
[26] What is a Confusion Matrix in Machine Learning by Jason Brownlee on November 18, 2016 in Code Algorithms From Scratch