
ABSTRACT

Systems and software being developed in the current scenario have a high probability of being attacked, so the need for tools that prevent such attacks has increased. Many ideas for detecting vulnerabilities are already in use. In this project we present an approach to detecting vulnerabilities in text-based scenarios, such as a mailing system, with improved accuracy, and to preventing vulnerable content from being sent to others. Cross-validation and ensembling are used to achieve the improved accuracy. The dependent (target) variables are chosen so that the models are obtained efficiently. The improved efficiency is measured by the accuracy on the testing dataset and is demonstrated in this phase of the project. Ensembling and the prevention of vulnerable data at the source are to be demonstrated in the next phase.


TABLE OF CONTENTS

CHAPTER NO    TITLE

              ABSTRACT
              LIST OF FIGURES
1.            INTRODUCTION
              1.1 OVERVIEW OF THE PROJECT
              1.2 PROBLEM STATEMENT
              1.3 CHALLENGES AND SCOPE
              1.4 ORGANIZATION OF THE REPORT
2.            LITERATURE SURVEY
              2.1 REVIEW
3.            SYSTEM ANALYSIS
              3.1 EXISTING SYSTEM
              3.2 PROPOSED SYSTEM
4.            DESIGN AND IMPLEMENTATION
              4.1 OVERALL DESCRIPTION
              4.2 ARCHITECTURE DIAGRAM
              4.3 LIST OF MODULES
                  4.3.1 PRE-PROCESSING
                  4.3.2 CLUSTERING
                  4.3.3 TUNING
                  4.3.4 TRAINING
                  4.3.5 EVALUATION
5.            DEVELOPMENT ENVIRONMENT
              HARDWARE REQUIREMENTS
              SOFTWARE REQUIREMENTS
6.            RESULTS AND DISCUSSION
              6.1 EVALUATION METRIC
              6.2 ANALYSIS OF RESULTS
7.            CONCLUSION AND FUTURE WORK
8.            OUTPUT OF MODULES
9.            REFERENCES

LIST OF FIGURES

FIGURE    NAME OF FIGURE

4.2       ARCHITECTURE DIAGRAM
8.1       DATASETS
8.2       TEST DATA
8.3       TRAIN DATA
8.4       MODEL
8.5       TUNEGRID
8.6       RESULT

CHAPTER 1

INTRODUCTION

1.1 OVERVIEW OF THE PROJECT

A massive number of web applications and services have been deployed in financial and banking services, government, healthcare, retail and many other fields. This is because web applications and services offer important advantages, including accessibility from different locations and devices, enhanced user interaction, and improved quality of the services provided to users. In most of these applications, developers focus on usability and functionality while security usually comes as an afterthought, a situation which increases the number of vulnerabilities in web applications.

As the statistics indicate, it is hard to develop fully reliable software. Thus, it is important to test software components to increase the level of assurance that they are free of security vulnerabilities. However, testing resources such as testers and time are limited. Moreover, most vulnerable components stem from unsafe function calls and the improper handling of user input, which increases the difficulty of vulnerability discovery. For example, in PHP, using an unfiltered $_GET or $_POST value in a database query might allow a malicious user to execute an SQL injection attack, while calling the echo function on unvalidated user input might expose a cross-site scripting (XSS) vulnerability. To solve this problem, many models and tools have been developed to predict vulnerabilities in a software component. Typically such methods depend on parsing the code, are limited to fixed and very small patterns, and hardly adapt to variations. Static analysis methods, which are also used for vulnerability detection, have high false positive and false negative rates in the detection phase. A wide variety of data mining and machine learning techniques has been used to improve the ability to predict web application vulnerabilities. For instance, feature extraction and classification are used to predict whether an SQL injection vulnerability resides in the software or not. Additionally, machine learning methods are used to increase the ability to cover a wide range of malicious web code.
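To make the unsafe pattern concrete, the following is a minimal sketch in Python rather than PHP, since the flaw is language-independent; the table name, column name and sqlite3 usage are illustrative assumptions, not part of this project.

    import sqlite3

    def find_user_unsafe(conn: sqlite3.Connection, username: str):
        # Vulnerable: user input is interpolated directly into the SQL string,
        # the same flaw as passing an unfiltered $_GET/$_POST value to a query.
        query = f"SELECT * FROM users WHERE name = '{username}'"
        return conn.execute(query).fetchall()  # input "' OR '1'='1" dumps every row

    def find_user_safe(conn: sqlite3.Connection, username: str):
        # Parameterized query: the driver escapes the value, defeating the injection.
        return conn.execute("SELECT * FROM users WHERE name = ?", (username,)).fetchall()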

 

                                                                                       

 

 

1.2 PROBLEM STATEMENT

Recent developments in text mining have reached an accuracy rate of 90%, which indicates that there is still room for improvement, including a reduction of the false positive rate. Moreover, ordinary single-model classifiers do not give exact results, and the false positive rate grows with the number of records in the dataset. The truly accurate classifiers need to be identified in order to get better results.

Hence this project adopts ensembling, which combines the various methods and models available in data mining, and also prevents vulnerable data at the source using text mining techniques.
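A minimal sketch, in Python with scikit-learn, of what such an ensemble could look like. The report does not name its base models, so the three classifiers below are placeholders, and soft voting is one reasonable combination rule among several.

    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    # Soft voting averages the class probabilities of several base models,
    # so no single classifier's false positives dominate the final decision.
    ensemble = VotingClassifier(
        estimators=[
            ("nb", GaussianNB()),                        # placeholder base model
            ("lr", LogisticRegression(max_iter=1000)),   # placeholder base model
            ("dt", DecisionTreeClassifier(max_depth=5)), # placeholder base model
        ],
        voting="soft",
    )
    # Usage: ensemble.fit(X_train, y_train); ensemble.score(X_test, y_test)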

 

1.3 CHALLENGES AND SCOPE

·  The accuracy achieved through this project is 94%, which can be increased further.

·  The classifiers considered can be changed to improve efficiency further.

·  The proposed project is limited to text mining, so other mining techniques, such as spatial and correlation techniques, can still be applied.

 

1.4 ORGANIZATION OF THE REPORT

The rest of the report is organized as follows. The second chapter presents the literature survey. The third chapter contains the system analysis, and the fourth chapter gives the detailed design and implementation. The subsequent chapters provide a detailed analysis of the results, followed by the conclusion and future work.

CHAPTER 2

LITERATURE SURVEY

2.1 REVIEW     

The Symantec Corporation security report statistics for 2015 showed that 78% of websites have at least one vulnerability, and 15% of the vulnerabilities are critical ones. The statistics from the WhiteHat report for 2016 showed that the average number of vulnerabilities per site is 23, of which 13 are critical. It also showed that vulnerabilities stay open for a very long time: critical vulnerabilities have an average age of 300 days. These results indicate that web applications still contain many vulnerabilities. A study was conducted to find out why there are so many vulnerabilities in web applications. The main reason is building stateful applications on the stateless infrastructure of the web. Web servers are designed to be stateless, and HTTP, the main protocol used for communication between the server and the client, is a stateless protocol. Each HTTP request is processed independently at the server; even two related requests contain no information about each other. However, most web applications are stateful, and the server should be able to recognize the dependency between requests. Sessions and cookies have been used to solve this problem. However, there are four security properties that sessions do not emulate: preservation of trust state, data integrity, code integrity, and session integrity. Therefore, running a stateful application on a stateless framework without preserving these security properties will increase the number of vulnerabilities. According to the Open Web Application Security Project (OWASP), the top 10 security risks for 2013 were: 1. Injection, 2. Broken Authentication and Session Management, 3. Cross-Site Scripting (XSS), 4. Insecure Direct Object References, 5. Security Misconfiguration, 6. Sensitive Data Exposure, 7. Missing Function Level Access Control, 8. Cross-Site Request Forgery (CSRF), 9. Using Components with Known Vulnerabilities, 10. Unvalidated Redirects and Forwards. Table 1 shows the exploitability, the impact and what the attacker can do if the vulnerability has been exploited. Data mining is used in many applications; the most successful and popular are business intelligence and search engines. However, in recent years, data mining techniques have also been used in security.

Data mining techniques have added enhancements that improve the security field. For example, one proposed model improves the detection of malicious spam in educational institutes using a set of data mining techniques, such as feature extraction and feature selection, with different classifiers: Naïve Bayes, Support Vector Machine and Multilayer Perceptron. Usually, data mining techniques are first used to extract rules and features from the dataset; the best rules or features are then selected and used as parameters of the classifier. The classifier is trained and then used to classify new instances. This scenario is used in the applications mentioned above and in others; one such application in the security field is vulnerability detection, which is the subject of this report.
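The extract-select-classify scenario just described might look like the following scikit-learn sketch. Naïve Bayes matches one of the classifiers named above, while the TF-IDF features and the value of k are assumptions.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    # Extract features from raw text, keep the most informative ones, then
    # train a classifier on the reduced feature set.
    pipeline = Pipeline([
        ("extract", TfidfVectorizer(lowercase=True, stop_words="english")),
        ("select", SelectKBest(chi2, k=500)),   # k is an assumed feature budget
        ("classify", MultinomialNB()),
    ])
    # Usage: pipeline.fit(raw_messages, labels); pipeline.predict(new_messages)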


CHAPTER 3

SYSTEM ANALYSIS

3.1 EXISTING SYSTEM

The data mining technique currently in use comprises a model trained on the training dataset without any cross-validation or repeats. Hence the obtained accuracy is around 92%, and the false positive rate is high. Though all kinds of vulnerabilities are considered, the results for all of them have the same accuracy. The vulnerabilities include XSS and SQL injection.

3.2 PROPOSED SYSTEM

Efficient techniques such as cross-validation are used with a good number of repeats. The running time is also improved by assigning workers based on the number of processor cores. The model applied to the training dataset is then converted to a tune grid, which determines the efficient parameters that should be considered whenever a dataset is passed. On passing the test data to the obtained model, an accuracy of 94% is obtained.
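A minimal sketch of this setup in Python with scikit-learn (the report's terminology suggests R's caret, but no code is shown, so this translation is an assumption); the SVM classifier, the repeat count and the grid values are placeholders.

    from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
    from sklearn.svm import SVC

    # Repeated cross-validation: 10 folds, repeated 3 times (counts assumed).
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)

    # The tune grid: candidate parameter values to be searched during training.
    tune_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]}

    # n_jobs=-1 assigns one worker per available processor core, which is
    # what cuts the training time.
    search = GridSearchCV(SVC(), param_grid=tune_grid, cv=cv, n_jobs=-1)
    # Usage: search.fit(X_train, y_train); search.best_params_ then holds
    # the efficient parameters identified by the tune grid.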

 

 

CHAPTER 4

DESIGN AND IMPLEMENTATION

4.1 OVERALL DESCRIPTION

The work aims at improving the accuracy of detecting vulnerable data, thereby preventing users from accessing it and protecting the system. The dataset is pre-processed to clean it of unwanted information. The necessary packages are installed and added to the project. The available data is split into a training dataset and a testing dataset; the cleaned training data is used to fit the model, and the testing data is then passed through the trained model to obtain the predictions and the accuracy.
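The split-train-evaluate flow could be sketched as follows. X and y stand for the pre-processed feature matrix and labels, and the classifier and the 75/25 split ratio are assumptions, since the report does not state them.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Split the available data into training and testing portions.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42, stratify=y
    )

    # Fit a model on the training portion, then evaluate on the held-out data.
    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print("test accuracy:", accuracy_score(y_test, predictions))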

 

4.2 ARCHITECTURE DIAGRAM

 

4.3 LIST OF MODULES

Pre-processing
Clustering
Tuning
Training
Evaluation

 

4.3.1 PRE-PROCESSING

The dataset is taken from datasets already available in the open source community. It comprises 4600 observations, each of which has 58 variables recording which words occur and how frequently. The set of names that act as the column names of the dataset is available separately and is joined to the data to confirm its validity. The dataset obtained from this pre-processing step can then be used directly for the subsequent operations.
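A possible loading step in Python with pandas; the file names and layout are assumptions, while the shape check matches the 4600 observations and 58 variables described above.

    import pandas as pd

    # The observations and the column names ship separately, so the names
    # are attached while loading (file names here are assumptions).
    names = pd.read_csv("names.csv", header=None)[0].tolist()       # 58 column names
    dataset = pd.read_csv("dataset.csv", header=None, names=names)

    assert dataset.shape == (4600, 58)   # 57 word-frequency variables + 1 label
    X = dataset.iloc[:, :-1]             # frequency features
    y = dataset.iloc[:, -1]              # vulnerable / clean label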

