B2B E-Mart Fraud Detection & Web Email Extractor

Commercial fraud is becoming a serious and still unresolved problem as the E-Market grows rapidly. It appears in many fields, including E-Banking, insurance, and stock markets, so there is a pressing need to keep people from being defrauded. In this post, I will introduce a B2B online E-Mart fraud detection method that uses millions of records extracted from the web.

B2B E-Mart Fraud Detection works by tracing users’ information, including user ID, name, and email address. Fraudulent users can be identified by correlating their details with existing fraud-related information, such as email addresses and phone numbers, across multiple dimensions. This means we need to pull and store large volumes of data scraped from websites, then analyze it to find abnormal relationships and suspicious visiting behaviors.

As is well known, there are two modes of B2B E-Market:

Based on Info Service: This E-Market operates without online transactions.

Based on transactions: This E-Market operates with transactions.

What we deal with here is the second mode, which may threaten users’ property.

This detection method classifies users’ info into static and dynamic info. Static info includes username, Phone Number, Email Address, Skull info, and comments. Dynamic info contains search history, email sending and receiving behaviors, posting, and others. For the static info, we use Association Analysis from data mining; for the dynamic info, we use Logistic Regression from machine learning. We then assign different weights to the two results and combine them into a warning score. Users whose score exceeds the Warning Threshold are judged to be defrauders and blacklisted.
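To make the combination step concrete, here is a minimal sketch of how a logistic score for dynamic behavior could be blended with a static score. The feature names, coefficients, weights, and the 0.8 threshold are all hypothetical placeholders; in practice the coefficients would be fitted on labeled data and the weights tuned.

```python
import math


def logistic(z: float) -> float:
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dynamic_score(features: dict, coefficients: dict, intercept: float) -> float:
    """Logistic-regression style probability from behavioral features.

    `coefficients` and `intercept` are assumed to come from a model
    fitted elsewhere; the values used below are made up.
    """
    z = intercept + sum(coefficients[name] * value for name, value in features.items())
    return logistic(z)


def warning_score(static_score: float, dynamic_score_value: float,
                  static_weight: float = 0.5) -> float:
    """Weighted blend of the static and dynamic results (both in [0, 1])."""
    if not 0.0 <= static_weight <= 1.0:
        raise ValueError("static_weight must be between 0 and 1")
    return static_weight * static_score + (1.0 - static_weight) * dynamic_score_value


def is_flagged(score: float, threshold: float = 0.8) -> bool:
    """True when the score is over the (hypothetical) warning threshold."""
    return score > threshold


# Hypothetical behavioral features for one user.
behavior = {"emails_sent_per_day": 40.0, "posts_per_day": 12.0}
fitted = {"emails_sent_per_day": 0.05, "posts_per_day": 0.1}
score = warning_score(static_score=0.9,
                      dynamic_score_value=dynamic_score(behavior, fitted, -2.0))
print(is_flagged(score))
```

Keeping the weight as a single parameter makes it easy to shift emphasis between the static and dynamic results while the score stays in the 0 to 1 range.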

Static Info Process:

As mentioned before, the static info is analyzed with Association Analysis, a data mining technique.

The first question is how to extract large volumes of data from the target websites. There are several options: using the public APIs that some websites provide, building our own crawler in Python or Ruby, or choosing an extraction/scraper tool that extracts web data automatically in a structured way. In this post, I want to suggest one popular extraction/scraper tool, Octoparse, for your reference. This tool is effective at scraping web data without coding. Its UI is very user-friendly: we just need to click and drag to customize our extraction patterns.

Next, we should clean the extracted data and put it into a standard format so that it can be fed to the Association Analysis model.
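As an illustration of this cleaning step, the sketch below normalizes a scraped record into a standard shape. The field names (user_id, name, email, phone) and the normalization rules are assumptions made for the example, not a description of any particular tool's output.

```python
import re


def normalize_email(raw: str):
    """Lower-case and trim an email; return None if it does not look like one."""
    email = raw.strip().lower()
    # Deliberately loose shape check: something@something.something
    return email if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email) else None


def normalize_phone(raw: str):
    """Keep digits only so differently formatted numbers compare equal."""
    digits = re.sub(r"\D", "", raw)
    return digits or None


def clean_record(record: dict) -> dict:
    """Return a record in a standard format; missing or invalid fields become None."""
    return {
        "user_id": str(record.get("user_id", "")).strip(),
        "name": " ".join(str(record.get("name", "")).split()),
        "email": normalize_email(str(record.get("email", ""))),
        "phone": normalize_phone(str(record.get("phone", ""))),
    }


scraped = {"user_id": 7, "name": "  Ann   Lee ", "email": " Ann@Example.COM",
           "phone": "+1 (555) 010-9999"}
print(clean_record(scraped))
```

Normalizing before analysis matters because association rules only fire on exact matches, so "Ann@Example.COM" and "ann@example.com" must end up as the same value.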

Then, we should select a proper Support Threshold and Confidence Threshold. Confidence measures how reliable a rule is, that is, how often the rule holds when its condition is met, and so indicates the strength of the association. Support measures how prevalent the association is in the data, which tells us whether it is a general pattern or an isolated case.
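For readers unfamiliar with these two measures, here is a small sketch that computes them over a toy dataset using the standard association-rule definitions. The records and tag names such as "shared_phone" are invented for illustration.

```python
def support(records, itemset) -> float:
    """Fraction of records that contain every item in `itemset`."""
    itemset = set(itemset)
    if not records:
        return 0.0
    hits = sum(1 for record in records if itemset <= set(record))
    return hits / len(records)


def confidence(records, antecedent, consequent) -> float:
    """Of the records containing `antecedent`, the fraction that also contain `consequent`."""
    antecedent_support = support(records, antecedent)
    if antecedent_support == 0:
        return 0.0
    return support(records, set(antecedent) | set(consequent)) / antecedent_support


# Each record is the set of tags observed for one user (made-up data).
records = [
    {"shared_phone", "fraud"},
    {"shared_phone", "fraud"},
    {"shared_phone"},
    {"free_email"},
]

# Rule under test: shared_phone -> fraud
print(support(records, {"shared_phone", "fraud"}))       # 2 of 4 records
print(confidence(records, {"shared_phone"}, {"fraud"}))  # 2 of the 3 shared_phone records
```

A rule would be kept only if both values clear their chosen thresholds, which filters out associations that are either rare or unreliable.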

After extracting users’ info, which includes Email Address, Name, and Phone number, we can match recorded defrauders’ info against other users’ info across multiple dimensions, such as Email and Phone number, to find suspicious users. Info belonging to different dimensions or different user groups is assigned its own Support and Confidence Thresholds. Users who exceed the thresholds are recorded and passed on to the synthesized analysis.
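The matching idea can be sketched as a lookup against an index of known defrauders' values, one set per dimension. The record layout and the dimension names below are hypothetical, and a real system would apply per-dimension thresholds on top of these raw matches.

```python
def build_index(fraud_records, dimensions):
    """Collect the known-defrauder values for each dimension (e.g. email, phone)."""
    index = {dim: set() for dim in dimensions}
    for record in fraud_records:
        for dim in dimensions:
            value = record.get(dim)
            if value:
                index[dim].add(value)
    return index


def matched_dimensions(user, index):
    """Return the dimensions on which this user shares a value with a known defrauder."""
    return [dim for dim, values in index.items() if user.get(dim) in values]


# Made-up blacklist and candidate users.
known_defrauders = [
    {"email": "x@bad.example", "phone": "111"},
    {"email": "y@bad.example", "phone": "222"},
]
index = build_index(known_defrauders, ["email", "phone"])

users = [
    {"user_id": "u1", "email": "x@bad.example", "phone": "999"},
    {"user_id": "u2", "email": "ok@good.example", "phone": "222"},
    {"user_id": "u3", "email": "fine@good.example", "phone": "333"},
]
for user in users:
    print(user["user_id"], matched_dimensions(user, index))
```

Using one set per dimension keeps each lookup constant-time, which helps when the candidate list runs into millions of users.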

The synthesized analysis takes the results of the static Association Analysis as input. It assigns different weights to those results and derives the Warning Threshold from the weighted combination. Users are then classified by score into groups: “Fatal”, “Serious”, “Suspicious”, “Attention”, and “Normal”. Those categorized as “Fatal” are blacklisted.
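A minimal sketch of this grouping step follows. The band boundaries and per-dimension weights are invented for the example, since the post does not specify them; only the five group names and the blacklisting of "Fatal" users come from the text.

```python
# Hypothetical lower bounds for each group, checked from most to least severe.
SEVERITY_BANDS = [
    (0.9, "Fatal"),
    (0.75, "Serious"),
    (0.6, "Suspicious"),
    (0.5, "Attention"),
]


def combined_static_score(dimension_scores: dict, weights: dict) -> float:
    """Weighted average of per-dimension association scores (each in [0, 1])."""
    total_weight = sum(weights[dim] for dim in dimension_scores)
    if total_weight == 0:
        return 0.0
    weighted = sum(weights[dim] * score for dim, score in dimension_scores.items())
    return weighted / total_weight


def classify(score: float) -> str:
    """Map a score to a severity group; anything below every band is Normal."""
    for lower_bound, label in SEVERITY_BANDS:
        if score >= lower_bound:
            return label
    return "Normal"


def should_blacklist(score: float) -> bool:
    """Only users in the Fatal group are blacklisted."""
    return classify(score) == "Fatal"


score = combined_static_score({"email": 1.0, "phone": 0.5}, {"email": 3, "phone": 1})
print(score, classify(score), should_blacklist(score))
```

Keeping the bands in a single ordered list means the thresholds can be retuned without touching the classification logic.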

This approach is aimed at building B2B E-Market fraud detection on top of data mining. By analyzing the static and dynamic data of providers and buyers, we can recognize suspicious users without knowing the detailed transaction info. To achieve this, we put more emphasis on login, browsing, and email history behaviors, and use data mining to process these data.