Columbus Data Scraping: November 2014

Friday, 28 November 2014

Webscraping using readLines and RCurl

There is a massive amount of data available on the web. Some of it is in the form of precompiled, downloadable datasets which are easy to access. But the majority of online data exists as web content such as blogs, news stories and cooking recipes. With precompiled files, accessing the data is fairly straightforward; just download the file, unzip if necessary, and import into R. For “wild” data however, getting the data into an analyzeable format is more difficult. Accessing online data of this sort is sometimes reffered to as “webscraping”. Two R facilities, readLines() from the base package and getURL() from the RCurl package make this task possible.

readLines

For basic webscraping tasks the readLines() function will usually suffice. readLines() allows simple access to webpage source data on non-secure servers. In its simplest form, readLines() takes a single argument – the URL of the web page to be read:

web_page <- readLines("http://www.interestingwebsite.com")

As an example of a (somewhat) practical use of webscraping, imagine a scenario in which we wanted to know the 10 most frequent posters to the R-help listserve for January 2009. Because the listserve is on a secure site (e.g. it has https:// rather than http:// in the URL) we can't easily access the live version with readLines(). So for this example, I've posted a local copy of the list archives on the this site.

One note, by itself readLines() can only acquire the data. You'll need to use grep(), gsub() or equivalents to parse the data and keep what you need.

# Get the page's source
web_page <- readLines("http://www.programmingr.com/jan09rlist.html")
# Pull out the appropriate line
author_lines <- web_page[grep("<I>", web_page)]
# Delete unwanted characters in the lines we pulled out
authors <- gsub("<I>", "", author_lines, fixed = TRUE)
# Present only the ten most frequent posters
author_counts <- sort(table(authors), decreasing = TRUE)
author_counts[1:10]
[webscrape results]

We can see that Gabor Grothendieck was the most frequent poster to R-help in January 2009.

The RCurl package

To get more advanced http features such as POST capabilities and https access, you'll need to use the RCurl package. To do webscraping tasks with the RCurl package use the getURL() function. After the data has been acquired via getURL(), it needs to be restructured and parsed. The htmlTreeParse() function from the XML package is tailored for just this task. Using getURL() we can access a secure site so we can use the live site as an example this time.

# Install the RCurl package if necessary
install.packages("RCurl", dependencies = TRUE)
library("RCurl")
# Install the XML package if necessary
install.packages("XML", dependencies = TRUE)
library("XML")
# Get first quarter archives
jan09 <- getURL("https://stat.ethz.ch/pipermail/r-help/2009-January/date.html", ssl.verifypeer = FALSE)
jan09_parsed <- htmlTreeParse(jan09)
# Continue on similar to above
...

For basic webscraping tasks readLines() will be enough and avoids over complicating the task. For more difficult procedures or for tasks requiring other http features getURL() or other functions from the RCurl package may be required. For more information on cURL visit the project page here.

Source: http://www.r-bloggers.com/webscraping-using-readlines-and-rcurl-2/

Thursday, 27 November 2014

Data Mining KNN Classifier

Q1

Suppose a data analyst working for an insurance company was asked to build a predictive model for predicting weather a customer will buy a mobile home insurance policy. S/he tried kNN classifier with different number of neighbours (k=1,2,3,4,5). S/he got the following F-scores measured on the training data: (1.0; 0.92; 0.90; 0.85; 0.82). Based on that the analyst decided to deploy kNN with k=1. Was it a good choice? How would you select an optimal number of neighbours in this case?

1 Answer

It is not a good idea to select a parameter of a prediction algorithm using the whole training set as the result will be biased towards this particular training set and has no information about generalization performance (i.e. performance towards unseen cases). You should apply a cross-validation technique e.g. 10-fold cross-validation to select the best K (i.e. K with largest F-value) within a range. This involves splitting your training data in 10 equal parts retain 9 parts for training and 1 for validation. Iterate such that each part has been left out for validation. If you take enough folds this will allow you as well to obtain statistics of the F-value and then you can test whether these values for different K values are statistically significant.

See e.g. also: http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Falg_knn_training_crossvalidation.htm

The subtlety here however is that there is likely a dependency between the number of data points for prediction and the K-value. So If you apply cross-validation you use 9/10 of the training set for training...Not sure whether any research has been performed on this and how to correct for that in the final training set. Anyway most software packages just use the abovementioned techniques e.g. see SPSS in the link. A solution is to use leave-one-out cross-validation (each data samples is left out once for testing) in that case you have N-1 training samples(the original training set has N).

Source:http://stackoverflow.com/questions/21121509/data-mining-knn-classifier?rq=1

Monday, 24 November 2014

Data Mining Outsourcing in a Better and Unique Approach

Data mining outsourcing services are ideal for clarity in various decision making processes. It is the ultimate goal of any organization and business to increase on its profits as well as strengthen the bond with its customers. Equipping the business in such a way that it’s very easy to detect frauds and manage risks in a convenient manner is equally important. Volumes of data that are irrelevant or cannot be used when raw needs to be converted to a more useful form. The data mining outsourcing services can greatly help you to analyze and interpret data in a more diligent way.

This service to reliable, experienced and qualified hands is very important. Your research project or engineering project can be easily and conveniently handled by experienced staff who guarantees you an accuracy level of about 98% and a massive reduction in operating costs. The quality of work is unsurpassed and the presentation is done in a format that is easy and simple for you. The project is done in a very short time alleviating you delays as well as ensuring on-time completion of your projects. To enjoy a successful outsourcing experience, you need to bank on a famous and reliable expertise.

The only time to rely with data mining outsourcing services is when you do not have a reliable, experienced expertise in your business. Statistics indicate that it’s very easy to lose business intelligence or expose the privacy of the customers through this process. However companies which offer secure outsourcing process are on the increase as a result of massive competition. It’s an opportunity to develop your potential of sourced data and improve your business in all fields.

Data mining potential applications are infinite. However major applications are in the marketing research and scientific projects. It’s done both on large and small quantities of data by experienced staff well known for their best analytical procedures to guarantee you accurate and easy to use information. Data mining outsourcing services are the only perfect way to profitability.

Source:http://www.e-edge.biz/Data_Mining_Outsourcing_in_a_Better_and_Unique_Approach.html

Thursday, 20 November 2014

Is It Time to End Screen Scraping?

As the industry works to improve the way online banking information is shared with personal financial management apps, a debate is brewing over whether to end the decades-old practice of screen scraping.

Proponents of the popular method say it is a valuable supplement to direct data feeds that may be incomplete or out-of-date. But screen scraping also raises risk concerns, since like other data collection methods it requires consumers to cough up their banking credentials.

"I have not talked to a bank that hasn't confirmed it's a growing problem in their organization," said Jim Routh, the chairman of the products and services committee at Financial Services Information Sharing and Analysis Center.

Financial institutions worry that data aggregators may not take all the appropriate security precautions. According to the FS-ISAC, an industry organization, startups are entering the aggregation market without making security a higher priority.

Routh, who is Aetna's chief information security officer and a former global head of application and mobile security for JPMorgan Chase, said the upstarts do some things well, but "protecting credentials isn't necessarily high on their priorities." The problem is worsened by data aggregators that collect marketing data, such as the device a consumer is using, to understand their behaviors across channels, he said.

The FS-ISAC has proposed creating a standard application programming interface to share information from bank accounts. The API would serve as the conduit for data when consumers wish to use a web or mobile app to receive push bill reminders, to verify their bank accounts or for numerous other PFM use cases.

The proposed API would also be designed to reduce the storage of financial data. But if the industry embraces the model, it would be harder for aggregators to do screen-scraping.

For years, PFM companies have used this tool to obtain customers' banking account information. With consumers' permission, aggregators log in with the customer's user name and password to grab financial data and use it to populate the mobile or web app of the customer's choice — whether or not the bank supports the technique.

Yodlee, which works with more than 300 banks as well as startups, argues that there is a place and a need for aggregators to collect data through various techniques to provide the best customer experience.

Brian Costello, vice president of operations and security at Yodlee, said his company uses a combination of methods to gather customer account data. If it couldn't get data from a direct feed, it could also screen scrape.

If the industry moved to embracing only one data exchange method, Yodlee could be more vulnerable to the problem of receiving outdated information from the banks.

When a bank changes an annual percentage rate, if it doesn't update the data feed it sends to the aggregator right away, the PFM services that rely on that data will appear stale. (Services like Credit Karma, Mint and Wallaby, for example, rely on aggregation technology to recommend financial products to consumers according to price, among other things.)

Proper maintenance of data feeds, of course, takes time and money — resources many banks are short on. But delays could also result from the bankers' dilemma: On the one hand, they want to let customers aggregate their accounts to gather intelligence on their competitors. On the other hand, they may have reservations about their rivals collecting that same data in the battle for wallet share.

"Banks are under tremendous pressure to retain and obtain more clients," said Costello.

Screen scraping also has maintenance requirements, though. The FS-ISAC white paper draft said the approach "requires some coordination from the FI to allow what appears to be an automated attack against their application. To avoid blocking the aggregator's attempt to screen scrape the financial institution's application with this or other current security controls, a whitelist of aggregator IPs are set up and maintained by the FIs."

Like Costello, Marc West, president of digital channels at Fiserv, said a combination of data collection methods is better than a standard data exchange approach that might fail to extract the necessary information. Any data feed, said West, offers a limited set of data and information, while a scrape can enable a custom data extract.

But Aetna's Routh said moving to a real-time API model would improve a recurring issue caused by screen scraping: customer service hiccups. A consumer may call the company behind the personal financial app when a link to an account is broken. The PFM provider might tell him to call the bank, when the problem could lie with the aggregator not knowing of an update to the bank's code.

"The consumer gets in the middle of a customer service issue that is thorny at best and unsolvable at worst," Routh said. "Unfortunately that happens more frequently than anyone would like to it happen.

The new model, then, is "inevitable" in Routh's point of view because of the risk and economics involved. "This won't happen overnight," he said. "It needs some legs."

Kristin Moyer, a research vice president in industry advisory services and banking and investment services at Gartner, said she expects more banks to embrace APIs as a way to compete in a digital world.

Already financial institutions like Capital One, Agricole Bank and Fidor Bank are piloting and testing the OAuth specification, which lets banks keep ownership of the customer log-in data but requires them to make available an API. (The FS-ISAC is also promoting OAuth 2.0 as a way to strengthen aggregation security.)

"It's something we will see a lot more of in the next two to three years," said Moyer. "It's an exciting time…I think the use of APIs will enable us as an industry [to do things] that we never really imagined possible before."

LESSONS ABROAD

The move away from screen scraping has already happened in some countries that lack a data exchange standard. Regulators in Poland, for example, recently recommended the practice halt. Responding to the guidance, mBank is one of the banks that changed its aggregation roadmap.

The bank, which spun off from BRE Bank, had been piloting a PFM service with friends and family and has now suspended the pilot. It had, however, already made use of aggregation technology so consumers, who weren't customers of the bank, could get loan decisions from mBank within half an episode of "Modern Family." Indeed, the bank would screen scrape consumers' external bank accounts to make a loan decision within five to 15 minutes. Now, loan decisions have to be made at a branch or for a smaller dollar amount after a consumer sends the bank a copy of an electronic statement.

"Right now we have to put it on the shelf. We haven't killed it. We want to resurrect it," said Michal Panowicz, senior director at mBank.

Overall, he sounds calm about the setback. "This is a regulator decision," said Panowicz. "We have to respect that. …We have to live with them on good footing."

But that doesn't mean it has given up on aggregation. Payday lenders can continue to screen scrape financial data in order to make loan decisions in Poland — which makes it an uneven playing field.

"We will try to convey the logic that [screen scraping] cannot be stopped," said Panowicz.

He views it as a longer term game for something he believes is valuable to consumers. mBank like other banks wants to realize the true aggregation dream: letting customers quickly switch bank accounts and products if they wish.

"To be honest, it's the most exciting part about aggregation... to move accounts to us without spending a minute of physical labor," he said.

Source:http://www.americanbanker.com/news/technology/is-it-time-to-end-screen-scraping-1071118-1.html

Tuesday, 18 November 2014

Data Scraping Guide for SEO & Analytics

Data scraping can help you a lot in competitive analysis as well as pulling out data from your client’s website like extracting the titles, keywords and content categories.

You can quickly get an idea of which keywords are driving traffic to a website, which content categories are attracting links and user engagement, what kind of resources will it take to rank your site…………and the list goes on…

Scraping Organic Search Results

By scraping organic search results you can quickly find out your SEO competitors for a particular search term. You can determine the title tags and the keywords they are targeting.

    The easiest way to scrape organic search results is by using the SERPs Redux bookmarklet.

For e.g if you scrape organic listings for the search term ‘seo tools’ using this bookmarklet, you may see the following results:

You can copy paste the websites URLs and title tags easily into your spreadsheet from the text boxes.

    Pro Tip by Tahir Fayyaz:

    Just wanted to add a tip for people using the SERPs Redux bookmarklet.

    If you have a data separated over multiple pages that you want to scrape you can use AutoPager for Firefox or Chrome to loads x amount of pages all on one page and then scrape it all using the bookmarklet.

Scraping on page elements from a web document

Through this Excel Plugin by Niels Bosma you can fetch several on-page elements from a URL or list of URLs like:

    Title tag
    Meta description tag
    Meta keywords tag
    Meta robots tag
    H1 tag
    H2 tag
    HTTP Header
    Backlinks
    Facebook likes etc.

Scraping data through Google Docs

Google docs provide a function known as importXML through which you can import data from web documents directly into Google Docs spreadsheet. However to use this function you must be familiar with X-path expressions.

    Syntax: =importXML(URL,X-path-query)

    url=> URL of the web page from which you want to import the data.

    x-path-query => A query language used to extract data from web pages.

You need to understand following things about X-path in order to use importXML function:

1. Xpath terminology- What are nodes and kind of nodes like element nodes, attribute nodes etc.

2. Relationship between nodes- How different nodes are related to each other. Like parent node, child node, siblings etc.

3. Selecting nodes- The node is selected by following a path known as the path expression.

4. Predicates – They are used to find a specific node or a node that contains a specific value. They are always embedded in square brackets.

If you follow the x-path tutorial then it should not take you more than an hour to understand how X path expressions works.

Understanding path expressions is easy but building them is not. That’s is why i use a firefbug extension named ‘X-Pather‘ to quickly generate path expressions while browsing HTML and XML documents.

Since X-Pather is a firebug extension, it means you first need to install firebug in order to use it.

How to scrape data using importXML()

Step-1: Install firebug – Through this add on you can edit & monitor CSS, HTML, and JavaScript while you browse.

Step-2: Install X-pather – Through this tool you can generate path expressions while browsing a web document. You can also evaluate path expressions.

Step-3: Go to the web page whose data you want to scrape. Select the type of element you want to scrape. For e.g. if you want to scrape anchor text, then select one anchor text.

Step-4: Right click on the selected text and then select ‘show in Xpather’ from the drop down menu.

Then you will see the Xpather browser from where you can copy the X-path.

Here i have selected the text ‘Google Analytics’, that is why the xpath browser is showing ‘Google Analytics’ in the content section. This is my xpath:

    /html/body/div[@id='page']/div[@id='page-ext']/div[@id='main']/div[@id='main-ext']/div[@id='mask-3']/div[@id='mask-2']/div[@id='mask-1']/div[@id='primary-content']/div/div/div[@id='post-58']/div/ol[2]/li[1]/a

Pretty scary huh. It can be even more scary if you try to build it manually. I want to scrape the name of all the analytic tools from this page: killer seo tools. For this i need to modify the aforesaid path expression into a formula.

This is possible only if i can determine static and variable nodes between two or more path expressions. So i determined the path expression of another element ‘Google Analytics Help center’ (second in the list) through X-pather:

    /html/body/div[@id='page']/div[@id='page-ext']/div[@id='main']/div[@id='main-ext']/div[@id='mask-3']/div[@id='mask-2']/div[@id='mask-1']/div[@id='primary-content']/div/div/div[@id='post-58']/div/ol[2]/li[2]/a

Now we can see that the node which has changed between the original and new path expression is the final ‘li’ element: li[1] to li[2]. So i can come up with following final path expression:

    /html/body/div[@id='page']/div[@id='page-ext']/div[@id='main']/div[@id='main-ext']/div[@id='mask-3']/div[@id='mask-2']/div[@id='mask-1']/div[@id='primary-content']/div/div/div[@id='post-58']/div/ol[2]//li/a

Now all i have to do is copy-paste this final path expression as an argument to the importXML function in Google Docs spreadsheet. Then the function will extract all the names of Google Analytics tool from my killer SEO tools page.

This is how you can scrape data using importXML.

    Pro Tip by Niels Bosma: “Anything you can do with importXML in Google docs you can do with XPathOnUrl directly in Excel.”

    To use XPathOnUrl function you first need to install the Niels Bosma’s Excel plugin. It is not a built in function in excel.

Note:You can also use a free tool named Scrapy for data scraping. It is an an open source web scraping framework and is used to extract structured data from web pages & APIs. You need to know Python (a programming language) in order to use scrapy.

Scraping on-page elements of an entire website

There are two awesome tools which can help you in scraping on-page elements (title tags, meta descriptions, meta keywords etc) of an entire website. One is the evergreen and free Xenu Link Sleuth and the other is the mighty Screaming Frog SEO Spider.

What make these tools amazing is that you can scrape the data of entire website and download it into excel. So if you want to know the keywords used in the title tag on all the web pages of your competitor’s website then you know what you need to do.

Note: Save the Xenu data as a tab separated text file and then open the file in Excel.

Scraping organic and paid keywords of an entire website

The tool that i use for scraping keywords is SEMRush. Through this awesome tool i can determine which organic and paid keyword are driving traffic to my competitor’s website and then can download the whole list into excel for keyword research. You can get more details about this tool through this post: Scaling Keyword Research & Competitive Analysis to new heights

Scraping keywords from a webpage

Through this excel macro spreadsheet from seogadget you can fetch keywords from the text of a URL(s). However you need an Alchemy API key to use this macro.

You can get the Alchemy API key from here

Scraping keywords data from Google Adwords API

If you have access to Google Adwords API then you can install this plugin from seogadget website. This plugin creates a series of functions designed to fetch keywords data from the Google Adwords API like:

getAdWordAvg()- returns average search volume from the adwords API.

getAdWordStats() – returns local search volume and previous 12 months separated by commas

getAdWordIdeas() – returns keyword suggestions based on API suggest service.

Check out this video to know how this plug-in works

Scraping Google Adwords Ad copies of any website

I use the tool SEMRush to scrape and download the Google Adwords ad copies of my competitors into excel and then mine keywords or just get ad copy ideas. Go to semrush, type the competitor website URL and then click on ‘Adwords Ad texts’ link on the left hand side menu. Once you see the report you can download it into excel.

Scraping back links of an entire website

The tool that you can use to scrape and download the back links of an entire website is: open site explorer

Scraping Outbound links from web pages

Garrett French of citation Labs has shared an excellent tool: OBL Scraper+Contact Finder which can scrape outbound links and contact details from a URL or URL list. This tool can help you a lot in link building. Check out this video to know more about this awesome tool:

Scraper – Google chrome extension

This chrome extension can scrape data from web pages and export it to Google docs. This tool is simple to use. Select the web page element/node you want to scrape. Then right click on the selected element and select ‘scrape similar’.

Any element/node that’s similar to what you have selected will be scraped by the tool which you can later export to Google Docs. One big advantage of this tool is that it reduces our dependency on building Xpath expressions and make scraping easier.

See how easy it is to scrape name and URLs of all the Analytics tools without using Xpath expressions.

Source: http://www.optimizesmart.com/data-scraping-guide-for-seo/

Monday, 17 November 2014

Is Web Scraping Legal?

Web scraping might be one of the best ways to aggregate content from across the internet, but it comes with a caveat: It’s also one of the hardest tools to parse from a legal standpoint.

For the uninitiated, web scraping is a process whereby an automated piece of software extracts data from a website by “scraping” through the site’s many pages. While search engines like Google and Bing do a similar task when they index web pages, scraping engines take the process a step further and convert the information into a format which can be easily transferred over to a database or spreadsheet.

It’s also important to note that a web scraper is not the same as an API. While a company might provide an API to allow other systems to interact with its data, the quality and quantity of data available through APIs is typically lower than what is made available through web scraping. In addition, web scrapers provide more up-to-date information than APIs and are much easier to customize from a structural standpoint.

The applications of this “scraped” information are widespread. A journalist like Nate Silver might use scrapers to monitor baseball statistics and create numerical evidence for a new sports story he’s working on. Similarly, an eCommerce business might bulk scrape product titles, prices, and SKUs from other sites in order to further analyze them.

Legality of Web ScrapingWhile web scraping is an undoubtedly powerful tool, it’s still undergoing growing pains when it comes to legal matters. Because the scraping process appropriates pre-existing content from across the web, there are all kinds of ethical and legal quandaries that confront businesses who hope to do leverage scrapers for their own processes.

In this “wild west” environment, where the legal implications of web scraping are in a constant state of flux, it helps to get a foothold on where the legal needle currently falls. The following timeline outlines some of the biggest cases involving web scrapers in the United States, and allows us to achieve a greater understanding on the precedents that surround the court rulings.

Terms of Use Tug-of-War—2000-2009

For years after they first came into use, web scrapers went largely unchallenged from a legal standpoint. In 2000, however, the use of scrapers came under heavy and consistent fire when eBay fired the first shot against an auction data aggregator called Bidder’s Edge. In this very early case, eBay argued that Bidder’s Edge was using scrapers in a way that violated Trespass to Chattels doctrine. While the lawsuit was settled out of court, the judge upheld eBay’s original injunction, stating that heavy bot traffic could very well disrupt eBay’s service.

Then in 2003’s Intel Corp. v. Hamidi, the California Supreme court overturned the basis of eBay v. Bidder’s Edge, ruling that Trespass to Chattels could not extend to the context of computers if no actual damage to personal property occurred.

So in terms of legal action against web scraping, Tresspass to Chattels no longer applied, and things were back to square one. This began a period in which the courts consistently rejected Terms of Service as a valid means of prohibiting scrapers, including cases like Perfect 10 v. Google, and Cvent v. Eventbrite.

The Takeaway: The earliest cases against scrapers hinged on Trespass to Chattels law, and were successful. However, that doctrine is no longer a valid approach.

Facebook Web Scraping2009—Facebook Steps In

In 2009, Facebook turned the tides of the web scraping war when Power.com, a site which aggregated multiple social networks into one centralized site, included Facebook in their service. Because Power.com was scraping Facebook’s content instead of adhering to their established standards, Facebook sued Power on grounds of copyright infringement.

In denying Power.com’s motion to dismiss the case, the Judge ruled that scraping can constitute copying, however momentary that copying may be. And because Facebook’s Terms of Service don’t allow for scraping, that act of copying constituted an infringement on Facebook’s copyright. With this decision, the waters regarding the legality of web scrapers began to shift in favor of the content creators.

The Takeaway: Even if a web scraper ignores infringing content on its way to freely-usable content, it might qualify as copyright infringement by virtue of having technically “copied” the infringing content first.

2011-2014— U.S. v Auernheimer

In 2010, hacker Andrew “Weev” Auernheimer found a security flaw in AT&T’s website, which would display the email addresses of users who visited the site via their iPads. By exploiting the flaw using some simple scripts and a scraper, Auernheimer was able to gather thousands of emails from the AT&T site.

Although these email addresses were publicly available, Auernheimer’s exploit led to his 2012 conviction, where he was charged with identity fraud and conspiracy to access a computer without authorization.

Data ScrapingEarlier this year, the court vacated Auernheimer’s conviction, ruling that the trial’s New Jersey venue was improper. But even though the case turned out to be mostly inconclusive, the court noted the fact that there was no evidence to show that “any password gate or code-based barrier was breached.” This seems to leave room for the web scraping of publicly-available personal information, although it’s still very much open to interpretation and not set in stone.

The Takeaway: Using a web scraper to aggregate sensitive personal information can lead to a conviction, even if that information was technically available to the public. While there is hope in the court’s observation that no passwords or barriers were broken to retrieve this information, the waters here are still very volatile.

2013—Associated Press vs. Meltwater

Meltwater is a software company whose “Global Media Monitoring” product uses scrapers to aggregate news stories for paying clients. The Associated Press took issue with Meltwater’s scraping of their original stories, some of which had been copyrighted. In 2012, AP filed suit against Meltwater for copy infringement and hot news misappropriation.

While it’s already been established that facts cannot be copyrighted, the court decided that the AP’s copyrighted articles—and more specifically, the way in which the facts within those articles were arranged—were not fair game for copying. On top of this, Meltwater’s use of the articles failed to meet the established fair use standards, and could not be defended on that front either.

The Takeaway: Fair use is limited when it comes to web scrapers, and copyrighted content is not always open to be scraped.

~~

By closely observing the outcomes of previous rulings, you’ll find that there are a few guidelines that a scraper should attempt to adhere to:

    Content being scraped is not copyright protected
    The act of scraping does not burden the services of the site being scraped
    The scraper does not violate the Terms of Use of the site being scraped
    The scraper does not gather sensitive user information
    The scraped content adheres to fair use standards

While all of these guidelines are important to understand before using scrapers, there are other ways to acclimate to the legal nuances. In many cases, you’ll find that a simple conversation with a business software developer or consultant will lead to some satisfying conclusions: Odds are, they’ve used scrapers in the past and can shed light on any snags they’ve hit in the process. And of course, talking with a lawyer is always an ideal course of action when treading into questionable legal territory.

Source:http://blog.icreon.us/2014/09/12/web-scraping-and-you-a-legal-primer-for-one-of-its-most-useful-tools/

Friday, 14 November 2014

Interactive Crawls for Scraping AJAX Pages on the Web

Crawling pages on the web has become an everyday affair for most enterprises. Too often do we come across offline businesses as well who’d like data gathered from the web for internal analyses. All this eventually to serve customers faster and better. At times, when the crawl job is high-end cum high-scale, businesses also consider DaaS providers to supplement their efforts.

However, the web landscape too has evolved with newer technologies that provide fancy experiences to web users. AJAX elements are one such common aid that leave even the DaaS providers perplexed. They come in various forms from a user’s point of view-

1. Load more results on the same page

2. Filter results based on various selection criteria

3. Submit forms, etc.

When crawling a non-AJAX page, simple GET requests do the job. However, AJAX pages work with POST requests that are not easy to trace for a normal bot.

Difference between GET request and POST request- Scraping

GET vs. POST

At PromptCloud, from our experience with a number of AJAX sites on the web, we’ve crossed the tech barrier. Below is a quick review about the challenges that come with AJAX crawling and its indicative solutions-

1. Javascript Emulations- A bot essentially emulates human browsing to fetch pages. When this needs to be done for Javascript components on a page, it gets tricky. Headless browser, which emulates human interaction with a web page without an interface, is the current approach. These browsers click on various elements/ dropdown lists that are embedded within Javascript code and capture responses to be transferred to programs. Which headless browser to pick depends on what fits well into your current stack.

2. Fetch Bandwidths- Unlike GET requests which complete pretty quickly, POST requests take quite a bit of time due to the number of events involved per fetch. Hence a good amount of bandwidth needs to be allocated in order to receive the response. For the same reason, wait times need to be taken care of too else you might end up with incomplete responses.

3. .NET Architectures- This is a more complex scenario related to maintaining the View State. Most of the postbacks come with an event and its validation. The bot needs to track the view state and pass validations for the event to occur so that the code can be executed and results captured. This is achieved by adopting a mechanism to restore states if things break midway.

4. Page Encoding- Request and response headers need to be taken care of on AJAX pages. The request needs to be sent in the exact format as expected by the server (Content-type or media type, accept fields, etc.) and similarly responses need to be parsed based on the content-type.

A Use Case

One of our clients who is into sale of event tickets at discounted rates had us crawl one of the ticketing sites on the web weekly; one of the most complex AJAX crawling we’ve dealt with so far. For the data that was to be extracted, multiple AJAX fetches were needed depending on the selections made. Requests had to be made for a combination of items from the dropdown box. These came with cookies and session IDs. To add to the challenge the site was extremely dynamic and changed its structure every week making it difficult for us to follow what data was where on the page.

We developed an AJAX crawler specific to this site to take care of all the dynamics. Response times were taken care of so that we didn’t miss any relevant information. We included an ML component to improve the crawler which is now pretty stable irrespective of changes on the site.

Overall, AJAX crawling requires more compute power in addition to the tech expertise. And because there’s no uniformity on the web, there’s always a new challenge to overcome in this landscape. It wouldn’t be an overrating if we said we’ve done a good job at that so far and have developed the knack :)

Reach out to us for any kind of web scraping/ crawling- either AJAX or not. We’ll take care of the complexities.

Source: https://www.promptcloud.com/blog/web-scraping-interactive-ajax-crawls/

Wednesday, 12 November 2014

Web scraping services-importance of scraped data

Web scraping services are provided by computer software which extracts the required facts from the website. Web scraping services mainly aims at converting unstructured data collected from the websites into structured data which can be stockpiled and scrutinized in a centralized databank. Therefore, web scraping services have a direct influence on the outcome of the reason as to why the data collected in necessary.

It is not very easy to scrap data from different websites due to the terms of service in place. So, the there are some legalities that have been improvised to protect altering the personal information on different websites. These ‘rules’ must be followed to the letter and to some extent have limited web scraping services.

Owing to the high demand for web scraping, various firms have been set up to provide the efficient and reliable guidelines on web scraping services so that the information acquired is correct and conforms to the security requirements. The firms have also improvised different software that makes web scraping services much easier.

Importance of web scraping services

Definitely, web scraping services have gone a long way in provision of very useful information to various organizations. But business companies are the ones that benefit more from web scraping services. Some of the benefits associated with web scraping services are:

    Helps the firms to easily send notifications to their customers including price changes, promotions, introduction of a new product into the market. Etc.
    It enables firms to compare their product prices with those of their competitors
    It helps the meteorologists to monitor weather changes thus being able to focus weather conditions more efficiently
    It also assists researchers with extensive information about peoples’ habits among many others.
    It has also promoted e-commerce and e-banking services where the rates of stock exchange, banks’ interest rates, etc. are updated automatically on the customer’s catalog.

Advantages of web scraping services

The following are some of the advantages of using web scraping services

    Automation of the data

    Web scraping can retrieve both static and dynamic web pages

    Page contents of various websites can be transformed

    It allows formulation of vertical aggregation platforms thus even complicated data can still be extracted from different websites.

    Web scraping programs recognize semantic annotation

    All the required data can be retrieved from their websites

    The data collected is accurate and reliable

Web scraping services mainly aims at collecting, storing and analyzing data. The data analysis is facilitated by various web scrapers that can extract any information and transform it into useful and easy forms to interpret.

Challenges facing web scraping

    High volume of web scraping can cause regulatory damage to the pages

    Scale of measure; the scales of the web scraper can differ with the units of measure of the source file thus making it somewhat hard for the interpretation of the data

    Level of source complexity; if the information being extracted is very complicated, web scraping will also be paralyzed.

It is clear that besides web scraping providing useful data and information, it experiences a number of challenges. The good thing is that the web scraping services providers are always improvising techniques to ensure that the information gathered is accurate, timely, reliable and treated with the highest levels of confidentiality.

Source: http://www.loginworks.com/blogs/web-scraping-blogs/191-web-scraping-services-importance-of-scraped-data/

Tuesday, 11 November 2014

Review: import.io’s New Scraping Process and Features

Web scraping Data platform import.io, announced last week that they have secured $3M in funding from investors that include the founders of Yahoo! and MySQL.

They also released a new beta version of the tool that is essentially a better version of their extraction tool, with some new features and a much cleaner and faster user experience.

First Impression

I’ve used the tool for a week and can say it is an improvement over the old version – which was a bit bulky and awkward. While still not exactly the most intuitive process, the development team at import.io has managed to slim down what was a relatively button heavy process, without sacrificing any of the functionality – they made the new workflow both simpler and more complicated at the same.

The new version features a simple tool bar across the top as opposed to the space hogging table and wizard from before, which is a large improvement on the pink and white of the previous version.

True, the loss of the wizard means there isn’t as much guidance as before (the pop-up help only appears on the first use), but the undo button means you don’t really need it. You can click around and experiment a bit with the different extraction options before settling down to do some real work.

Data Extraction

Once you’ve figured out how it works, the new version requires far fewer mouse clicks to get from the page to a table of data/API as shown in their homepage video.

All you need to do now is navigate to a website, click a single piece of data on the page – such as price, image, or URL – and their app will find all the other examples of similar data on the website, immediately creating a structured table of data.

download2

This latest version of the extractor also includes a important new feature labeled “Suggest Data”. Its important because it lets you extract all the data from a page, instantly creating a table of data that can be published as an API. This makes import.io very exciting and quick, I spent a long time playing with this and it worked on the majority of sites.

Advanced Features

Most non-programmer web scrapers struggle with complex sites that use JavaScript or iFrames, but import.io also now deals with this. In the basic mode you can toggle JavaScript and CSS on and off to help you see your data better.

If that doesn’t work, you can switch into an ‘advanced mode’ where import allows you to write your own XPath and RegExp. They’ve also added a source code view, though without the ability to click on the site and inspect element (like in Chrome) this feature isn’t particularly useful.

API Integration

Once you’ve created your scraper, there are a number of options for what you can do with it.

If you’ want you can just copy and paste the data into a spreadsheet or Download as CSV. You can also push your data directly Google Sheets, with import.io’s self generated formula.

For the rest of us, they have surfaced both the POST and GET requests for you and given you a JSON view which allows you to see how the data is returned, which is handy.

All this functionality is nice, and it’s clear they’re trying to cater to all technical levels, but it has made the API page somewhat messy and potentially confusing for newer or less technical users, but they should be able to get what they need.

Good with lots of Potential

Their new tool certainly isn’t perfect. There are still a few sites where manual row training is required and you can’t access the authentication feature (though you can still do this in the old version) or pagination.

Even if it’s not quite there yet, if import.io continue like this, they are well on its way to becoming the best data scraping platform on the market. Especially when you consider the “free for life” price tag.

Source:http://scraping.pro/review-import-ios-new-scraping-tools-features/

Sunday, 9 November 2014

Web Scraping Enters Politics

Web scraping is becoming an essential tool in gaining an edge over everything about just anything. This is proven by international news on US political campaigns, specifically by identifying wealthy donors. As is commonly known, election campaigns should follow a rule regarding the use of a certain limited amount of money for the expenses of each candidate. Being so, much of the campaign activities must be paid by supporters and sponsors.

It is not a surprise then that even politics is lured to make use of the dynamic and ever growing data mining processes. Once again, web mining has proven to be an essential component of almost all levels of human existence, the society, and the world as a whole. It proves its extraordinary capacity to dig precious information to reach the much aspired for goals of every individual.

Mining for personal information

The CBC News online very recently disclosed that the US Republican presidential candidate Mitt Romney has used data mining in order to identify rich donors. It is reported that the act of getting personal information such as the buying history and church attendance were vital in this incident. Through this information, the party was able to identify prospective rich donors and indeed tap them. As a businessman himself, Romney knows exactly how to fish and where the fat fish are. Moreover, what is unique about the identified donors is that they have never been donating before.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/web-scraping-enters-politics/

Friday, 7 November 2014

Why People Hesitate To Try Data Mining

What is hindering a number of people from venturing into the promising world of data mining? Despite so much encouragement, promotions, testimonials, and evidences of the benefits of online data collection, still only a handful take the challenge and really gain the pay offs it has to offer.

It may sound unthinkable that such an opportunity for success has been neglected by many. It may also sound absurd why many well-meaning individuals are hindered from enjoying the benefits of the blessings of the 21st century.

The Causes

After considerable observation and analysis of the human psyche, one can understand the underlying reasons behind the hesitance to try the profitable data mining service. The most common reasons why people are afraid to try new technology or why they remain passive and uninvolved are: fear; lack of knowledge; and pride.

Fear. The most paralyzing of human emotions is fear. It can, to some extent, cause a person to be insane, unprofitable, sick, and lost. Although fear is a normal reaction to certain stimuli and a natural feeling experienced by humans, it must always be monitored and controlled. Usually, people share common fears, such as: fear of change; fear of anything new; and fear of the unknown.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/people-hesitate-try-data-mining/