Twitter Firehose



Only 4 companies have full Firehose access:
1.     Gnip (owned by Twitter): https://gnip.com/
2.     Topsy (owned by Apple): http://topsy.com/
3.     DataSift
4.     NTT Data: http://www.nttdata.com/global/en/

Users send 400 million tweets every day.
The only way to access 100% of those tweets in real-time is through the Twitter “Firehose”. The other options for accessing tweets are Twitter’s direct API offerings: the Search API and the Streaming API.

Twitter’s Search API

First up is Twitter’s Search API, which involves polling Twitter’s data through a search or username. The Search API gives you access to a data set that already exists: tweets that have already occurred. Through the Search API, users request tweets that match some sort of “search” criteria. The criteria can be keywords, usernames, locations, named places, etc. A good way to think of the Twitter Search API is to picture how an individual user would run a search directly at Twitter (navigating to search.twitter.com and entering keywords).

How much data can you get with the Twitter Search API?

With the Twitter Search API, developers query (or poll) tweets that have already occurred and are limited by Twitter’s rate limits. For an individual user, the most you can receive is that user’s last 3,200 tweets, regardless of the query criteria. With a specific keyword, you can typically only poll the last 5,000 tweets per keyword. You are further limited by the number of requests you can make in a certain time period. Twitter’s request limits have changed over the years; the current limit is 180 requests in a 15-minute period.
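To make the request limit concrete, here is a minimal Python sketch of a client-side limiter for the 180-requests-per-15-minutes figure quoted above. The class and function names (WindowRateLimiter, steady_polling_interval) are hypothetical helpers written for these notes, not part of any Twitter library, and the sketch only tracks timing; it does not send any requests.

```python
import time
from collections import deque


class WindowRateLimiter:
    """Hypothetical client-side limiter: at most max_requests per window."""

    def __init__(self, max_requests=180, window_seconds=15 * 60,
                 clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self.sent = deque()  # timestamps of requests still inside the window

    def seconds_until_allowed(self):
        """Return 0.0 if a request may be sent now, else the wait in seconds."""
        now = self.clock()
        # Forget requests that have aged out of the window.
        while self.sent and now - self.sent[0] >= self.window_seconds:
            self.sent.popleft()
        if len(self.sent) < self.max_requests:
            return 0.0
        # The oldest request must age out before another one is allowed.
        return self.window_seconds - (now - self.sent[0])

    def record_request(self):
        """Call this right after a request has actually been sent."""
        self.sent.append(self.clock())


def steady_polling_interval(max_requests=180, window_seconds=15 * 60):
    """Seconds between polls if requests are spread evenly over the window."""
    return window_seconds / max_requests
```

Spreading 180 requests evenly over 15 minutes (900 seconds) works out to one request every 5 seconds, which is what steady_polling_interval() returns with its defaults.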

Twitter’s Streaming API

Unlike Twitter’s Search API, where you poll data from tweets that have already happened, Twitter’s Streaming API is a push of data as tweets happen in near real-time. With the Streaming API, users register a set of criteria (keywords, usernames, locations, named places, etc.) and as tweets match the criteria, they are pushed directly to the user. Think of this as an agreement between the end user and Twitter: you agree with Twitter that whenever they receive tweets that match keywords relating to “hockey”, they will deliver those tweets directly to you as they happen. This is a push of data by Twitter, rather than a pull of data initiated by the end user.
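The push model described above can be sketched with a toy in-memory stream in Python. This is only an illustration of "register criteria once, then receive matches as they arrive"; StreamSketch, register and publish are made-up names, not the real Twitter Streaming API, and the keyword match is a naive whitespace split that ignores punctuation.

```python
class StreamSketch:
    """Toy stand-in for a push-style stream (not the real Twitter API)."""

    def __init__(self):
        self.subscriptions = []  # list of (keyword set, callback) pairs

    def register(self, keywords, callback):
        """Register criteria once; matching tweets are pushed to callback."""
        lowered = {k.lower() for k in keywords}
        self.subscriptions.append((lowered, callback))

    def publish(self, tweet_text):
        """Simulate a new tweet arriving and push it to matching subscribers."""
        words = set(tweet_text.lower().split())
        for keywords, callback in self.subscriptions:
            if keywords & words:  # at least one registered keyword appears
                callback(tweet_text)


received = []
stream = StreamSketch()
stream.register(["hockey"], received.append)  # the "agreement" with the stream
stream.publish("Great hockey game tonight")   # matches, so it is pushed
stream.publish("Nothing to see here")         # no match, never delivered
```

The consumer never asks for data after registering; the only tweet that ends up in received is the one containing "hockey".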
The major drawback of Twitter’s Streaming API is that it provides only a sample of the tweets that are occurring. The actual percentage of total tweets users receive varies heavily with the criteria they request and the current traffic. Studies have estimated that users of the Streaming API can expect to receive anywhere from 1% to over 40% of tweets in near real-time. The reason you do not receive all of the tweets from the Streaming API is simply that Twitter doesn’t have the infrastructure to support it, and they don’t want to; hence, the Twitter Firehose.


Twitter’s Firehose

The final way to access data is by having access to the full Twitter Firehose. The Twitter Firehose is in fact very similar to Twitter’s Streaming API, as it pushes data to end users in near real-time, but the Twitter Firehose guarantees delivery of 100% of the tweets that match your criteria.
The Twitter Firehose is handled by two data providers, Gnip and DataSift, which have tight relationships with Twitter. Similar to the Streaming API, the Firehose consists of an agreement between an end user and a distributor of the Firehose (Gnip or DataSift) on what tweets the end user should receive in near real-time. As the data providers receive tweets, they push them directly to the end user.
The two differences between Twitter’s Streaming API and Firehose access are that the Firehose guarantees delivery of 100% of the matching tweets and that it is not free. The Twitter Streaming API is free to use but gives you limited results (and limited licensing usage of the data). Access to the Twitter Firehose removes a lot of the usage restrictions imposed by Twitter but is fairly costly.

Complete Twitter Data Access
Realtime and Historical Twitter Data To Meet The Needs Of Your Business
Gnip was the first authorized reseller of Twitter data. We provide realtime data as well as access to every publicly available Tweet dating back to the very first Tweet from March 21, 2006. Whether you are looking for Tweets about specific keywords, high volumes of data, or historical data, we've got you covered!
  
API
(Application Programming Interface) An API dictates how two pieces of software work with each other. In the case of social data, most information is shared through a streaming API.
Backfill
Backfill is Gnip's product that allows you to briefly disconnect from your realtime stream and easily get all of your data when you reconnect.
Big Data
Big Data is a term to describe the value that companies are seeing from using data to create actionable insights.
Choice of Protocols
Choice of protocols means you can receive the data in the format you prefer: GET, POST, or Streaming.
Complete
Complete data is when customers have access to the entire set of data on a platform so they never miss a conversation.
Data Collector
Data Collector is Gnip's product that collects and normalizes data from public APIs including Instagram, Flickr, YouTube & more.
Data Mining
A method of computer science that sifts through data to find patterns using machine learning, statistics, database systems and more.
Data Scientist
Considered a relatively new field, the profession of data scientist means different things to different companies and often is a combination of statistics, machine learning, business intelligence, etc.
Data Scraping
Data scraping is when a company doesn't get the data from a social media publisher but rather scrapes content where they can find it. It is never complete, reliable or sustainable.
Decahose
Gnip's decahose provides a random 10 percent sample of the full firehose. We'd also like to openly admit it should be called a decihose, since deci means a tenth while deca means ten.
Enrichments
Enrichments are how Gnip provides additional metadata to its data streams making it easier for our customers to digest data. Examples include Klout scores, geo location, expanding shortened URLs and more.
Firehose
Firehose is a term first coined by Twitter to describe their complete set of data. Now firehose, in conjunction with social media, means that you have access to the full set of a social media publisher's data.
Geotagged
Geotagged data is content to which the user has chosen to attach an exact location, where the social media publisher offers that choice. Geotagged content most often comes from a smartphone.
JSON
(JavaScript Object Notation) JSON is a text-based open standard designed for data interchange that is readable by humans and easy for computers to parse. JSON is the format Gnip delivers its social data in.
Machine Learning
Machine learning is the concept that you can teach a machine to make better predictions and decisions based on data.
Natural Language Processing
Natural language processing is the discipline of teaching computers to understand the human language.
Node.js
A JavaScript runtime that makes it easy to build network applications. It's another way to connect to Gnip and consume data.
Predictive Analytics
The ability to predict future behavior and actions based on past data using machine learning, statistics, data mining and other techniques.
Public API
Many social media publishers offer a public API providing access to their data but it is often rate limited.
PowerTrack
PowerTrack is Gnip's powerful filtering language that gives you the ability to get complete coverage of the data you need.
RESTful API
With a REST API, you make requests to the server (often limited to a certain number per time period) and get data back only after each request.
Sentiment Analysis
Sentiment analysis is a technique for determining the feelings expressed in text, that is, whether the sentiment of the text is angry, sad, happy, etc.
Social Data
Expresses social media in a computer-readable format (e.g. JSON) and shares metadata about the content to help provide not only content, but context. Metadata often includes information about location, engagement and links shared. Unlike social media, social data is focused strictly on publicly shared experiences.
Social Media
User-generated content where one user communicates and expresses themselves and that content is delivered to other users. Examples of this are platforms such as Twitter, Facebook, YouTube, Tumblr and Disqus. Social media is delivered in a great user experience, and is focused on sharing and content discovery. Social media also offers both public and private experiences with the ability to share messages privately.
Streaming API
With a Streaming API, your request stays open after you make it, and data keeps coming your way for as long as it does.
XML
Extensible Markup Language - a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
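
To make the JSON, Geotagged and Social Data entries above concrete, here is a small Python sketch that parses one JSON record and reports whether it carries a location. The payload and its field names ("text", "user", "geo", "lat", "lon") are invented for illustration; they are an assumption of this sketch and not Gnip's or Twitter's actual schema.

```python
import json

# Hypothetical payload: these field names are made up for this example.
raw = ('{"id": 1, "text": "Go team!", "user": "example_user", '
       '"geo": {"lat": 40.0, "lon": -105.3}}')


def summarize(raw_json):
    """Parse one JSON record and pull out the content plus location context."""
    record = json.loads(raw_json)  # JSON text -> Python dict
    geo = record.get("geo")        # absent when the user did not geotag
    return {
        "text": record["text"],
        "geotagged": geo is not None,
        "location": (geo["lat"], geo["lon"]) if geo else None,
    }


summary = summarize(raw)
```

A record without a "geo" field is simply reported as not geotagged, which mirrors the glossary's point that location is metadata the user opts into.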