Social Media Data Archives - relataly.com

Automate Crypto Trading with a Python-Powered Twitter Bot and Gate.io Signals

Florian Follonier — Wed, 19 May 2021 04:57:00 +0000

This tutorial develops a Twitter bot in Python that will generate automated trading signals. The bot will pull real-time price data on various cryptocurrencies (Bitcoin, Ethereum, Doge, etc.) from the crypto exchange Gate.io and analyze it using predefined rules. Whenever the bot detects a relevant price change, it automatically posts a tweet via Twitter. Simple Twitter bots can proactively inform their audiences about relevant events in the market. Such an event can be a sharp rise or fall in price or a sudden spike in the trading volume. If we examine data for specific price movements, we can also store these events and use them later to train a predictive model.

More advanced signal bots use predictive models to signal when it is appropriate to enter or exit the market, or they execute the buy and sell orders directly. A well-defined signaling logic can therefore be the first step toward algorithmic trading. But one thing at a time: in this article, we will begin by developing a simple signal bot.

The rest of this article is structured as follows. First, we take a look at the different modules of the Twitter bot and briefly introduce the APIs used to build it. After that, we’ll implement the modules in Python. Finally, we will integrate the modules and run some tests.

Bots can do a lot of cool things but should be used with caution. Image created with Midjourney

Different Modules of the Signal Bot

This section briefly describes the conceptual architecture of the Crypto Twitter bot. Its architecture adheres to a modular design pattern and separates into four loosely coupled modules. Each module has a clear function.

The Data Collection Module retrieves price data from the crypto exchange Gate.io. The module sends requests at regular intervals against the gate.io API. The module adds the data to separate data stores – one for each cryptocurrency. It then forwards the data to the preprocessing module.
The Data Preprocessing Module calculates the statistical indicators, such as moving averages or means, which become the basis for the signaling logic.
The Signaling Module searches for relevant events based on the indicator values provided. If a relevant event is detected, it is reported to the communication module.
The Communication Module connects to the Twitter API. As soon as it is informed about a new event, it tweets about this event on Twitter.
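
The flow through the four modules can be sketched roughly as follows. This is only a minimal illustration: the function names (collect, preprocess, detect_signal, communicate), the 5% threshold, and the sample prices are assumptions made for this sketch, and a plain list stands in for the Twitter API. The actual implementation follows in the steps further down.

```python
# Minimal sketch of the four-module flow; all names and numbers are illustrative.

def collect(store, pair, price):
    """Data collection: append a new price to the data store of one pair."""
    store.setdefault(pair, []).append(price)
    return store[pair]

def preprocess(prices):
    """Preprocessing: derive simple indicators from the price history."""
    mean_price = sum(prices) / len(prices)
    return {"last": prices[-1], "mean": mean_price}

def detect_signal(pair, indicators, threshold=0.05):
    """Signaling: report when the last price deviates strongly from the mean."""
    deviation = indicators["last"] / indicators["mean"] - 1
    if abs(deviation) >= threshold:
        return f"{pair} deviates {deviation:+.1%} from its mean"
    return None

def communicate(message, outbox):
    """Communication: a list stands in for the Twitter API here."""
    outbox.append(message)

store, outbox = {}, []
for price in [100.0, 101.0, 99.0, 120.0]:
    history = collect(store, "BTC_USDT", price)
    signal = detect_signal("BTC_USDT", preprocess(history))
    if signal:
        communicate(signal, outbox)

print(outbox)
```

Only the last price point (120.0) deviates enough from the running mean to produce a message, so the outbox ends up with a single entry.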

Now that you are familiar with the modules of our Crypto Twitter Bot, we can take a look at its underlying APIs.

Components of the Relataly Crypto Signal Bot

About the APIs Used in this Tutorial

In this tutorial, we will be using two APIs:

The Gate.io API to fetch price data.
The Twitter API to post tweets about trading signals.

The Gate.io API

Firstly, we will be using the Gate.io API to obtain prices for various cryptocurrencies. Gate.io is one of the smaller crypto exchanges in the crypto-verse. However, it offers a wide range of smaller cryptocurrencies, especially those you cannot trade anywhere else. As of now, the gate.io market endpoint does not require authentication to use its essential functions.

Check out our recent relataly gate.io tutorial to learn how to pull data via the gate.io API in Python.

The Twitter API

The second API that our bot will use is the Twitter API. We will use this API via the Python package Tweepy to post crypto price signals. Check out this article if you are looking for a simple code example of submitting tweets via the Twitter API. If you don’t want to use Twitter, you can disable its use in the code.

Posting tweets via the API requires authentication with a valid developer account. You can apply for a developer account for free on the Twitter developer website. Just be aware that the confirmation can sometimes take several days.

Storing the Twitter API Key

Storing API keys in your code can compromise the security of your application. If the code becomes public, for example, because you publish it on a code-sharing website like GitHub, anyone with access to the code can use the API key to make requests to the API and potentially access sensitive information or harm your account or application. A better practice is to keep the API key in a separate YAML file and load it into your project from there. To store the Twitter API key, create a YAML file with the name “api_config_twitter.yml” and insert your API key into this file as follows:

api_key: "your api key"

Implementing a Twitter Signal Bot using Python

In this article, we will walk through the process of creating a Twitter bot that automatically tweets updates about cryptocurrency prices. The bot will be designed to pull real-time data on cryptocurrency prices from an external API, and then automatically generate and post tweets on a regular basis. By the end of the article, you will have a fully functional Twitter bot that can keep your followers informed about the latest cryptocurrency prices.

Note: You require a Twitter developer account if you want to use the Twitter functionality. Without an account, you can still print out trading signals to yourself, but you will not be able to post them via the Twitter API.

The code is available on the GitHub repository.


Disclaimer: This article does not constitute financial advice. Stock markets can be very volatile and are generally difficult to predict. Predictive models and other forms of analytics applied in this article only illustrate machine learning use cases.

Python Prerequisites

Before starting the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment set up yet, you can follow this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:

pandas
NumPy

In addition, we will use the following two packages:

Firstly, the gate.io package (package name gate-api) pulls crypto price data from gate.io.
Secondly, we will use the Twitter API library Tweepy to post trading signals via the Twitter API.

You can install packages using console commands:

pip install <package name>
conda install <package name> (if you are using the Anaconda package manager)

Step #1: Regular Retrieval of Price Data

First, we will define a “Prices” class to handle the incoming data flow. The class contains a “get_latest_prices” method that retrieves price information from gate.io. A background thread calls this method at regular intervals, and each call queries the gate.io list_tickers market endpoint.

The list_tickers endpoint returns a list of data fields for cryptocurrency pairs. Examples of price pairs are BTC_USD, BTC_ETH, BTC_ADA, etc. We can limit the response to a single price pair by passing that pair as a parameter in the API call. However, it is not possible to restrict the response to multiple pairs. We either get data for a single pair or all pairs. The response contains a list of the following data fields:
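
Because the endpoint returns either one pair or all pairs, a watchlist of several pairs has to be filtered on the client side. The following minimal sketch illustrates the idea. The sample data, the helper select_pairs, and the use of plain dictionaries instead of the ticker objects returned by the gate_api package are assumptions made for illustration.

```python
# Stand-in for the API response: plain dicts instead of gate_api ticker objects.
sample_tickers = [
    {"currency_pair": "BTC_USDT", "last": "43000.5"},
    {"currency_pair": "ETH_USDT", "last": "3100.2"},
    {"currency_pair": "DOGE_USDT", "last": "0.12"},
    {"currency_pair": "ETH_BTC", "last": "0.072"},
]

def select_pairs(tickers, wanted_pairs):
    """Keep only the tickers whose pair is in wanted_pairs."""
    wanted = set(wanted_pairs)
    return [t for t in tickers if t["currency_pair"] in wanted]

watchlist = select_pairs(sample_tickers, ["BTC_USDT", "DOGE_USDT"])
print([t["currency_pair"] for t in watchlist])
```

The bot in this tutorial uses a similar client-side filter, but on the pair name: it keeps all pairs that contain USDT.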

Overview of the data fields in the response

The following code maintains a dictionary with one entry per cryptocurrency pair. The key is the name of the cryptocurrency pair, and the value is a DataFrame that holds the price data history. Each time the crypto bot receives a new response from the API, it goes through the response, extracts the price data (price, volume, etc.), and appends this data to the DataFrame of the respective cryptocurrency pair. Then the information is passed to the preprocessing module.

import pandas as pd
import numpy as np
import json
import requests
import datetime as dt
import logging
import threading
import time

import tweepy 
import gate_api
from gate_api.exceptions import ApiException, GateApiException
from twitter_secrets import twitter_secrets as ts # place the twitter_secrets file under /anaconda3/Lib

class Prices:
    """Class that uses the gate api to retrieve currency data."""

    def __init__(self, config):
        self._config = config
        self._logger = logging.getLogger(__name__)
        configuration = gate_api.Configuration(host="https://api.gateio.ws/api/v4")
        api_client = gate_api.ApiClient(configuration)
        self._api_instance = gate_api.SpotApi(api_client)
        self._price_history = {}
        self._cont_update_thread = None
        self._stop_cont_update_thread = None
        self._price_history_lock = threading.Lock()

    def get_price_history(self):
        """Returns a dictionary with the price histories for the currencies."""
        return self._price_history, self._price_history_lock

    def get_latest_prices(self):
        """Gets new price data and adds the values to a DataFrame.

        Returns the DataFrame in a dictionary with the currencies as keys."""
        timestamp = dt.datetime.now()
        try:
            api_response = self._api_instance.list_tickers()
        except GateApiException as e:
            logging.warning(
                "Gate api exception, label: %s, message: %s\n" % (e.label, e.message)
            )
            return {}
        except ApiException as e:
            logging.warning("Exception when calling SpotApi->list_tickers: %s\n" % e)
            return {}
        latest_prices = {}
        for response in api_response:
            currency = response.currency_pair
            if "USDT" not in currency or "BEAR" in currency:
                continue
            value_dict = {
                "base_volume": pd.to_numeric(response.base_volume),
                "change_percentage": pd.to_numeric(response.change_percentage),
                "etf_leverage": pd.to_numeric(response.etf_leverage),
                "etf_net_value": pd.to_numeric(response.etf_net_value),
                "etf_pre_net_value": pd.to_numeric(response.etf_pre_net_value),
                "etf_pre_timestamp": response.etf_pre_timestamp,
                "high_24h": pd.to_numeric(response.high_24h),
                "highest_bid": pd.to_numeric(response.highest_bid),
                "high_bid": pd.to_numeric(response.highest_bid),
                "last": pd.to_numeric(response.last),
                "low_24h": pd.to_numeric(response.low_24h),
                "lowest_ask": pd.to_numeric(response.lowest_ask),
                "quote_volume": pd.to_numeric(response.quote_volume),
                "timestamp": timestamp,
            }
            latest_prices[currency] = pd.DataFrame(value_dict, index=[1])
        return latest_prices

    def start_cont_update(self):
        self._stop_cont_update_thread = threading.Event()
        self._stop_cont_update_thread.clear()
        self._cont_update_thread = threading.Thread(
            target=self._cont_update,
            args=(
                self._stop_cont_update_thread,
                self._price_history_lock,
            ),
        )
        self._cont_update_thread.start()
        self._logger.info("Started continuous price logging")

    def _cont_update(self, stop_event, lock):
        """Continuously adds new prices to the price history."""
        while not stop_event.is_set():
            start_time = time.time()
            # the with-statement releases the lock even if an error occurs
            with lock:
                for currency, df in self.get_latest_prices().items():
                    if currency in self._price_history:
                        # DataFrame.append was removed in pandas 2.0, so we use pd.concat
                        self._price_history[currency] = pd.concat(
                            [self._price_history[currency], df], ignore_index=True
                        )
                    else:
                        self._price_history[currency] = df
            self._logger.debug("Currency_dfs updated")
            self._wait_before_update(start_time)

    def _wait_before_update(self, start_time):
        elapsed_time = time.time() - start_time
        self._logger.debug(f"Elapsed time: {elapsed_time}")
        if elapsed_time > self._config["price_update_delay"]:
            delay = 0
            self._logger.warning(
                "It took longer to retrieve the price data than the update_delay!"
            )
        else:
            delay = self._config["price_update_delay"] - elapsed_time
        self._logger.debug(f"Waiting {delay}s until next update")
        time.sleep(delay)
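
The class above creates a stop event, but the listing does not show a method that sets it. The following minimal sketch shows one way to stop such a polling loop cleanly. The helper names remaining_delay and run_periodically are hypothetical and not part of the original code. The sketch waits on the event instead of calling time.sleep, so the thread ends as soon as the event is set.

```python
import threading
import time

def remaining_delay(start_time, interval, now=None):
    """Seconds left until the next update is due (never negative)."""
    now = time.time() if now is None else now
    return max(0.0, interval - (now - start_time))

def run_periodically(task, interval, stop_event):
    """Call task() every `interval` seconds until stop_event is set."""
    while not stop_event.is_set():
        start_time = time.time()
        task()
        # Event.wait returns early as soon as stop_event is set,
        # unlike time.sleep, which always blocks for the full delay.
        stop_event.wait(remaining_delay(start_time, interval))

calls = []
stop_event = threading.Event()
worker = threading.Thread(
    target=run_periodically,
    args=(lambda: calls.append(time.time()), 0.05, stop_event),
)
worker.start()
time.sleep(0.2)      # let the worker run for a moment
stop_event.set()     # ask the worker to stop
worker.join(timeout=1)
print(f"task ran {len(calls)} times")
```

In the Prices class, the same idea would amount to setting self._stop_cont_update_thread and joining self._cont_update_thread.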

Step #2: Calculate Indicator Values

Next, we will define a few functions that process the regular data inflow from gate.io and calculate indicator values for the different cryptocurrencies.

Absolute price differences tell the bot whether the price moved up or down. However, our signaling logic will primarily work with thresholds on percentage values. In the code below, the names of these indicators end with “_p” (for example, price_delta_p).

In addition, we will avoid misleading signals by incorporating moving averages into the signaling logic. Moving averages work on historical data, so we have to hand over the price history when we call the “calc_indicators” function. Furthermore, we take over other indicators from the data frame, including low_24h and high_24h, and we keep the price change of the preceding price point. This additional context lets us build more robust trading signals.

All indicators are calculated separately for each crypto pair, passed to a dictionary, and then passed to the signaling logic. In the next step, we can use these indicator values in our signaling rules.

def calc_indicators(price_history):
    indicators = {}
    indicators_over_all = calc_indicators_over_all(price_history)
    for currency, df in price_history.items():
        if len(df) <= 2:
            logging.getLogger().debug(
                f"Skipped '{currency}' when calculating indicators due to a lack of information"
            )
            continue
        volume = df["base_volume"].iloc[-1]
        last_price = df["last"].iloc[-1]
        moving_avg_price = df["last"].mean()
        moving_average_volume = df["base_volume"].mean()
        moving_average_deviation_percent = np.round(
            div(last_price, moving_avg_price) - 1, 2
        )

        price_before = df["last"].iloc[-2]
        price_delta = last_price - price_before
        price_delta_p = div(price_delta, last_price)
        price_delta_before = price_before - df["last"].iloc[-3]
        price_delta_p_before = div((price_before - df["last"].iloc[-3]), price_before)
        low_24h = df["low_24h"].iloc[-1]
        high_24h = df["high_24h"].iloc[-1]
        low_high_diff_p = div(high_24h - low_24h, low_24h)
        change_percentage = df["change_percentage"].iloc[-1]

        indicator_values = {
            "last_price": last_price,
            "price_before": price_before,
            "volume": volume,
            "moving_avg_price": moving_avg_price,
            "moving_average_volume": moving_average_volume,
            "moving_average_deviation_percent": moving_average_deviation_percent,
            "price_delta_p": price_delta_p,
            "price_delta": price_delta,
            "price_delta_before": price_delta_before,
            "price_delta_p_before": price_delta_p_before,
            "high_24h": high_24h,
            "low_24h": low_24h,
            "low_high_diff_p": low_high_diff_p,
            "change_percentage": change_percentage,
        }
        indicator_values.update(indicators_over_all)
        indicators[currency] = indicator_values
    return indicators


def calc_indicators_over_all(price_history):
    avg_change_p = 0
    for currency, df in price_history.items():
        avg_change_p += df["change_percentage"].iloc[-1]
    nr_of_currencies = len(price_history)
    avg_change_p = div(avg_change_p, nr_of_currencies)
    values = {
        "avg_change_p": avg_change_p,
    }
    return values


def div(dividend, divisor, alt_value=0.0):
    return dividend / divisor if divisor != 0 else alt_value
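
To make the percentage indicators more tangible, here is a small worked example with a hypothetical price history of three values. It mirrors the arithmetic of calc_indicators above, including the fact that the change is divided by the later of the two prices.

```python
def div(dividend, divisor, alt_value=0.0):
    """Division that falls back to alt_value when the divisor is zero."""
    return dividend / divisor if divisor != 0 else alt_value

# Hypothetical price history of one pair, oldest value first.
last_prices = [100.0, 104.0, 110.0]

last_price = last_prices[-1]
price_before = last_prices[-2]

price_delta = last_price - price_before                       # 110 - 104 = 6
price_delta_p = div(price_delta, last_price)                  # 6 / 110
price_delta_before = price_before - last_prices[-3]           # 104 - 100 = 4
price_delta_p_before = div(price_delta_before, price_before)  # 4 / 104

print(round(price_delta_p, 4), round(price_delta_p_before, 4))
```

The latest step yields a change of about 5.45%, and the step before it about 3.85%. With a threshold of 7%, neither of them would trigger a signal.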

Step #3: Define the Signaling Logic of the Twitter Bot

Our bot will use a signaling logic that differentiates between the following price signals:

A simple uptick: price_delta_p is positive, and its absolute value reaches or exceeds the threshold delta_threshold_p (7% in the configuration used below).
A simple downtick: price_delta_p is negative, and its absolute value reaches or exceeds the same threshold.
New 24-hour lows and highs: the bot also reports when the last price falls below the 24-hour low or rises above the 24-hour high.
Trend changes: the bot reports when an upward or downward price trend accelerates or slows down.
Trend reversals: the bot reports when a price reverses its previous movement (pullback or recovery).

Overview of the different trading signals generated by the signaling logic

Be aware that price_delta_p measures the percentage change relative to the previous price point. The signaling logic therefore depends heavily on the interval at which the bot requests new price data. With shorter intervals, signals trigger less often because larger price changes typically build up over longer periods. For more details regarding the signaling logic, please view the code below.
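
The effect of the sampling interval can be illustrated with a small experiment. The price path below is hypothetical (a steady rise of 2% per step), and the helper relative_deltas is an assumption made for this sketch. It divides each change by the later price, as the indicator code does.

```python
def relative_deltas(prices):
    """Change between consecutive samples, relative to the later price."""
    return [(cur - prev) / cur for prev, cur in zip(prices, prices[1:])]

THRESHOLD = 0.07  # same value as delta_threshold_p in the bot configuration

# Hypothetical price path: a steady 2% rise per sampling step.
fine_samples = [100 * 1.02 ** i for i in range(9)]
# The same path, but sampled only every fourth step.
coarse_samples = fine_samples[::4]

fine_hits = [d for d in relative_deltas(fine_samples) if abs(d) >= THRESHOLD]
coarse_hits = [d for d in relative_deltas(coarse_samples) if abs(d) >= THRESHOLD]

print(len(fine_hits), len(coarse_hits))
```

With the short interval, no single step reaches the 7% threshold. With the longer interval, the same price movement adds up to roughly 7.6% per sample, and both samples trigger.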

def check_signal(currency, indicators, cs_config):
    ind = indicators[currency]
    signal = ''
    if (ind['moving_avg_price'] > 0
            and ind['last_price'] > 0.0
            and abs(ind['price_delta']) > 0.0
            and abs(ind['price_delta_p']) >= cs_config["delta_threshold_p"]
            and ind['volume'] > 0
    ):
        # up
        if ind['price_delta'] > 0:
            movement_type = 'up +'
            if abs(ind['price_delta_p_before']) > cs_config["delta_threshold_p"]:
                if ind['price_delta_before'] <= 0:
                    movement_type = 'recovery from ' + str(ind['price_before']) + ' to ' + str(ind['last_price'])
                else:
                    if ind['price_delta_p'] * (1-cs_config["delta_threshold_p"]) > ind['price_delta_p_before']:
                        movement_type = 'upward trend accelerates +'
                    elif ind['price_delta_p'] < ind['price_delta_p_before'] * (1-cs_config["delta_threshold_p"]):
                        movement_type = 'upward trend slows down +'
                    elif ind['price_delta_p'] * (1+cs_config["delta_threshold_p"]) >= ind['price_delta_p_before'] >= ind['price_delta_p'] * (1-cs_config["delta_threshold_p"]):
                        movement_type = 'upward trend continues +'
        # down
        elif ind['price_delta'] < 0:
            movement_type = 'down '
            if abs(ind['price_delta_p_before']) > cs_config["delta_threshold_p"]:
                if ind['price_delta_before'] > 0:
                    movement_type = 'pullback from ' + str(ind['price_before']) + ' to ' + str(ind['last_price'])
                else:
                    if ind['price_delta_p'] * (1-cs_config["delta_threshold_p"]) > ind['price_delta_p_before']:
                        movement_type = 'down trend accelerates '
                    elif ind['price_delta_p'] * (1+cs_config["delta_threshold_p"]) >= ind['price_delta_p_before'] >= ind['price_delta_p'] * (1-cs_config["delta_threshold_p"]):
                        movement_type = 'down trend continues '
                    elif ind['price_delta_p'] < ind['price_delta_p_before'] * (1+cs_config["delta_threshold_p"]):
                        movement_type = 'downward trend slows down '

        signal = get_signal_log(movement_type, currency, ind['price_delta_p'], ind['last_price'],
                                ind['moving_avg_price'], ind['volume'], ind['price_delta'], ind['change_percentage'],
                                ind['high_24h'], ind['low_24h'], ind['low_high_diff_p'])

        check_24h_peak(currency, ind['last_price'], ind['low_24h'], ind['high_24h'])

    return signal
    # trade_signal


def check_24h_peak(currency, last_price, low_24h, high_24h):
    if last_price < low_24h:
        print(currency + ' new 24h low $' + str(last_price))
    elif last_price > high_24h:
        print(currency + ' new 24h high $' + str(last_price))


def get_signal_log(movement_type, currency, price_delta_p, last_price, moving_avg_price, volume, price_delta,
                   daily_up_p, high_24h, low_24h, low_high_diff_p):
    signal = f'{currency} {movement_type} ' \
             f'{np.round(price_delta_p * 100, 5)}% ' \
             f'MA:${np.round(moving_avg_price, 6)} ' \
             f'last_price:${np.round(last_price, 6)} ' \
             f'price delta:{np.round(price_delta, 6)} ' \
             f'volume:${np.round(volume, 1)} ' \
             f'daily_change:{np.round(daily_up_p, 2)}% ' \
             f'high_24h:${high_24h} ' \
             f'low_24h:${low_24h} ' \
             f'low_high_diff_p:{np.round(low_high_diff_p * 100, 2)}%'
    return signal
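
Note that check_24h_peak above only prints its finding to the console. If the 24-hour lows and highs should also be forwarded to the communication module, the check could return the message instead. The function peak_message below is a hypothetical variant that sketches this idea. It is not part of the original code.

```python
def peak_message(currency, last_price, low_24h, high_24h):
    """Return a message for a new 24h low or high, or an empty string."""
    if last_price < low_24h:
        return f"{currency} new 24h low ${last_price}"
    if last_price > high_24h:
        return f"{currency} new 24h high ${last_price}"
    return ""

# Values taken from the sample log output further down in this article.
print(peak_message("EOSBULL_USDT", 1.39e-05, 9.8e-06, 1.16e-05))
```

An empty return value means that the last price is still within the 24-hour range, so the caller can skip the message with a simple truth check.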

Step #4: Send Tweets via Twitter

Next, we define a simple function that calls the Twitter API and tweets our price signal. Because the Twitter API requires authentication, you must provide the API authentication credentials from a valid Twitter developer account.

It’s best not to store the API credentials directly in code. A slightly better, though still imperfect, option is to keep the credentials in a separate Python file (for example, called “twitter_secrets”) that you put into your package folder (for example, under /anaconda3/Lib), from where you can import it directly into your code.

# Twitter Consumer API keys
CONSUMER_KEY    = "api123"
CONSUMER_SECRET = "api123"

# Twitter Access token & access token secret
ACCESS_TOKEN    = "api123"
ACCESS_SECRET   = "api123"

BEARER_TOKEN = "api123"

class TwitterSecrets:
    """Class that holds Twitter Secrets"""

    def __init__(self):
        self.CONSUMER_KEY    = CONSUMER_KEY
        self.CONSUMER_SECRET = CONSUMER_SECRET
        self.ACCESS_TOKEN    = ACCESS_TOKEN
        self.ACCESS_SECRET   = ACCESS_SECRET
        self.BEARER_TOKEN   = BEARER_TOKEN
        
        # Tests if keys are present
        for key, secret in self.__dict__.items():
            assert secret != "", f"Please provide a valid secret for: {key}"

twitter_secrets = TwitterSecrets()

Once you have imported the file, you can then load the API credentials from the file in the following way:

consumer_key = ts.CONSUMER_KEY
consumer_secret = ts.CONSUMER_SECRET
access_token = ts.ACCESS_TOKEN
access_secret = ts.ACCESS_SECRET

### Print API Auth Data (leave disabled for security reasons)
# print(f'consumer_key: {consumer_key}')
# print(f'consumer_secret: {consumer_secret}')
# print(f'access_token: {access_token}')
# print(f'access_secret: {access_secret}')

# authenticate to access the Twitter API
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)

# link appended to every tweet (assumed value, adjust it to your needs)
relataly_url = "https://www.relataly.com"

def send_pricechange_tweet(signal):
    api.update_status(f"{signal} \n {relataly_url}")
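
The signal strings generated in Step #3 can get long, while a tweet is limited to 280 characters. The following minimal sketch shortens the signal before it is sent. The helper build_tweet and the URL are assumptions made for illustration. The sketch counts plain Python characters, which is only an approximation because Twitter applies its own counting rules to URLs and emojis.

```python
TWEET_LIMIT = 280  # maximum tweet length in characters

def build_tweet(signal, url, limit=TWEET_LIMIT):
    """Combine signal and URL, shortening the signal if the text gets too long."""
    suffix = f"\n{url}"
    room = limit - len(suffix)
    if len(signal) > room:
        # reserve three characters for the ellipsis
        signal = signal[: room - 3].rstrip() + "..."
    return signal + suffix

long_signal = "EOSBULL_USDT up + 19.42% " + "x" * 300
tweet = build_tweet(long_signal, "https://www.relataly.com")
print(len(tweet))
```

The result of build_tweet could then be passed to api.update_status in place of the raw signal.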

Step #5: Starting the Crypto Signal Bot

Finally, we can hit the start button of our crypto signal bot. But before we do this, take a look at some configuration options of the bot.

CYCLE_DELAY is the standard interval in seconds in which the bot will call the gate.io API.
CURRENCY_PAIR is another API parameter that limits the request to a specific currency pair. In the standard setting (an empty string), the bot scans the entire market and evaluates all USDT pairs.
TWITTER_ACTIVE defines whether the bot posts signals on Twitter. Be aware that your bot may instantly report any signal on your Twitter account if you enable it.
RUNS defines the max number of prices that the bot will retrieve before the bot stops.

Now, let’s test the bot:

RUNS = 50 # the bot will stop after 50 price points
CYCLE_DELAY = 20 # the interval for checking the data and retrieving another price point
EVAL_PRICES_DELAY = 10
CURRENCY_PAIR = "" # the bot will retrieve data for all currency pairs listed on gate.io
PRICES_CONFIG = {"price_update_delay": 20}
TWITTER_ACTIVE = False

CHECK_SIGNAL_CONFIG = {
    "moving_avg_threshold_down_p": 0.10,
    "moving_avg_threshold_up_p": 0.10,
    "delta_threshold_p": 0.07,
    'enable_twitter': TWITTER_ACTIVE,
}

if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO, format="\033[02m%(asctime)s %(levelname)s: %(message)s"
    )
    logger = logging.getLogger(__name__)
    prices = Prices(PRICES_CONFIG)
    prices.start_cont_update()
    currency_dfs = {}
    logging.info(f"Crypto bot is starting - please wait")
    logger.info(f"Collecting crypto data from gate.io for {EVAL_PRICES_DELAY}s")
    time.sleep(EVAL_PRICES_DELAY)
    logger.info(f"\n<< Crypto signal bot started :-) >>")
    logger.info(f"<< Checking prices every {CYCLE_DELAY} seconds >>")
    logger.info(f"Now checking for signals - please wait\n")
    for i in range(RUNS):
        price_history, lock = prices.get_price_history()
        with lock:
            indicators = calc_indicators(price_history)
        for currency in indicators.keys():
            if not indicators[currency]:
                continue
            signal = check_signal(
                currency,
                indicators,
                CHECK_SIGNAL_CONFIG,
            )
            if signal:
                logger.info(signal)
                if CHECK_SIGNAL_CONFIG['enable_twitter']:
                    send_pricechange_tweet(signal)
                    print('send via twitter')
        time.sleep(CYCLE_DELAY)

2022-03-09 11:40:38,939 INFO: Started continuous price logging
2022-03-09 11:40:38,940 INFO: Crypto bot is starting - please wait
2022-03-09 11:40:38,940 INFO: Collecting crypto data from gate.io for 10s
2022-03-09 11:40:48,941 INFO: 
<< Crypto signal bot started :-) >>
2022-03-09 11:40:48,942 INFO: << Checking prices every 20 seconds >>
2022-03-09 11:40:48,942 INFO: Now checking for signals - please wait

2022-03-09 11:52:06,800 INFO: EOSBULL_USDT up + 19.42446% MA:$1.1e-05 last_price:$1.4e-05 price delta:3e-06 volume:$1272326905.1 daily_change:33.65% high_24h:$1.16e-05 low_24h:$9.8e-06low_high_diff_p:18.37%
EOSBULL_USDT new 24h high $1.39e-05
send via twitter

And this is what the tweets will look like on Twitter:

Summary

Congratulations on completing this tutorial! In this article, you learned how to build a Python-based Twitter crypto signal bot. When run, the bot will regularly retrieve cryptocurrency quotes from the Gate.io exchange and tweet about any price movements based on a simple signaling logic.

While the signaling logic in this tutorial is kept simple, this basic framework provides a foundation for you to further develop and enhance the signaling rules. For example, you could consider using volume or price volatility changes as the basis for defining signals. Have fun experimenting and expanding upon this project!

If you found this article helpful, please show your appreciation by leaving a comment. Cheers

Posting Tweets On Twitter using Python and Tweepy

Florian Follonier — Sun, 09 May 2021 15:14:45 +0000

In a previous article, we showed how to retrieve social media data via the Twitter API in Python. However, the Twitter API can do much more, such as interacting with a Twitter user account and posting automated tweets. This article shows how this works. We will use the Twitter API and the Tweepy library to submit tweets to our Twitter account.

A common use case for submitting tweets via the Twitter API is a Twitter bot. Today, many bots on Twitter send automated tweets, for example, about unusual movements in the stock market or other types of events. However, it is worth mentioning that bots are also used for evil purposes, for example, to lure people into scams or influence political opinions.

This tutorial lays the foundation for building a simple Twitter bot with Python and Tweepy. The remainder of this article is structured as follows: We’ll begin by briefly looking at the tweet object on Twitter. Then, we will write some code that authenticates against the Twitter API and submits some test tweets.

The Tweet Object of the Twitter API

Tweets are the basic building blocks of Twitter. They have several customization options, such as media, hashtags, and emojis. We can use all of these options by specifying respective parameters in our requests to the Twitter API.

First of all, a tweet contains up to 280 characters of text. The text can include hashtags or emojis, which also occupy space in terms of characters.

While hashtags are indicated via the #-sign, emojis are displayed via standard Unicode. Most emojis occupy two characters of the maximum text length, but some may require more. Here you can find an overview of the emoji Unicode.
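
One way to see why an emoji can occupy two characters is to compare the number of Unicode code points with the number of UTF-16 code units. The following sketch is only an illustration: the helper utf16_units is an assumption, and Twitter applies its own, more detailed counting rules.

```python
def utf16_units(text):
    """Number of UTF-16 code units needed to encode the text."""
    return len(text.encode("utf-16-le")) // 2

text = "to the moon \U0001F680"  # ends with the rocket emoji
print(len(text), utf16_units(text))
```

Python counts the rocket emoji as a single character, while the UTF-16 representation needs two code units for it.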

Optionally, tweets can contain media objects such as images, GIFs, or polls. We can attach these elements via a separate API function.

Submitting a Twitter Tweet

Implementation: Posting Tweets via the Twitter API in Python

This tutorial shows how to write a short Python script that authenticates against Twitter. Then we will submit some test Tweets using your Twitter account. We will look at two different cases:

Submitting a simple text-only tweet
Submitting a tweet that contains text and a media file

The code is available on the GitHub repository.


Prerequisites

Using the Twitter API requires a Twitter developer account. If you don’t have one yet, you can follow the steps described in this tutorial to create an account for free.

Also, make sure you install all required packages. In this article, we will be working with the following standard packages:

pandas

In addition, we will be using Tweepy. Tweepy is an easy-to-use Python library for accessing the Twitter API.

You can install packages using console commands:

pip install <package name>

Step #1: Load Twitter Account Credentials for Authentication

Before interacting with the Twitter API, we must authenticate with our developer account credentials. The developer account is linked to the Twitter account specified during registration. So, when you execute an API request, it will use the associated Twitter user.

Storing the Account Credentials in a Python File

We should not store the user credentials directly in our Python notebooks. Instead, we should use a Python file and import it into our notebook. You can use the following sample file and replace the values with your Twitter API keys, secrets, and tokens.

# Twitter Consumer API keys
CONSUMER_KEY    = "api123"
CONSUMER_SECRET = "api123"

# Twitter Access token & access token secret
ACCESS_TOKEN    = "api123"
ACCESS_SECRET   = "api123"

BEARER_TOKEN = "api123"

class TwitterSecrets:
    """Class that holds Twitter Secrets"""

    def __init__(self):
        self.CONSUMER_KEY    = CONSUMER_KEY
        self.CONSUMER_SECRET = CONSUMER_SECRET
        self.ACCESS_TOKEN    = ACCESS_TOKEN
        self.ACCESS_SECRET   = ACCESS_SECRET
        self.BEARER_TOKEN   = BEARER_TOKEN
        
        # Tests if keys are present
        for key, secret in self.__dict__.items():
            assert secret != "", f"Please provide a valid secret for: {key}"

twitter_secrets = TwitterSecrets()

Once you have stored the API keys in the file and the file in the right folder, you can load the API keys into your Python project with the following code.

import pandas as pd
import tweepy 

# place the twitter_secrets file under /anaconda3/Lib
from twitter_secrets import twitter_secrets as ts

consumer_key = ts.CONSUMER_KEY
consumer_secret = ts.CONSUMER_SECRET
access_token = ts.ACCESS_TOKEN
access_secret = ts.ACCESS_SECRET

### Print API Auth Data (leave disabled for security reasons)
# print(f'consumer_key: {consumer_key}')
# print(f'consumer_secret: {consumer_secret}')
# print(f'access_token: {access_token}')
# print(f'access_secret: {access_secret}')

Alternative: Storing the Account Credentials in a YAML File

Alternatively, you can put the credentials into a YAML file called “api_config_twitter.yml.” The file should look as follows, and you can place it in a subfolder “API Keys” in your working directory:

api_key: api123 
api_secret: api123
access_token: api123
access_secret: api123

You can then import the token and access keys with the code below.

## In case you prefer to load the auth data from a local yaml file, use the following code
import yaml
# load the API keys and further data from your local yaml file
# place the api_config_twitter.yml file in the "API Keys" subfolder
with open('API Keys/api_config_twitter.yml', 'r') as yaml_file:
    p = yaml.safe_load(yaml_file)

try:
    consumer_key = p['api_key']
    consumer_secret = p['api_secret']
    access_token = p['access_token']
    access_secret = p['access_secret']
except KeyError as err:
    # a missing entry in the YAML file raises a KeyError
    print(f'missing key in api_config_twitter.yml: {err}')

Step #2: Request User Authentication via the API

When you have the auth data available in your project, you can authenticate against the Twitter API.

#authenticating to access the twitter API
auth=tweepy.OAuthHandler(consumer_key,consumer_secret)
auth.set_access_token(access_token,access_secret)
api=tweepy.API(auth)

Step #3: Post a Text-only Tweet on Twitter

Once we have successfully authenticated against the Twitter API, we can interact with our Twitter user account. The code below submits a test tweet via the Twitter API. As you can see, the tweet text contains one hashtag (#Python) and two cashtags ($BTC and $ETH).

# Define the tweet text
tweet='this is an automated test tweet using #Python $BTC $ETH'

# Generate text tweet
api.update_status(tweet)

Once you run the code, the tweet will immediately appear in the feed of our Twitter account:

A simple test tweet sent via the Twitter API

Step #4: Include Mediafiles in Tweets via the API

We can also include media files such as photos and videos in our tweets. For this case, Tweepy provides a separate function called “update_with_media.” This function takes two arguments: the image path and the tweet text.

Before running the code below, you need to change the image_path to reference an image file on your computer.

# Define the tweet text
tweet_text='This is another automated test tweet using #Python'
image_path ='Test Images/ETH_price.png'

# Generate text tweet with media (image)
status = api.update_with_media(image_path, tweet_text)

Et voilà: Another Tweet has appeared on our Twitter Account. This time, the post includes the sample text and a media file.

Twitter tweet with an image attached

Summary

This article has shown how you can use Tweepy and Python to submit tweets via the Twitter API. You have learned to authenticate against the Twitter API and submit tweets containing text and media files.

Understanding how to interact with Twitter via the API is essential when creating a Twitter bot. I have written another article about creating a Twitter signaling bot that analyzes financial data and tweets about relevant price movements. If you want to learn more about this topic, check out this article on Generating crypto trading signals in Python.


The post Posting Tweets On Twitter using Python and Tweepy appeared first on relataly.com.

Streaming Tweets and Images via the Twitter API in Python

Florian Follonier — Sun, 03 Jan 2021 21:46:47 +0000

Twitter is a rich source of data that can be used to understand current and future trends. Because tweets often include hashtags, they can be easily linked to specific contexts such as political discussions or financial instruments. This makes Twitter a valuable tool for collecting and analyzing data. In this article, we’ll demonstrate how to use Python to access Twitter data via the Twitter API v2. We’ll show how to extract tweets, process them, and use them to gain insights and make predictions. Whether you’re a data scientist, a business analyst, or a social media enthusiast, this tutorial will provide you with the tools you need to work with Twitter data in Python.

This article shows two specific cases:

Example A: Streaming Tweets and Storing the Data in a DataFrame
Example B: Streaming Images for a specific channel and storing them in a local directory

If you are new to APIs, consider first familiarizing yourself with the basics of REST APIs.

The rest of this article is structured as follows: First, we’ll look at how to sign up to use the Twitter API and obtain an authentication token. We will then look at the object model of Twitter and use the security token in our requests to the Twitter API. Then we will turn to the two examples, A & B.

Also: Accessing Remote Data Sources via REST APIs in Python

Twitter data is a playground for data scientists.

Basics of the Twitter API

Twitter data has a variety of applications. For example, we can analyze tweets to discover trends or evaluate sentiment on a topic. Furthermore, images embedded in tweets and hashtags can train image recognition models or validate them. Thus, knowing how to obtain data via the Twitter API can be helpful if you are doing data science.

API Versions

Twitter provides two different API versions: v1.1 and v2. The two versions have their own documentation and are incompatible. While the API v1.1 is still more established, Twitter API v2 offers more options for fetching data from Twitter. For example, it allows tailoring the fields given back with the response, which can be helpful if the goal is to minimize traffic. The Twitter API v2 is currently in early access mode, but it will sooner or later become the new standard. Therefore, I decided to base this tutorial on the newer v2 version.

Overview of Twitter APIs (Source: Twitter)

Twitter API Documentation

When working with the Twitter API v2, it is vital to understand the Twitter object model. According to the API documentation, the basic building block of Twitter is the Tweet object. It acts as the parent of four sub-objects: user, media, poll, and place. The Tweet object has various fields attached, such as the tweet text, created_at, and tweet id. The Twitter API documentation provides a complete list of these root-level fields. The standard API response does not include most of these fields. If we want to retrieve additional fields, we need to specify them in the request.

Each sub-object, in turn, has its own fields, and as with the Tweet object, we specify in the request which of them to return. This article uses the Tweet object and the media object, which contains all the media (e.g., images or videos) that tweets can have attached.

Functioning of the Recent Search Endpoint

In this tutorial, we will be working with the Twitter Recent Search Endpoint in its streaming form. (The code below calls the URL /2/tweets/search/stream, which Twitter's documentation refers to as the filtered stream.) There are also other API endpoints, but covering all of them would go beyond the scope of this article. One notable feature of this endpoint is that we can’t pass a search query along with the GET request. Instead, we first send a POST request with a set of rules that specify which tweets we want to fetch, and only then open the stream with a GET request. To change these rules, we first have to delete the existing ones with a POST request and then pass the new ruleset to the API with another POST request. This procedure may sound complicated, but it gives the user more control over the API.
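The delete-then-add cycle described above boils down to two JSON payloads. The following is a minimal sketch that only builds these payloads and does not call the API. The helper names build_delete_payload and build_add_payload are hypothetical, and the sample rule id is made up for illustration.

```python
import json


def build_delete_payload(current_rules):
    """Return the payload that deletes all existing rules, or None if there are none."""
    if not current_rules or "data" not in current_rules:
        return None
    ids = [rule["id"] for rule in current_rules["data"]]
    return {"delete": {"ids": ids}}


def build_add_payload(rules):
    """Return the payload that registers a new set of rules."""
    return {"add": rules}


# Example of what the rules endpoint might return (made-up id, illustration only)
existing = {"data": [{"id": "111", "value": "cat has:images", "tag": "cat pictures"}]}
new_rules = [{"value": "dog has:images", "tag": "dog pictures"}]

delete_payload = build_delete_payload(existing)
add_payload = build_add_payload(new_rules)
print(json.dumps(delete_payload))
print(json.dumps(add_payload))
```

Both payloads go to the same rules URL via POST, which is why the functions later in this article look so similar.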

Different API Models

We can use the “Recent Search Endpoint” in batch and streaming modes. In batch mode, the endpoint returns a list of tweets once. If the stream option is enabled, the API returns a continuous flow of individual tweets, plus any new tweets as they are published to Twitter. In this way, we can stream and process tweets in (almost) real-time. In this tutorial, we will work with the streaming option enabled.

Stream Mode vs. Batch Mode

Filters

We can limit the tweets and fields that the API includes in the response by specifying parameters. For example, we can let the API know that we want to retrieve tweets with specific keywords or in a certain period or only those tweets with images attached. The API documentation provides a list of all filter parameters.
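Limiting the returned fields comes down to query parameters on the stream URL. As a rough sketch, the hypothetical helper build_stream_url below assembles such a URL with the standard library. The parameter names (tweet.fields, media.fields, expansions) are the ones used later in this article; the helper itself is an assumption and not part of any Twitter library.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.twitter.com/2/tweets/search/stream"


def build_stream_url(tweet_fields=None, expansions=None, media_fields=None):
    """Assemble the stream URL with optional comma-separated field lists."""
    params = {}
    if expansions:
        params["expansions"] = ",".join(expansions)
    if tweet_fields:
        params["tweet.fields"] = ",".join(tweet_fields)
    if media_fields:
        params["media.fields"] = ",".join(media_fields)
    if not params:
        return BASE_URL
    # safe="," keeps the commas in the field lists unescaped
    return BASE_URL + "?" + urlencode(params, safe=",")


url = build_stream_url(
    tweet_fields=["author_id", "created_at"],
    expansions=["attachments.media_keys"],
)
print(url)
```

Building the query string this way avoids having to remember which fragment starts with "?" and which with "&".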

Twitter Search-API Python Examples

To stream tweets from Twitter, you will need to use the Twitter API. The API allows developers to access Twitter’s data and functionality, including the ability to stream real-time tweets. In order to stream tweets, you will need to sign up for a Twitter developer account and obtain the necessary credentials, such as a consumer key and access token. Once you have these credentials, you can use them to authenticate your API requests and access the streaming endpoint for tweets.

Setup a Twitter Developer Account

Using the Twitter API requires you to have your own Twitter developer account. If you don’t have an account yet, you need to create it on the Twitter developer page. As of Jan 2021, the standard developer account is free and comes with a limit of 500,000 tweets that you can fetch per month.

After logging into your developer account, go to the developer dashboard page and create a new project with a name of your choice. Once you have created a project, it will be shown in the “projects dashboard,” along with an overview of your monthly tweet usage. In the next section, you will retrieve your API key from the project.

Obtaining your Twitter API Security Key

The Twitter API accepts our requests only if we provide a personal Bearer token for authentication. Each project has its own Bearer token. You can find the bearer token in the Developer Portal under the Authentication Token section. Note the token down temporarily. In the next step, we will store it in a secure location.

Twitter API Authentication Tokens

Storing and Loading API Tokens

The Twitter API requires the user to authenticate during use by providing a secret token. It is best not to store these keys in your project but to put them separately in a safe place. In a production environment, you would, of course, want to encrypt the keys. However, it should be sufficient to store the key in a separate Python file for our test case.

Create a new Python file called “twitter_secrets.py” and fill in the code below. Then replace the placeholder value with the bearer token you retrieved from the Twitter Developer portal in the previous step.

"""Replace the values below with your own Twitter API Tokens"""

# Twitter Bearer Token
BEARER_TOKEN = "your own BEARER TOKEN"


class TwitterSecrets:
    """Class that holds Twitter Secrets"""

    def __init__(self):
        # the attribute name must match the ts.BEARER_TOKEN lookup used when importing
        self.BEARER_TOKEN = BEARER_TOKEN
        
        # Tests if keys are present
        for key, secret in self.__dict__.items():
            assert secret != "", f"Please provide a valid secret for: {key}"

twitter_secrets = TwitterSecrets()


The twitter_secrets.py file has to go into the package library of your Python environment. If you use Anaconda under Windows, the path is typically: \anaconda3\Lib. Once you have placed the file in your Python library, you can import it into your Python project and use the bearer token from the import, as shown below:

# imports the twitter_secrets python file in which we store the twitter API keys
from twitter_secrets import twitter_secrets as ts

bearer_token = ts.BEARER_TOKEN
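If you would rather not place a file in the package library, a common alternative is to read the token from an environment variable. This is not what this article uses; it is only a minimal sketch. The function load_bearer_token and the variable name DEMO_TWITTER_BEARER_TOKEN are hypothetical.

```python
import os


def load_bearer_token(env_var="DEMO_TWITTER_BEARER_TOKEN"):
    """Read the bearer token from an environment variable and fail loudly if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Please set the environment variable {env_var}")
    return token


# Demo only: a real token would be set outside the script, e.g. in the shell
os.environ["DEMO_TWITTER_BEARER_TOKEN"] = "api123"
bearer_token = load_bearer_token()
```

The advantage is that the secret never ends up in a file that could be committed to a repository by accident.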

Prerequisites

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:

pandas

You can install packages using console commands:

pip install
conda install (if you are using the anaconda package manager)

Example A: Streaming Tweets via the Twitter Recent Search Endpoint

In the first use case, we will first define some simple filter rules and then request tweets from the API based on these rules. As a response, the API returns a stream of tweets, which we will process further. We store the tweet text, along with further tweet information, in a DataFrame.

We won’t detail all the code components, but we will go through the most important functions with inline code. The code is available on the GitHub repository.


Step #1: Define Functions to Interact with the Twitter API

We begin by defining functions to interact with the Twitter API.

import requests 
import json 
import pandas as pd

# imports the twitter_secrets python file in which we store the twitter API keys
from twitter_secrets import twitter_secrets as ts

# a function that provides a bearer token to the API
def create_headers(bearer_token):
    headers = {"Authorization": "Bearer {}".format(bearer_token)}
    return headers

# this function defines the rules on what tweets to pull    
def set_rules(headers, delete, bearer_token, rules):
    payload = {"add": rules}
    response = requests.post(
        "https://api.twitter.com/2/tweets/search/stream/rules",
        headers=headers,
        json=payload,
    )
    if response.status_code != 201:
        raise Exception(
            "Cannot add rules (HTTP {}): {}".format(response.status_code, response.text)
        )
    print(json.dumps(response.json()))
    
# this function requests the current rules in place
def get_rules(headers, bearer_token):
    response = requests.get(
        "https://api.twitter.com/2/tweets/search/stream/rules", headers=headers
    )
    if response.status_code != 200:
        raise Exception(
            "Cannot get rules (HTTP {}): {}".format(response.status_code, response.text)
        )
    print(json.dumps(response.json()))
    return response.json()

# this function resets all rules
def delete_all_rules(headers, bearer_token, rules):
    if rules is None or "data" not in rules:
        return None

    ids = list(map(lambda rule: rule["id"], rules["data"]))
    payload = {"delete": {"ids": ids}}
    response = requests.post(
        "https://api.twitter.com/2/tweets/search/stream/rules",
        headers=headers,
        json=payload
    )
    if response.status_code != 200:
        raise Exception(
            "Cannot delete rules (HTTP {}): {}".format(
                response.status_code, response.text
            )
        )
    print(json.dumps(response.json()))

# this function starts the stream and stops after max_results tweets
def get_stream(headers, set, bearer_token, expansions, fields, save_to_disk, save_path):
    response = requests.get(
        "https://api.twitter.com/2/tweets/search/stream" + expansions + fields, headers=headers, stream=True,
    )
    print(response.status_code)
    if response.status_code != 200:
        raise Exception(
            "Cannot get stream (HTTP {}): {}".format(
                response.status_code, response.text
            )
        )
    i = 0
    for response_line in response.iter_lines():
        # the stream sends empty keep-alive lines, which we skip
        if not response_line:
            continue
        i += 1
        try:
            # decoding happens inside the try block so that a bad line is skipped
            json_response = json.loads(response_line)
            #print(json.dumps(json_response, indent=4, sort_keys=True))
            save_tweets(json_response)
            if save_to_disk == True:
                save_media_to_disk(json_response, save_path)
        except (json.JSONDecodeError, KeyError) as err:
            # In case the JSON fails to decode or a field is missing, we skip this tweet
            print(f"{i}/{max_results}: ERROR: encountered a problem with a line of data... \n")
        # stop once max_results lines have been processed
        if i >= max_results:
            break

# this function appends the relevant fields of a tweet to the global tweet_list
def save_tweets(tweet):
    print(json.dumps(tweet, indent=4, sort_keys=True))
    data = tweet['data']
    public_metrics = data['public_metrics']
    tweet_list.append([data['id'], data['author_id'], data['created_at'], data['text'], public_metrics['like_count']])

Next, we subscribe to a stream of tweets. Once you have subscribed to the stream, you can process the received tweets as needed, such as by filtering or storing them for further analysis.
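To make the processing step more concrete, here is a small sketch of how a single line from the stream could be turned into a row for the DataFrame. The function name tweet_to_row and the sample line are assumptions made for illustration; the field names follow the tweet fields requested in this example.

```python
import json

# A made-up line in the shape the stream returns when the tweet fields
# author_id, created_at and public_metrics are requested
sample_line = (
    b'{"data": {"id": "1", "author_id": "42", '
    b'"created_at": "2021-01-03T21:46:47.000Z", "text": "my dog", '
    b'"public_metrics": {"like_count": 3}}}'
)


def tweet_to_row(line):
    """Return [id, author_id, created_at, text, like_count], or None if the line is unusable."""
    if not line:
        # empty keep-alive line
        return None
    try:
        data = json.loads(line)["data"]
        return [
            data["id"],
            data["author_id"],
            data["created_at"],
            data["text"],
            data["public_metrics"]["like_count"],
        ]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None


row = tweet_to_row(sample_line)
print(row)
```

Returning None for unusable lines keeps the decision about skipping or logging in the calling loop.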

In this example, we simply collect the tweets in a list and convert it into a DataFrame. Tweets may have media files attached. Saving these files to disk requires the save_media_to_disk function that we define in Example B, so we leave the save_media_to_disk flag set to “False” here.

# the max number of tweets that will be returned
max_results = 20

# save to disk
save_media_to_disk = False
save_path = ""

# You can adjust the rules if needed
# the language filter is part of the rule value (lang: operator)
search_rules = [
    {"value": "dog has:images lang:en", "tag": "dog pictures"},
    {"value": "cat has:images -grumpy lang:en", "tag": "cat pictures"},
]
tweet_fields = "?tweet.fields=attachments,author_id,created_at,public_metrics"
expansions = ""
tweet_list = []


bearer_token = ts.BEARER_TOKEN
headers = create_headers(bearer_token)
rules = get_rules(headers, bearer_token)
delete = delete_all_rules(headers, bearer_token, rules)
set = set_rules(headers, delete, bearer_token, search_rules)
get_stream(headers, set, bearer_token, expansions, tweet_fields, save_media_to_disk, save_path)

df = pd.DataFrame (tweet_list, columns = ['tweetid', 'author_id' , 'created_at', 'text', 'like_count'])
df

Example B: Streaming Images from Twitter to Disk

The second use case is streaming image data from Twitter. Twitter images are useful in various machine learning use cases, e.g., training models for image recognition and classification.

To be able to use the images later, we save them directly to our local drive. To do this, we reuse several functions from the first use case and add some functions for creating the folder structure in which we then store the images. The code for this example is also available in the GitHub repository.


Step #1: Define Functions to Interact with the Twitter API

import requests 
import json 
import pandas as pd
import urllib.request
import os
from os import path
from datetime import datetime as dt

# imports the twitter_secrets python file in which we store the twitter API keys
from twitter_secrets import twitter_secrets as ts

def create_headers(bearer_token):
    headers = {"Authorization": "Bearer {}".format(bearer_token)}
    return headers
        
def set_rules(headers, delete, bearer_token, rules):
    payload = {"add": rules}
    response = requests.post(
        "https://api.twitter.com/2/tweets/search/stream/rules",
        headers=headers,
        json=payload,
    )
    if response.status_code != 201:
        raise Exception(
            "Cannot add rules (HTTP {}): {}".format(response.status_code, response.text)
        )
    print(json.dumps(response.json()))
    
def get_rules(headers, bearer_token):
    response = requests.get(
        "https://api.twitter.com/2/tweets/search/stream/rules", headers=headers
    )
    if response.status_code != 200:
        raise Exception(
            "Cannot get rules (HTTP {}): {}".format(response.status_code, response.text)
        )
    print(json.dumps(response.json()))
    return response.json()

def delete_all_rules(headers, bearer_token, rules):
    if rules is None or "data" not in rules:
        return None

    ids = list(map(lambda rule: rule["id"], rules["data"]))
    payload = {"delete": {"ids": ids}}
    response = requests.post(
        "https://api.twitter.com/2/tweets/search/stream/rules",
        headers=headers,
        json=payload
    )
    if response.status_code != 200:
        raise Exception(
            "Cannot delete rules (HTTP {}): {}".format(
                response.status_code, response.text
            )
        )
    print(json.dumps(response.json()))

def get_stream(headers, set, bearer_token, expansions, fields, save_to_disk, save_path):
    response = requests.get(
        "https://api.twitter.com/2/tweets/search/stream" + expansions + fields, headers=headers, stream=True,
    )
    print(response.status_code)
    if response.status_code != 200:
        raise Exception(
            "Cannot get stream (HTTP {}): {}".format(
                response.status_code, response.text
            )
        )
    i = 0
    for response_line in response.iter_lines():
        # the stream sends empty keep-alive lines, which we skip
        if not response_line:
            continue
        i += 1
        try:
            # decoding happens inside the try block so that a bad line is skipped
            json_response = json.loads(response_line)
            #print(json.dumps(json_response, indent=4, sort_keys=True))
            save_tweets(json_response)
            if save_to_disk == True:
                save_media_to_disk(json_response, save_path)
        except (json.JSONDecodeError, KeyError) as err:
            # In case the JSON fails to decode or a field is missing, we skip this tweet
            print(f"{i}/{max_results}: ERROR: encountered a problem with a line of data... \n")
        # stop once max_results lines have been processed
        if i >= max_results:
            break
                
def save_tweets(tweet):
    #print(json.dumps(tweet, indent=4, sort_keys=True))
    data = tweet['data']
    includes = tweet['includes']
    media = includes['media']
    for line in media:
        tweet_list.append([data['id'], line['url']])  
        
def save_media_to_disk(tweet, save_path):
    includes = tweet['includes']
    media = includes['media']
    for line in media:
        media_url = line['url']
        media_key = line['media_key']
        file_path = save_path + "/" + media_key + ".jpg"
        try:
            # download the image and write it to the run directory
            pic = urllib.request.urlopen(media_url)
            with open(file_path, 'wb') as localFile:
                localFile.write(pic.read())
            # the tweet id and URL are already recorded by save_tweets
        except Exception as e:
            print('exception when saving media url ' + media_url + ' to path: ' + file_path + ': ' + str(e))
            if path.exists(file_path):
                print("path exists")
    
def createDir(save_path):
    try:
        os.makedirs(save_path)
    except OSError:
        print ("Creation of the directory %s failed" % save_path)
        if path.exists(save_path):
            print("directory already exists")
    else:
        print ("Successfully created the directory %s " % save_path)

Step #2: Define the Folder Structure to Store the Images

We want to store images contained in tweets on disk. To find these images again afterward, we create a new directory for each run.

# save to disk
save_to_disk = True
 
if save_to_disk == True: 
    # detect the current working directory and print it
    base_path = os.getcwd()
    print ("The current working directory is %s" % base_path)
    img_dir = '/twitter/downloaded_media/'
    # the write path in which the data will be stored. If it does not yet exist, it will be created
    now = dt.now()
    dt_string = now.strftime("%d%m%Y-%H%M%S")# ddmmYYYY-HHMMSS
    save_path = base_path + img_dir + dt_string
    createDir(save_path)
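As a variation on the snippet above, the same per-run folder can be created with pathlib, which handles path separators and already existing directories for us. This is only a sketch: the function make_run_dir is hypothetical, and the demo writes into a temporary directory instead of the working directory.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def make_run_dir(base_dir, now=None):
    """Create <base_dir>/twitter/downloaded_media/<timestamp> and return its path."""
    now = now or datetime.now()
    run_dir = Path(base_dir) / "twitter" / "downloaded_media" / now.strftime("%d%m%Y-%H%M%S")
    # parents=True creates missing parent folders, exist_ok=True tolerates reruns
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir


# Demo in a temporary directory with a fixed timestamp
with tempfile.TemporaryDirectory() as tmp:
    run_dir = make_run_dir(tmp, now=datetime(2021, 1, 3, 21, 46, 47))
    created = run_dir.is_dir()
    run_name = run_dir.name
print(run_name)
```

In the tutorial code, you would pass os.getcwd() as base_dir and convert the result with str() before handing it to the other functions.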

Finally, we call the Twitter API and subscribe to the Streaming Service. We store the tweet id and the preview image URL in a DataFrame (df).

# the max number of tweets that will be returned
max_results = 10

# You can adjust the rules if needed
# the language filter is part of the rule value (lang: operator)
search_rules = [
    {"value": "dog has:images lang:en", "tag": "dog pictures"},
]

media_fields = "&media.fields=duration_ms,height,media_key,preview_image_url,public_metrics,type,url,width"
expansions = "?expansions=attachments.media_keys"
tweet_list = []

bearer_token = ts.BEARER_TOKEN
headers = create_headers(bearer_token)
rules = get_rules(headers, bearer_token)
delete = delete_all_rules(headers, bearer_token, rules)
set = set_rules(headers, delete, bearer_token, search_rules)
get_stream(headers, set, bearer_token, expansions, media_fields, save_to_disk, save_path)

df = pd.DataFrame (tweet_list, columns = ['tweetid', 'preview_image_url'])
df
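The save_media_to_disk function above stores every download with a “.jpg” suffix. If you want to keep the original file type, one option is to derive the extension from the media URL. The helper media_file_name below is a hypothetical sketch, and the URLs in the demo are made up.

```python
import os
from urllib.parse import urlparse


def media_file_name(media_key, media_url, default_ext=".jpg"):
    """Build a file name from the media key and the extension found in the URL path."""
    ext = os.path.splitext(urlparse(media_url).path)[1]
    # fall back to the default when the URL path carries no extension
    return media_key + (ext or default_ext)


png_name = media_file_name("3_123", "https://example.com/media/abc.png")
fallback_name = media_file_name("3_456", "https://example.com/media/abc?format=jpg")
print(png_name, fallback_name)
```

Parsing the URL first ensures that a query string does not end up in the file name.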

Summary

In this tutorial, you learned how to stream and process Twitter data in near real-time using the Twitter API v2 with two use cases. The first use case showed how to request tweet text and store it in a DataFrame. In the second case, we streamed images and saved them to a local directory. There are many more ways to interact with the Twitter API, but it’s already possible to implement some exciting projects based on these two cases.

If you liked this post, leave a comment. And if you want to learn more about using the Twitter API with Python, consider checking out my other articles.

Sources and Further Reading

https://developer.twitter.com/en/docs/twitter-api A part of the presented Python code stems from the Twitter API documentation and has been modified to fit the purpose of this article.

The post Streaming Tweets and Images via the Twitter API in Python appeared first on relataly.com.

Training a Sentiment Classifier with Naive Bayes and Logistic Regression in Python

Florian Follonier — Sat, 20 Jun 2020 21:49:05 +0000

Are you ready to learn about the exciting world of social media sentiment analysis using Python? In this article, we’ll dive into how companies are leveraging machine learning to extract insights from Twitter comments, and how you can do the same. By comparing two popular classification models – Naive Bayes and Logistic Regression – we’ll help you identify which one best fits your needs.

Businesses are using sentiment analysis to make better sense of the vast amounts of data available online and on social media platforms. Understanding customer opinions and feedback can help companies identify trends and make more informed decisions. Whether you’re a business professional looking to leverage the power of social media data or a machine learning enthusiast, this article has everything you need to get started.

We’ll begin with an introduction to the concept of sentiment analysis and its theoretical foundations. Then, we’ll guide you through the practical steps of implementing a sentiment classifier in Python. Our model will analyze text snippets and categorize them into one of three sentiment categories: “positive,” “neutral,” or “negative.” Finally, we’ll compare the performance of Naive Bayes and Logistic Regression classifiers.

By the end of this article, you’ll have the skills and knowledge to perform sentiment analysis on social media data and apply these insights to your business or personal projects. So let’s jump right in!

Sentiment analysis has various use cases from analyzing social media to reviewing customer feedback in call centers.

Also: Classifying Purchase Intention of Online Shoppers with Python

What is Sentiment Analysis?

Sentiment analysis is the process of identifying the sentiment, or emotional tone, of a piece of text. This can be useful for a wide range of applications, such as identifying customer sentiment towards a product or service, or detecting the overall sentiment of a social media post or news article.

Sentiment analysis is typically performed using natural language processing (NLP) techniques and machine learning algorithms. These tools allow computers to “understand” the meaning of text and identify the sentiment it contains. Sentiment analysis can be performed at various levels of granularity, from identifying the sentiment of an entire document to identifying the sentiment of individual words or phrases within a document.

How Sentiment Classification Works

A sentiment classifier with three classes

There are many different approaches to sentiment analysis, and the specific methods used can vary depending on the specific application and the type of text being analyzed. Some common techniques for performing sentiment analysis include using machine learning algorithms to classify text as positive, negative, or neutral, and using lexicons, or lists of words with pre-defined sentiment, to identify the sentiment of individual words or phrases. In this way, it is possible to measure the emotions towards a specific topic, e.g., products, brands, political parties, services, or trends.

We can show how sentiment analysis works with a simple example:

“This product is excellent!”
“I don’t like this ice cream at all.”
“Yesterday, I saw a dolphin.”

While the first sentence denotes a positive sentiment, the second sentence is negative, and in the third sentence, the sentiment is neutral. A sentiment classifier can automatically label these sentences:

Text Sequence	Sentiment Label
This product is excellent!	POSITIVE
I don’t like this ice cream at all.	NEGATIVE
Yesterday, I saw a dolphin.	NEUTRAL

Sentiment Labels of Text Sequences

Predicting sentiment classes opens the door to more advanced statistical analysis and automated text processing.
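To illustrate the lexicon approach mentioned above, the following sketch labels sentences like the three examples with a tiny hand-made word list. The lexicon, the negation handling, and the function label_sentiment are simplifying assumptions; this is not the classifier we train later in this article.

```python
import re

# Tiny hand-made lexicon (illustration only)
POSITIVE = {"excellent", "great", "good", "like", "love"}
NEGATIVE = {"hate", "dislike", "bad", "terrible"}
NEGATIONS = {"not", "don't", "never"}


def label_sentiment(text):
    """Sum the polarity of known words; a negation flips the next sentiment word."""
    tokens = re.findall(r"[a-z']+", text.lower().replace("\u2019", "'"))
    score = 0
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True
        elif token in POSITIVE or token in NEGATIVE:
            polarity = 1 if token in POSITIVE else -1
            score += -polarity if negate else polarity
            negate = False
    if score > 0:
        return "POSITIVE"
    if score < 0:
        return "NEGATIVE"
    return "NEUTRAL"


for sentence in [
    "This product is excellent!",
    "I don't like this ice cream at all.",
    "Yesterday, I saw a dolphin.",
]:
    print(sentence, "->", label_sentiment(sentence))
```

A fixed word list like this breaks down quickly on real text, which is one reason why we train a classifier on labeled data instead.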

Use Cases for Sentiment Analysis

Sentiment analysis is used in various application domains, including the following:

Sentiment analysis can lead to more efficient customer service by prioritizing customer requests. For example, when customers complain about services or products, an algorithm can identify and prioritize these messages so that sales agents answer them first. This can increase customer satisfaction and reduce the churn rate.
Twitter and Amazon reviews have become the first port of call for many customers today when exchanging information about products, brands, and trends or expressing their own opinions. A sentiment classifier enables businesses to evaluate this information systematically. It can collect data from social media posts and product reviews in real-time. For example, marketing managers can quickly obtain feedback on how well customers perceive campaigns and ads.
In stock market prediction, analysts evaluate the sentiment of social media or news feeds towards stocks or brands. The sentiment is then used as an additional feature alongside price data to create better forecasting models. Some forecasting approaches also rely exclusively on sentiment.

Sentiment Analysis will find further adoption in the coming years. Especially in marketing and customer service, companies will increasingly use sentiment analysis to automate business processes and offer their customers a better customer experience.

How Sentiment Analysis Works: Feature Modelling

An essential step in the development of the Sentiment Classifier is language modeling. Before we can train a machine learning model, we need to bring the natural text into a structured format that the model can statistically assess in the training process. Various modeling techniques exist for this purpose. The two most common models are bag-of-words and n-grams.

Also: 9 Powerful Applications of OpenAI’s ChatGPT and Davinci

Bag-of-word Model

The bag-of-word model counts how often each unique word occurs in a text. This approach converts individual words into individual features. Filler words with low predictive power, such as “the” or “a,” will be filtered out. Consider the following text sample:

“Bob likes to play basketball. But his friend Daniel prefers to play soccer.”

By filtering out the filler words, we convert this sample to:

“Bob”, “likes”, “play”, “basketball”, “friend”, “Daniel”, “prefers”, “play”, “soccer”.

In the next step, the algorithm converts these words into a normalized form, where each word becomes a column:

Text sample after transformation

The bag-of-word model is easy to implement. However, it does not consider grammar or word order.
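The transformation can be sketched in a few lines of plain Python. The stop-word list below is an assumption chosen to fit the sample sentence; libraries for text processing ship much longer lists.

```python
import re
from collections import Counter

# Minimal stop-word list for this sample (assumption, not a complete list)
STOP_WORDS = {"to", "but", "his", "the", "a"}


def bag_of_words(text):
    """Count every word that is not a stop word; word order is discarded."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return Counter(token for token in tokens if token.lower() not in STOP_WORDS)


sample = "Bob likes to play basketball. But his friend Daniel prefers to play soccer."
bag = bag_of_words(sample)
print(bag)
```

Each key of the counter corresponds to one column in the transformed sample, and the count is the value in that column.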

What is an N-gram Model?

The n-gram model considers multiple consecutive words in a text sequence and thus captures word sequence. The n stands for the number of words considered.

For example, in a 2-gram model, the sentence “Bob likes to play basketball. But his friend Daniel prefers to play soccer.” will be converted to the following model:

“Bob likes,” “likes to,” “to play,” “play basketball,” and so on. The n-gram model is often used to supplement the bag-of-word model. It is also possible to combine different n-gram models. For a 3-gram model, the text would be converted to “Bob likes to,” “likes to play,” “to play basketball,” and so on. Combining multiple n-gram models, however, can quickly increase model complexity.
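The following sketch generates the 2-grams and 3-grams for the sample text. It assumes that n-grams do not cross sentence boundaries and uses a simple whitespace split; the function name ngrams is our own.

```python
import re


def ngrams(text, n):
    """Return all n-grams of consecutive words, computed sentence by sentence."""
    grams = []
    for sentence in re.split(r"[.!?]", text):
        tokens = sentence.split()
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


sample = "Bob likes to play basketball. But his friend Daniel prefers to play soccer."
bigrams = ngrams(sample, 2)
trigrams = ngrams(sample, 3)
print(bigrams[:4])
print(trigrams[:3])
```

Using both lists as features at the same time is what combining n-gram models means, and it shows how quickly the number of features grows.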

Sentiment Classes and Model Training

The training of sentiment classifiers traditionally takes place in a supervised learning process. For this purpose, a training data set is used, which contains text sections with associated sentiment tendencies as prediction labels. Depending on which labels we provide and the training data, the classifier will learn to predict sentiment on a more or less fine-grained scale. Capturing neutral sentiment requires choosing an odd number of classes.

More advanced classifiers can detect different sorts of emotions and, for example, detect whether someone expresses anger, happiness, sadness, and so on. It basically comes down to which prediction labels you provide with the training data.

When the classifier is trained on a one-gram model, the classifier will learn that certain words such as “good” or “great” increase the probability that a text is associated with a positive sentiment. Consequently, when the classifier encounters these words in a new text sample, it will predict a higher probability of positive sentiment. On the other hand, the classifier will learn that words such as “hate” or “dislike” are often used to express negative opinions and thus increase the probability of negative sentiment.

Language Complications

Is sentiment analysis that simple? Well, not quite. The cases described so far were deliberately chosen to be very simple. However, human language is very complex, and many peculiarities make it more difficult in practice to identify the sentiment in a sentence or paragraph. Here are some examples:

Negations: “this product is not so great.”
Typos: “I live this product!”
Comparisons: “Product a is better than product z.”
Pros and cons in the same text passage: “An advantage is that. But on the other hand…”
Unknown vocabulary: “This product is just whuopii!”
Missing words: “How can you not this product?”
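
The first complication is easy to reproduce. The sketch below uses two made-up reviews with opposite meaning and shows that a one-gram bag-of-word model cannot tell them apart, while 2-grams can. The helper names are assumptions for illustration.

```python
from collections import Counter

def unigram_counts(text):
    """Bag-of-word representation: word counts without any word order."""
    return Counter(text.lower().replace(",", "").split())

def bigram_set(text):
    """Set of all pairs of consecutive words."""
    words = text.lower().replace(",", "").split()
    return {" ".join(words[i:i + 2]) for i in range(len(words) - 1)}

review_a = "this product is great, not bad"
review_b = "this product is bad, not great"

# Both reviews contain exactly the same words, so the bags are identical
print(unigram_counts(review_a) == unigram_counts(review_b))

# The 2-grams differ, for example "not bad" versus "not great"
print(bigram_set(review_a) == bigram_set(review_b))
```

The first comparison prints True and the second prints False. A classifier that only sees one-gram counts receives identical input for both reviews and therefore has to predict the same sentiment for them.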

Fortunately, there are methods to solve the complications mentioned above. I will explain more about them in one of my future articles. But for now, let’s stay with the basics and implement a simple classifier.

Training a Sentiment Classifier Using Twitter Data in Python

Venturing into the practical aspects of sentiment classification, our aim in this tutorial is to create an efficient sentiment classifier. Our focus will be on a dataset provided by Kaggle, comprising tens of thousands of tweets, each categorized as positive, neutral, or negative.

Our objective is to design a classifier capable of assigning one of these three sentiment categories to new text sequences. To this end, we will employ two distinct algorithms – Logistic Regression and Naive Bayes – as our estimators.

The tutorial culminates with a comparative analysis of the prediction performance of both models, followed by a set of test predictions. Through this hands-on approach, you will gain an understanding of the nuances of sentiment classification and its application in understanding public opinion, especially on social media platforms like Twitter.

Boost your sentiment analysis skills with our step-by-step guide, and learn to leverage machine learning tools for precise sentiment prediction.

The code is available on the GitHub repository.

Prerequisites

Before starting the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment, follow this tutorial to set up the Anaconda environment. Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages: pandas, NumPy, and Matplotlib.

In addition, we will be using the machine learning library scikit-learn and seaborn for visualization.

You can install packages using console commands:

pip install <package name>
conda install <package name> (if you are using the Anaconda package manager)

About the Sentiment Dataset

Let’s begin with the technical part. First, we will download the data from the Twitter sentiment example on Kaggle.com. If you are working with the Kaggle Python environment, you can also directly save the data into your Python project.

We will only use the following two CSV files:

train.csv: contains 27480 text samples.
test.csv: contains 3533 text samples for validation purposes

The two files contain four columns:

textID: An identifier
text: The raw text
selected_text: Contains a selected part of the original text
sentiment: Contains the prediction label

We will copy the two files (train.csv and test.csv) into a folder that you can access from your Python environment. For simplicity, I recommend putting these files directly into the folder of your Python notebook. If you put them somewhere else, don’t forget to adjust the file path when loading the data.

Step #1 Load the Data

Assuming that you have copied the files into your Python environment, the next step is to load the data into your Python project and convert it into a Pandas DataFrame. The following code performs these steps and then prints a data summary.

import math 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import matplotlib

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_fscore_support

import seaborn as sns

# Load the train data
train_path = "train.csv"
train_df = pd.read_csv(train_path) 

# Load the test data
sub_test_path = "test.csv"
test_df = pd.read_csv(sub_test_path) 

# Print a Summary of the data
print(train_df.shape, test_df.shape)
print(train_df.head(5))

			textID		text												selected_text									sentiment
0			cb774db0d1	I`d have responded, if I were going					I`d have responded, if I were going				neutral
1			549e992a42	Sooo SAD I will miss you here in San Diego!!!		Sooo SAD										negative
2			088c60f138	my boss is bullying me...							bullying me										negative
3			9642c003ef	what interview! leave me alone						leave me alone									negative
4			358bd9e861	Sons of ****, why couldn`t they put them on t...	Sons of ****,									negative
...	
27481 rows × 4 columns

Step #2 Clean and Preprocess the Data

Next, let’s quickly clean and preprocess the data. First, as a best practice, we will transform the sentiment labels of the train and the test data into numeric values.

In addition, we will add a column in which we store the length of the text samples.

Three-class sentiment scale

# Define Class Integer Values
cleanup_nums = {"sentiment": {"negative": 1, "neutral": 2, "positive": 3}}

# Replace the Classes with Integer Values
# (map() returns an integer column, which scikit-learn expects as labels)
train_df['sentiment'] = train_df['sentiment'].map(cleanup_nums['sentiment'])

# Clean the Test Data
test_df['sentiment'] = test_df['sentiment'].map(cleanup_nums['sentiment'])

# Create a Feature based on Text Length
train_df['text_length'] = train_df['text'].str.len() # Store string length of each sample
train_df = train_df.sort_values(['text_length'], ascending=True)
train_df = train_df.dropna()
train_df

			textID			text			selected_text	sentiment	text_length
14339		5c6abc28a1		ow				ow				2			3.0
26005		0b3fe0ca78		?				?				2			3.0
11524		4105b6a05d		aw				aw				2			3.0
641			5210cc55ae		no				no				2			3.0
25699		ee8ee67cb3		ME				ME				2			3.0
...

Step #3 Explore the Data

It’s always good to check the label distribution for a potential imbalance. We do this by plotting the distribution of labels in the text samples. This is important because it helps ensure that the trained model can make accurate predictions on new data. If the class labels are unbalanced, then the model is more likely to be biased toward the more common classes, which can lead to poor performance on less common classes.

# Print the Distribution of Sentiment Labels
sns.set_theme(style="whitegrid")
ax = train_df['sentiment'].value_counts(sort=False).plot(kind='barh', color='b')
ax.set_xlabel('Count')
ax.set_ylabel('Labels')

As we can see, our data is a bit imbalanced, but the differences are still within an acceptable range.
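
If you prefer numbers over a plot, you can also quantify the imbalance. The sketch below uses a short, made-up label column instead of the real training data; with the tutorial data, you would apply the same calls to train_df['sentiment'].

```python
import pandas as pd

# Made-up labels for illustration: 1 = negative, 2 = neutral, 3 = positive
labels = pd.Series([1, 2, 2, 3, 2, 1, 3, 2], name="sentiment")

# Share of each class in the data
share = labels.value_counts(normalize=True).sort_index()
print(share)

# Ratio between the most and the least frequent class
imbalance_ratio = share.max() / share.min()
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

A ratio close to 1 means the classes are balanced. There is no fixed threshold for what counts as acceptable, but the larger the ratio, the more attention the rare classes need during training and evaluation.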

Let’s also quickly take a look at the distribution of text length.

# Visualize a distribution of text_length
sns.histplot(data=train_df, x='text_length', bins='auto', color='darkblue');
plt.title('Text Length Distribution')

Step #4 Train a Sentiment Classifier

Next, we will prepare the data and train a classification model. We will use the pipeline class of the scikit-learn framework and a bag-of-word model to keep things simple. In NLP, we typically have to transform and split up the text into sentences and words. The pipeline class is thus instrumental in NLP because it allows us to perform multiple actions on the same data in a row.

The pipeline contains transformation activities and a prediction algorithm, the final estimator. In the following, we create two pipelines that use two different prediction algorithms:

Logistic Regression
Naive Bayes

4a) Sentiment Classification using Logistic Regression

The first model that we will train uses the logistic regression algorithm. We create a new pipeline. Then we add two transformers and the logistic regression estimator. The pipeline will perform the following activities.

CountVectorizer: The vectorizer counts the number of words in each text sequence and creates the bag-of-word models.
TfidfTransformer: The TF-IDF (term frequency-inverse document frequency) transformer scales down the impact of words that occur very often in the training data and are thus less informative for the estimator than words that occur in a smaller fraction of the text samples. Examples are words such as “to” or “a.”
Logistic Regression: With the liblinear solver, setting multi_class to ‘auto’ makes scikit-learn use logistic regression in a one-vs-rest approach. This approach splits our three-class prediction problem into three two-class problems, one per class. Each binary model differentiates between its own class and all other classes, and the final prediction is the class whose model returns the highest score.

Our pipeline will transform the data and fit the logistic regression model to the training data. After executing the pipeline, we will directly evaluate the model’s performance. We will do this by defining a function that generates predictions on the test dataset and then evaluating the performance of our model. The function will print the performance results and store them in a dataframe. Later, when we want to compare the models, we can access the results from the dataframe.

# Create a transformation pipeline
# The pipeline sequentially applies a list of transforms and as a final estimator logistic regression 
pipeline_log = Pipeline([
                ('count', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('clf', LogisticRegression(solver='liblinear', multi_class='auto')),
        ])

# Train model using the created sklearn pipeline
model_name = 'logistic regression classifier'
model_lgr = pipeline_log.fit(train_df['text'], train_df['sentiment'])

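from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

def evaluate_results(model, test_df):
    # Predict class labels for the test data
    test_df['pred'] = model.predict(test_df['text'])
    y_true = test_df['sentiment']
    y_pred = test_df['pred']
    target_names = ['negative', 'neutral', 'positive']

    # Print the classification report (precision, recall, and F1 per class)
    results_log = classification_report(y_true, y_pred, target_names=target_names, output_dict=True)
    results_df_log = pd.DataFrame(results_log).transpose()
    print(results_df_log)

    # Plot the confusion matrix (rows: actual labels, columns: predictions)
    matrix = confusion_matrix(y_true, y_pred)
    plt.figure()
    sns.heatmap(pd.DataFrame(matrix, index=target_names, columns=target_names), 
                annot=True, fmt="d", linewidths=.5, cmap="YlGnBu")
    plt.xlabel('Predictions')
    plt.ylabel('Actual')
    plt.show()
    
    # Macro-averaged scores, returned in the order (f1, precision, recall)
    precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='macro')
    return f1, precision, recall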

    
# Evaluate model performance
model_score = evaluate_results(model_lgr, test_df)
# DataFrame.append() was removed in pandas 2.0, so we build the frame from a list of records
performance_df = pd.DataFrame([{'model_name': model_name, 
                                'f1_score': model_score[0], 
                                'precision': model_score[1], 
                                'recall': model_score[2]}])

4b) Sentiment Classification using Naive Bayes

We will reuse the code from the last step to create another pipeline. However, we will exchange the logistic regression estimator with Naive Bayes (“MultinomialNB”). Naive Bayes is commonly used in natural language processing. The algorithm calculates the probability of each class for a text sequence and then outputs the class with the highest probability. For example, the words “likes” and “good” are more likely to appear in texts within the category “positive sentiment” than in texts within the “negative” or “neutral” categories. In this way, the model predicts how likely it is for an unknown text that contains those words to be associated with each category.
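
The following sketch illustrates this behavior on a handful of made-up sentences. It is not the pipeline used in this tutorial (it skips the TF-IDF step to keep the numbers easy to follow), and all texts, labels, and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy data with two samples per class
toy_texts = ["I like this, good job", "good food, I like it",
             "bad service", "this is bad, I hate it",
             "the store opens at nine", "it is a tree"]
toy_labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

toy_bayes = Pipeline([
    ("count", CountVectorizer()),
    ("nb", MultinomialNB()),
]).fit(toy_texts, toy_labels)

phrase = "good, I like it"
probabilities = toy_bayes.predict_proba([phrase])[0]
predicted = toy_bayes.predict([phrase])[0]

# One probability per class; the prediction is the class with the highest value
for label, probability in zip(toy_bayes.classes_, probabilities):
    print(f"{label:<10}{probability:.2f}")
print("Prediction:", predicted)
```

Because “good” and “like” only occur in the positive training samples, the positive class receives by far the highest probability for the new phrase.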

We will reuse the previously defined function to print a classification report and plot the results in a confusion matrix.

# Create a pipeline which transforms phrases into normalized feature vectors and uses a bayes estimator
model_name = 'bayes classifier'

pipeline_bayes = Pipeline([
                ('count', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('gnb', MultinomialNB()),
                ])

# Train model using the created sklearn pipeline
model_bayes = pipeline_bayes.fit(train_df['text'], train_df['sentiment'])

# Evaluate model performance
model_score = evaluate_results(model_bayes, test_df)
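# DataFrame.append() was removed in pandas 2.0, so we use pd.concat() instead
performance_df = pd.concat([performance_df, 
                            pd.DataFrame([{'model_name': model_name, 
                                           'f1_score': model_score[0], 
                                           'precision': model_score[1], 
                                           'recall': model_score[2]}])], ignore_index=True)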

Step #5 Measuring Multi-class Performance

So which classifier achieved better performance? It’s not so easy to say because it depends on the metrics. We will compare the classification performance of our two classifiers using the following metrics:

Accuracy is calculated as the ratio between correctly predicted observations and total observations.
Precision is calculated as the ratio between the correctly predicted positive observations (true positives) and all observations predicted as positive (true positives plus false positives).
Recall is calculated as the ratio between the correctly predicted positive observations and all observations that are actually positive (true positives plus false negatives).
F1-Score is the harmonic mean of precision and recall, so it takes both false positives and false negatives into account. It is, therefore, useful when you have an unequal class distribution.
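
The four metrics can be computed by hand in a few lines. The sketch below uses ten made-up labels and predictions and treats class 3 (positive sentiment) as the positive class; the data and variable names are assumptions for illustration.

```python
# Made-up labels: 1 = negative, 2 = neutral, 3 = positive
y_true = [3, 3, 3, 3, 1, 2, 2, 1, 2, 1]
y_pred = [3, 3, 3, 2, 3, 2, 2, 1, 3, 1]
positive_class = 3

pairs = list(zip(y_true, y_pred))
tp = sum(t == positive_class and p == positive_class for t, p in pairs)  # true positives
fp = sum(t != positive_class and p == positive_class for t, p in pairs)  # false positives
fn = sum(t == positive_class and p != positive_class for t, p in pairs)  # false negatives

accuracy = sum(t == p for t, p in pairs) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this example, three of the five observations predicted as positive are correct (precision 0.60), and three of the four observations that are actually positive were found (recall 0.75).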

You may wonder which of our three classes is the positive class. The answer is that we have to determine the positive class ourselves. By defining the positive class, we can consider that some classes may be more important than others. The other classes will then be counted as negative. You can see this in the classification reports printed in steps 4a and 4b, which contain separate metrics for each label.

Another option is to average the per-class metrics. The weighted average (see classification report) weights each class by its number of observations in the dataset. For example, the neutral label is weighted a bit higher than the negative and positive labels because fewer observations with negative and positive labels are present in the data. The macro average, in contrast, gives every class the same weight. Because our classes are equally important, the code in this tutorial uses the macro average.
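
The difference between the two averaging strategies becomes clear in a small calculation. The sketch below computes the recall per class for ten made-up, imbalanced labels and then averages it in both ways. The data and the helper name class_recall are assumptions for illustration.

```python
# Made-up, imbalanced labels: 1 = negative, 2 = neutral, 3 = positive
y_true = [2, 2, 2, 2, 2, 2, 1, 1, 3, 3]
y_pred = [2, 2, 2, 2, 2, 1, 1, 3, 3, 3]
classes = sorted(set(y_true))

def class_recall(label):
    """Share of observations with this true label that were predicted correctly."""
    predictions = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in predictions) / len(predictions)

recalls = {label: class_recall(label) for label in classes}
support = {label: y_true.count(label) for label in classes}

# Macro average: every class counts the same
macro_recall = sum(recalls.values()) / len(classes)

# Weighted average: every class counts according to its number of observations
weighted_recall = sum(recalls[c] * support[c] for c in classes) / len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"macro={macro_recall:.3f} weighted={weighted_recall:.3f} accuracy={accuracy:.3f}")
```

The weighted recall is identical to the accuracy, because weighting each class by its size simply counts all correct predictions. The macro recall is lower here because the weak result on the small negative class counts as much as the result on the large neutral class.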

Step #6 Comparing Model Performance

The following code prints the performance metrics of the two classifiers and then creates a barplot to illustrate the results. Note that recall equals accuracy only for the weighted average; the macro-averaged recall shown here can differ from it.

If you want to learn more about measuring classification performance, check out this article.

# Compare model performance
print(performance_df)

performance_df = performance_df.sort_values('model_name')
fig, ax = plt.subplots(figsize=(12, 4))
tidy = performance_df.melt(id_vars='model_name').rename(columns=str.title)
sns.barplot(y='Model_Name', x='Value', hue='Variable', data=tidy, ax=ax, palette='husl',  linewidth=1, edgecolor="w")
plt.title('Sentiment Classification Performance (Macro)')

So we see that our Logistic Regression model performs slightly better than the Naive Bayes model. Of course, there are still many possibilities to improve the models further. In addition, there are several other methods and algorithms with which the performance could be significantly increased.

Step #7 Make Test Predictions

Finally, we use the logistic regression classifier to generate some test predictions. Feel free to try it out! Change the text in the testphrases list and convince yourself that the classifier works.

testphrases = ['Mondays just suck!', 'I love this product', 'That is a tree', 'Terrible service']

# Map the numeric class labels back to readable names
sentiment_names = {1: 'Negative', 2: 'Neutral', 3: 'Positive'}

for testphrase in testphrases:
    resultx = model_lgr.predict([testphrase]) # use model_bayes for predictions with the other model
    print(testphrase + '-> ' + sentiment_names[resultx[0]])

Mondays just suck!-> Negative
I love this product-> Positive
That is a tree-> Neutral
Terrible service-> Negative

Summary

That’s it! In this tutorial, you have learned to build a simple sentiment classifier that can detect sentiment expressed through text on a three-class scale. We have trained and tested two standard classification algorithms – Logistic Regression and Naive Bayes. Finally, we have compared the performance of the two algorithms and made some test predictions.

The best way to deepen your knowledge of sentiment analysis is to apply it in practice. I thus want to encourage you to use your knowledge by tackling other NLP challenges. For example, you could build a text classifier that assigns text phrases to topic labels such as sports, fashion, cars, technology, etc. If you are still looking for data you can use for such a project, you will find exciting datasets on Kaggle.com.

Let me know if you found this tutorial helpful. I appreciate your feedback!

The post Training a Sentiment Classifier with Naive Bayes and Logistic Regression in Python appeared first on relataly.com.
