Problem:

Before a single Texas Lottery scratch-off ticket is sold, the probability of a winning ticket is known.  Unfortunately, as the game progresses and tickets (winning and losing) are purchased, the probability that a customer purchases a winning ticket changes.  Using information openly published by the Texas Lottery Commission, we may be able to get a better idea of the current probability of a winning ticket for every Texas Lottery scratch-off game.

Background:

For this analysis we will use some basic probability.  In general, the number of winning tickets divided by the total number of available tickets is the probability of a ticket being a winner.  But unless we are the first person to buy a scratch-off, we don't want the overall probability at the start of the game.  The vast majority of players want to know the probability for the ticket they are about to purchase.  We will denote this as P(T), where T is the number of the ticket being purchased.  So, if we have N tickets at the start of the game, W of them winners, the probability that the first customer, buying 1 ticket, wins is:

When T=1, P(1) = W / N

Now we have N-1 tickets in play, and our analysis must take into account the previous results:

1) If the first ticket was a winner, we now have N-1 tickets in play, as well as W-1 winning tickets in play.  We can compute P(2) as:

P(2) = (W - 1) / (N - 1)

2) If the first ticket was a losing ticket, we now have N-1 tickets in play, but we still have W winning tickets so we can compute P(2) as:

P(2) = W / (N - 1)

Regardless of the first ticket's outcome, only N-1 tickets remain in play; what changes is the number of winning tickets left.

If you run a few more examples by hand, you'll see that, in general:

P(T) = (W - # of winning tickets played) / (N - # of losing tickets played - # of winning tickets played)
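The formula above can be checked with a small Python sketch.  The function name win_probability and the sample numbers (N = 10, W = 3) are illustrative assumptions, not figures from any real game.

```python
from fractions import Fraction


def win_probability(total_tickets, total_winners, winners_played, losers_played):
    """P(T) for the next ticket, given the tickets already played."""
    remaining_winners = total_winners - winners_played
    remaining_tickets = total_tickets - winners_played - losers_played
    if remaining_winners < 0 or remaining_tickets <= 0:
        raise ValueError("inconsistent counts or no tickets left in play")
    return Fraction(remaining_winners, remaining_tickets)


# Hypothetical game: N = 10 tickets, W = 3 winners.
N, W = 10, 3
p1 = win_probability(N, W, 0, 0)             # first ticket: W / N
p2_after_win = win_probability(N, W, 1, 0)   # (W - 1) / (N - 1)
p2_after_loss = win_probability(N, W, 0, 1)  # W / (N - 1)
print(p1, p2_after_win, p2_after_loss)       # prints: 3/10 2/9 1/3
```

Exact fractions are used here only so the results line up with the hand calculations; floats would work just as well for real data.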

Now that we have a general formula for the probability of purchasing a winning Texas scratch-off ticket, let's try to optimize our playing strategy.

Winning Strategies

First off, we want to win.  It's intuitive to see that the more winning tickets we buy, the more money we will win.  The more losing tickets we buy, the more money we will lose.  For this game, we want to buy more winning tickets than losing tickets.  Knowing the probability of a winning ticket allows us to theoretically pick a 'better' ticket.  Therefore, without any other information, we will want to purchase tickets with a higher probability of winning.

Assumptions

When we talk about the probability of purchasing a winning lottery ticket, we make a few dangerous assumptions.  First, we define two types of tickets: winning and losing.  Since the lowest prize is generally (there are definitely exceptions to this rule) the face value of a ticket, any ticket that wins will at least get us our money back.  I would consider this a win: we haven't gained any money, but we haven't lost anything either.

We then have to assume that those tickets are evenly distributed throughout the state of Texas.  To prove this we would have to know empirically where every winning scratch-off ticket was purchased.  Since we don't have the funds or the ability to purchase all scratch-offs from all vendors in the state, we'll just have to assume that winning and losing tickets are evenly distributed geographically, which makes our local testing relevant.

We also have to assume that all tickets are purchased at the same rate.  This one is hard to gauge, since it effectively ignores all of the marketing and design work that goes into tickets.  The price point and the layout/prize/design all have to be ignored for now.  Based on the data we can collect, we have to say that a ticket is a ticket is a ticket.

Real-time Data
Sorry, but I will no longer be providing real-time TX Lottery data.

Analysis

The CSV file includes one line for each game/prize combination.  For each, it tells us the number of winning tickets already cashed, since those are reported back to the Texas Lottery Commission.  Unfortunately, we do not receive the number of losing tickets purchased, since the trash can doesn't report back to the Texas Lottery Commission.
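As a rough sketch of how such a file could be read, the Python below sums the per-prize rows into per-game totals.  The column names (game, prize, total_winners, cashed_winners) and the sample rows are assumptions made up for illustration; the real file's layout may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample; column names and numbers are invented for illustration.
SAMPLE_CSV = """game,prize,total_winners,cashed_winners
1001,5,1000,400
1001,100,50,20
1002,20,500,450
1002,1000,10,9
"""


def winners_by_game(csv_text):
    """Sum total and cashed winning tickets across prize levels for each game."""
    totals = defaultdict(lambda: {"total_winners": 0, "cashed_winners": 0})
    for row in csv.DictReader(io.StringIO(csv_text)):
        game = totals[row["game"]]
        game["total_winners"] += int(row["total_winners"])
        game["cashed_winners"] += int(row["cashed_winners"])
    return dict(totals)


summary = winners_by_game(SAMPLE_CSV)
for game_id, counts in summary.items():
    print(game_id, counts)
```

Collapsing the prize levels matches the win/lose definition from the Assumptions section, where any prize counts as a win.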

Using our earlier assumptions and the information we can collect, we compute the total number of winning tickets minus the number of cashed winning tickets, divided by the total number of tickets minus the number of cashed winning tickets.  This calculation is missing the subtraction of the losing tickets already played in the denominator, but if we assume that all game tickets are purchased at the same rate, we can effectively ignore this.  Alternatively, we could assume a number of tickets purchased daily and, based on the game's start date, guess how many tickets have been played.
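Both calculations described above can be sketched in Python.  The function names, the sample games, and the figure of 100 tickets sold per day are all hypothetical; only the arithmetic comes from the text.

```python
def estimated_win_probability(total_tickets, total_winners, cashed_winners):
    """(W - cashed) / (N - cashed); unreported losing tickets are ignored."""
    remaining_tickets = total_tickets - cashed_winners
    if remaining_tickets <= 0:
        raise ValueError("no tickets left in play")
    return (total_winners - cashed_winners) / remaining_tickets


def estimate_with_daily_sales(total_tickets, total_winners, cashed_winners,
                              days_since_start, tickets_per_day):
    """Alternative: guess tickets played from an assumed daily sales figure."""
    tickets_played = min(days_since_start * tickets_per_day, total_tickets)
    # Never count fewer tickets played than winners already cashed.
    tickets_played = max(tickets_played, cashed_winners)
    remaining_tickets = total_tickets - tickets_played
    if remaining_tickets <= 0:
        raise ValueError("no tickets left in play")
    return (total_winners - cashed_winners) / remaining_tickets


# Hypothetical games: (name, total tickets N, total winners W, cashed winners).
games = [
    ("Game B", 10_000, 2_500, 2_000),
    ("Game A", 10_000, 2_500, 1_000),
]

# Rank games from highest to lowest estimated probability of a winner.
ranked = sorted(
    games,
    key=lambda g: estimated_win_probability(g[1], g[2], g[3]),
    reverse=True,
)
for name, n, w, cashed in ranked:
    print(name, round(estimated_win_probability(n, w, cashed), 4))

# Same Game A numbers, assuming 100 tickets sold per day for 30 days.
daily_estimate = estimate_with_daily_sales(10_000, 2_500, 1_000, 30, 100)
print(round(daily_estimate, 4))
```

The daily-sales version is only as good as the guessed sales rate; a rate that is too high can even push the estimate above 1, which is a sign the guess is inconsistent with the cashed-winner count.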

Flaws

This strategy may be easy to rely on, but it definitely has its flaws.  As of now, we are unable to count losing tickets effectively.  We also don't know the rate of play for each game or prize level/ticket price.  It's possible that the cheaper tickets are sold at a much more rapid rate than the $20 tickets.  That actually might be a safe assumption, but one that we'll have to table for the moment.

Expansion

Later on, I'd like to figure out a way to chart the # of winning tickets cashed over time, which could indicate the rate of ticket sales, maybe.  This again is a flawed assumption, but perhaps looking at all this data together could prove to be useful.
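One possible starting point, sketched in Python: given dated snapshots of the cashed-winner count for a game, compute the average number of winners cashed per day between snapshots.  The dates and counts below are invented for illustration, and treating this rate as a proxy for ticket sales is the same shaky assumption described above.

```python
from datetime import date

# Hypothetical snapshots: (date observed, winning tickets cashed so far).
snapshots = [
    (date(2024, 1, 1), 100),
    (date(2024, 1, 8), 240),
    (date(2024, 1, 15), 310),
]


def cashed_per_day(points):
    """Average winners cashed per day between consecutive snapshots."""
    rates = []
    for (d0, c0), (d1, c1) in zip(points, points[1:]):
        days = (d1 - d0).days
        if days <= 0:
            raise ValueError("snapshots must be in increasing date order")
        rates.append((d1, (c1 - c0) / days))
    return rates


rates = cashed_per_day(snapshots)
for day, rate in rates:
    print(day.isoformat(), rate)
```

These (date, rate) pairs are what would be plotted; a falling rate might suggest a game that is slowing down, though it could just as easily reflect players holding on to winning tickets.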

If users reported when they played losing tickets for each game, our stats would be much better.  Consider a mobile app where users scan a ticket's barcode and report losers.  For their time and energy, they'd be given access to charts and graphs helping them pick better scratch-off games.

Based on previously stored data, we could also look for a trend in game length vs. winning ticket redemptions.

And after compiling all the data, the best game might not be the one with the best probability but a combination of probability and trends.