See also the archived case studies from the Statistical Society of Canada for many interesting (and challenging) datasets from diverse fields at: http://ssc.ca/en/meetings/archived-case-studies
05-29-2015 12:02 PM
Wow, what a treasure-trove!
06-18-2015 01:03 AM
Hi Jennifer, I am a student learning SAS on my own by reading Mr. Cody's book "Learning SAS by Example". The book says that the data sets and programs can be downloaded from http://support.sas.com/cody, but I could not find them there. Can you please help me find them? Thanks, Ali
06-18-2015 09:55 AM
Hi Zuliqar, the link to the data and programs is a bit difficult to find. Here it is: http://support.sas.com/publishing/bbu/zip/62857.zip Comment back here if you're looking for something different!
06-18-2015 02:13 PM
Hee hee hee - lots and lots of data for me to play with!! Thank you so much for posting this!!
12-08-2015 07:43 PM
I came across a blog post containing a list of 20 Big Data Repositories to check out and thought it would be good to add the link here.
12-09-2015 11:06 AM
Wow, @MichelleHomes, this is awesome! Thank you for sharing.
01-21-2016 07:47 AM
Another resource that may be of interest is the Kaggle Datasets collection.
01-21-2016 10:03 AM
This is great, @MichelleHomes! I'll add this to the article above and mention it on the AU board. Got a lead yesterday on a huge stash of Yahoo datasets. We're awash in open data!



@BeverlyBrown is that a challenge? LOL

Wow, I've often wasted my time looking in vain for suitable datasets to use for motivating marketing analysis case studies in my university teaching, but prospecting through this collection has really got me some great hits in a fairly short time. I'm really impressed and grateful to SAS and the community for curating this. Well Done (SAS) U!
03-04-2016 01:00 PM
I appreciate that affirmation, @Damien_Mather! Credit my predecessor, @jennifers_sas, who started this stash. Here's a collection I just ran across, with data sets organized topically: government, sports, health, weather, space, etc. Keep using it, and we'll keep adding to it!

Hi Free Data Fans!
I realised I had a great repository of links bookmarked (many of them ones people have posted here) and I wanted to add to the collection. I've tried to make sure I haven't duplicated any of the ones already mentioned. Enjoy!
http://open.canada.ca/en - Canadian government’s open data sets; I’ve been using these for a while and I'm always impressed with the huge variety.
Open Data Toronto – Born and raised in the city of Toronto, Ontario, this is definitely a treasure trove of great stuff. Some aggregated / summarized, some raw, I’ve been using many of these in the Free Data Friday series and in numerous presentations I’ve given.
http://datausa.io - More focused on the visualization of data, but there are some pretty interesting datasets you can get down to. Does take some time to navigate though, and not all have data that can be downloaded.
http://www.uscampgrounds.info/takeit.html - If you want to play with Geographic data, this is a great site; they’ve done a lot of work compiling Campgrounds in North America, and if you are an avid camper, you can easily import the data into your GPS.
https://data.nasa.gov - Another site I used data from for a Free Data Friday (on meteorites!) – a lot of the datasets mean nothing to me, but every so often I find a new one that’s really cool. I’ve also found some pretty cool images, which make great wallpaper!
https://data.gov.uk/data/search - Jumping outside of North America, the UK has done a phenomenal job at compiling a wide range of data sources, in a multitude of formats.
http://opendataforafrica.org - African datasets, including Port statistics, poverty, oil production, and a huge assortment of other topics. The main page has links to different countries that also offer open data, so if you’re looking for a specific region chances are good there will be something useful.
http://www.opensourcesports.com - Hockey, baseball, football, soccer - even cycling, rugby, and swimming! This website has a ton of data on a ton of sports.
http://www.seanlahman.com/baseball-archive/statistics/ - This database is *huge* and allows for exploration into topics like joining multiple tables, not-so-big big data, and time series analysis. I absolutely love this dataset, and use it as often as I can.
07-07-2016 07:36 PM
Thought the community may like to know about MIT's new visualization tool - a goldmine for data nerds!


@MichelleHomes well there goes my weekend - thanks
07-07-2016 07:47 PM
I knew you'd appreciate it @DarthPathos! Have a fun analytical datavizzy weekend.

@MichelleHomes between you and @BeverlyBrown you two are keeping me very happy!! 8-)
07-08-2016 06:59 AM
"Analytical datavizzy weekend." Epic! @MichelleHomes and @DarthPathos, your comments made me notice the other data sources @DarthPathos posted in this thread in April. Wow, thanks! @Damien_Mather: Did you see that?
11-11-2016 08:22 AM
Yes, thinking of you @DarthPathos and your weekend again

@MichelleHomes LOL, I already saw your tweet and already sent it to myself. Thanks for sharing, and I hope you have a great weekend!
11-11-2016 08:38 AM
@MichelleHomes I appreciate your latest contribution, which benefits anyone looking for practice data and is guaranteed to distract @DarthPathos for hours!


@BeverlyBrown @MichelleHomes Hahah - distracting me is definitely not a pro. SQUIRREL. Oh chocolate!! Oh hey something's shiny over there
11-11-2016 08:46 AM
The joys of sharing with the awesome side effect of keeping @DarthPathos occupied! LOL. Have a wonderful weekend guys.

More great teaching data links, thanks team!
I'm loath to post anything negative about free data, especially when we get it for free, but I'd like to reflect some comments from my advanced business analytics class just finished this semester back to the community:
Much of the Kaggle data seemed to them so heavily anonymized as to be unusable for many of their learning and research opportunities.
Maybe this is due in part to the marketing (and therefore customer) orientation of my course, but it seems to me that some datasets are 'overanonymised', if that is indeed a word.
I am keenly aware that usability, as a function of anonymity, for public data is an 'ideal point' (inverted parabolic, or inverted 'U') relationship: with too little anonymity we would have much less data available, and with too much the data stops being useful. I am also aware that different domains generally have different views on what constitutes sufficient anonymity, and different protocols to ensure adequate privacy.
Does anyone have a feel for characterizing some of the other teaching data sources on an anonymity scale?


Data privacy is a topic that fascinates me (I work in a hospital, so it's critical for what I do). I would love to discuss this and provide my thoughts, but I'm exhausted and need to get some work done. Please post back and we'll keep chatting.
Have a great evening

OK, you asked for it..
Anonymity in New Zealand and Australia, by type of data:
Primary research data:
As a university researcher I always promise to store securely, limit access to a small group of named researchers, analyse and summarise so that individual participants cannot be identified, then eventually destroy after a fixed number of years, all the primary survey and qualitative data I collect. As far as I know, this is mandatory if I want to get ethical approval for my research from my university, and ethical approval for primary research data gathering is in turn mandatory. So nobody outside the prior-nominated researcher group gets to see the data, anonymised or otherwise. Sometimes, to remove potential researcher bias, the data is also anonymised for some researchers.
Secondary internal data:
There are commercial customer privacy laws that prevent firms from collecting and retaining customer data unless they can demonstrate that it is part of their legitimate business activity, where legitimate business activity is understood to include any analysis that is designed to generate improvements in customer value and experience. Generally, improvements in customer experience and value can in turn be tied to improvements in business efficiency and profitability. Analysing internal secondary data for anti-competitive and market-dominance-maintaining reasons is explicitly prohibited on penalty of severe fines, as is lax data firewall and handling security. For this type of data to be made publicly available for general learning, an Australasian firm would have to assume responsibility for the effectiveness of any anonymising process, to ensure its customers cannot ever be identified.
So I can sort of understand why corporate legal counsel insist on data anonymising protocols that err on the conservative side, i.e. go over the top, before any is released into the wild.
Secondary external data:
This is where it gets a little tricky for me.
Data and metadata that is largely or primarily machine-generated, and does not explicitly identify individual users, like Google page stats for selected government public websites, should be OK to analyse and reproduce unless explicitly protected by an agreed protocol, not that any come to mind apart from Creative Commons variations. However, I'm never entirely sure that someone smarter than me couldn't identify users somehow.
Any data in the public domain generated by someone typing something, like I am doing right now, is generally protected by author copyright, which means, at a minimum, any explicit reproduction should acknowledge authorship during the period of copyright, and, as in the case of Creative Commons, other protocols should be adhered to. Reproduction and explicit CC or other protocols aside, the data and metadata should be fairly available for analysis, which generally summarises the heck out of what is generally big corpora and structured data and metadata. But when we type, do we ever stop and think how someone might be openly and honestly attempting to analyse what we write, either specifically identifying us, or in bulk with others' creative output? As an academic, I do**, but how widespread is that consciousness? If writers are generally not aware, or limit their awareness to specific domains, how do we, as analysts, ensure their moral rights are fairly protected?
Who analyses the analysts?
If you were tired before reading this, you'll be asleep by now.
** I watch my own bibliographic metadata like a hawk to see who is citing me and to what extent!

Wow, @Damien_Mather that was impressive - and a great way for me to spend my Friday night. I'm sitting here watching Supernatural, eating chocolate ice cream, reading SAS stuff and waiting for my wife to get home ;-).
For me, I have worked for Canada's major telecom company, 2 hospitals (1 currently) and the Ministry of Health in Ontario. In every job, I have had full access to significantly sensitive material - from customer addresses, phone bills, and so on to full medical histories (including social work and psychiatric notes). I have taken great pride in my job, and always approach my position by asking: if my data, or that of my parents, were in the database, how would I want it handled? I am highly involved in research at the hospital, and have implemented policies and audits to ensure our data is as protected as possible.
Deidentified data is absolutely critical - but so is data with full PHI, when it's handled appropriately. One of my favourite types of analyses is geospatial cluster analysis - I always manage to find some truly interesting findings when I do this, and the only way I can do it is with postal codes. Bringing age, income etc. into the equation obviously increases the level of identifiable information but could potentially increase the value of the findings exponentially. When I send out reports, I use screen shots to ensure no identifiable information is sent. When I worked at the first hospital, we had a breach because someone sent out a number of pivot tables in Excel without realising that the PHI was still accessible (Excel has hidden sheets with the data, which enables the pivot tables to be manipulated).
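To make that trade-off a bit more concrete, here is a minimal sketch in Python. It is not how any of the organisations mentioned above actually handle data: the records are made up, the field names `postal_code` and `age` and the helper names are hypothetical, and the "smallest group size" measure (often called k-anonymity) is my own choice of yardstick, not something established in this thread. It simply counts how many records share the same combination of identifying fields, before and after coarsening the postal code to its first three characters and the age to a ten-year band.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share
    the same values for all of the given quasi-identifier fields."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

def coarsen(record):
    """Hypothetical coarsening step: keep only the first three characters
    of the postal code and replace the age with a ten-year band."""
    out = dict(record)
    out["postal_code"] = record["postal_code"][:3]
    decade = record["age"] // 10 * 10
    out["age"] = f"{decade}-{decade + 9}"
    return out

# Made-up records, for illustration only.
records = [
    {"postal_code": "M5V 2T6", "age": 34},
    {"postal_code": "M5V 1J1", "age": 37},
    {"postal_code": "M4C 1A1", "age": 52},
    {"postal_code": "M4C 3B2", "age": 58},
]
quasi = ["postal_code", "age"]

raw_k = k_anonymity(records, quasi)
coarse_k = k_anonymity([coarsen(r) for r in records], quasi)
print(raw_k, coarse_k)  # 1 2
```

With the full postal code and exact age every record is unique (smallest group of 1), while the coarsened version puts each person in a group of 2. The price is that the geography is now too coarse for fine-grained cluster work, which is exactly the tension between usefulness and anonymity.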
Ok - enough rambling from me. All this to say that I have felt for a long time that anyone working with data should be held to a standard (such as the American Statistical Association) where ethics and integrity etc. are clearly outlined, and methods are detailed for handling situations that arise.

Thanks for the food for thought - and here I was thinking I was going to sleep tonight